---
license: apache-2.0
tags:
- automatic-speech-recognition
- audio
- speech
- whisper
- multilingual
model-index:
- name: Jivi-AudioX-North
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Vistaar Benchmark Hindi
      type: vistaar
      config: hindi
      split: test
    metrics:
    - name: WER
      type: wer
      value: 12.14
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Vistaar Benchmark Gujarati
      type: vistaar
      config: gujarati
      split: test
    metrics:
    - name: WER
      type: wer
      value: 18.66
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Vistaar Benchmark Marathi
      type: vistaar
      config: marathi
      split: test
    metrics:
    - name: WER
      type: wer
      value: 18.68
language:
- hi
- gu
- mr
pipeline_tag: automatic-speech-recognition
---
|
|
# AudioX: Multilingual Speech-to-Text Model |
|
|
|
|
|
AudioX is a state-of-the-art Indic multilingual automatic speech recognition (ASR) model family developed by Jivi AI. It comprises two specialized variants, AudioX-North and AudioX-South, each optimized for a distinct set of Indian languages for higher accuracy. AudioX-North supports **Hindi**, **Gujarati**, and **Marathi**, while AudioX-South covers **Tamil**, **Telugu**, **Kannada**, and **Malayalam**. Trained on a combination of open-source ASR datasets and proprietary audio, the AudioX models deliver robust, industry-leading transcription across accents and acoustic conditions in all supported languages.
|
|
<img src="https://d3axayv063q8rp.cloudfront.net/hf_resources/audiox.png" alt="AudioX" width="600" height="600"> |
|
|
|
|
|
## Purpose-Built for Indian Languages: |
|
|
AudioX is designed to handle diverse Indian-language input, supporting real-world applications such as voice assistants, transcription tools, customer-service automation, and multilingual content creation. It maintains high accuracy across regional accents and varying audio quality.
|
|
|
|
|
## Training Process: |
|
|
AudioX is fine-tuned using **supervised learning** on top of an open-source speech recognition backbone. The training pipeline incorporates domain adaptation, language balancing, and noise augmentation for robustness across real-world scenarios. |
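The exact augmentation recipe is not published. As a rough illustration of the noise-augmentation idea, the sketch below mixes a noise clip into a speech waveform at a target signal-to-noise ratio; the `add_noise_at_snr` helper and the SNR range are hypothetical, not Jivi AI's actual training code.

```python
import numpy as np

def add_noise_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `noise` into `speech` at a target signal-to-noise ratio in dB.

    Illustrative helper only; the real AudioX augmentation pipeline is not public.
    """
    # Tile or trim the noise clip so it matches the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    speech_power = np.mean(speech**2)
    noise_power = np.mean(noise**2) + 1e-10  # avoid division by zero on silent clips
    # Scale the noise so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Typical usage: draw a random SNR per example, e.g. between 5 and 20 dB.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000).astype(np.float32)  # stand-in for a 1 s clip at 16 kHz
noise = rng.standard_normal(8000).astype(np.float32)
augmented = add_noise_at_snr(speech, noise, snr_db=rng.uniform(5, 20))
```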
|
|
|
|
|
## Data Preparation: |
|
|
The model is trained on: |
|
|
- **Open-source multilingual ASR corpora** |
|
|
- **Proprietary Indian-language medical audio datasets**
|
|
|
|
|
This hybrid approach boosts the model’s generalization across dialects and acoustic conditions. |
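As an illustration of what such mixing can look like with the 🤗 `datasets` library, `interleave_datasets` can sample languages at fixed probabilities so that no single one dominates training. The dataset IDs and probabilities below are placeholders, not the actual training corpora:

```python
from datasets import load_dataset, interleave_datasets

# Placeholder dataset IDs; the real open-source and proprietary corpora are not listed here.
hi = load_dataset("org/hindi-asr-corpus", split="train", streaming=True)
gu = load_dataset("org/gujarati-asr-corpus", split="train", streaming=True)
mr = load_dataset("org/marathi-asr-corpus", split="train", streaming=True)

# Draw each example from a language with fixed probability to balance the mix.
balanced = interleave_datasets([hi, gu, mr], probabilities=[0.4, 0.3, 0.3], seed=42)
```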
|
|
|
|
|
## Benchmarks: |
|
|
AudioX achieves top performance across multiple Indian languages, outperforming both open-source and commercial ASR models.


We evaluated AudioX on the [Vistaar Benchmark](https://github.com/AI4Bharat/vistaar/tree/master?tab=readme-ov-file) using the official evaluation script from AI4Bharat's Vistaar suite, ensuring a standardized comparison across all seven supported languages.
|
|
|
|
|
| Provider | Model | Hindi | Gujarati | Marathi | Tamil | Telugu | Kannada | Malayalam | Avg WER | |
|
|
|--------------|-------------------|--------|----------|---------|-------|--------|---------|------------|----------| |
|
|
| **Jivi AI** | **AudioX** | **12.14** | 18.66 | 18.68 | **21.79** | **24.63** | **17.61** | **26.92** | **20.1** | |
|
|
| ElevenLabs | Scribe-v1 | 13.64 | **17.96** | **16.51** | 24.84 | 24.89 | 17.65 | 28.88 | 20.6 | |
|
|
| Sarvam | saarika:v2 | 14.28 | 19.47 | 18.34 | 25.73 | 26.80 | 18.95 | 32.64 | 22.3 | |
|
|
| AI4Bharat | IndicWhisper | 13.59 | 22.84 | 18.25 | 25.27 | 28.82 | 18.33 | 32.34 | 22.8 | |
|
|
| Microsoft | Azure STT | 20.03 | 31.62 | 27.36 | 31.53 | 31.38 | 26.45 | 41.84 | 30.0 | |
|
|
| OpenAI | gpt-4o-transcribe | 18.65 | 31.32 | 25.21 | 39.10 | 33.94 | 32.88 | 46.11 | 32.5 | |
|
|
| Google | Google STT | 23.89 | 36.48 | 26.48 | 33.62 | 42.42 | 31.48 | 47.90 | 34.6 | |
|
|
| OpenAI | Whisper Large v3 | 32.00 | 53.75 | 78.28 | 52.44 | 179.58 | 67.02 | 142.98 | 86.6 |
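
WER (word error rate) is the ratio of word-level substitutions, insertions, and deletions to the number of reference words, so lower is better. For a quick sanity check outside the official script, a corpus-level WER can be computed with the `jiwer` package (the strings below are illustrative; the Vistaar script applies its own text normalization, so raw `jiwer` numbers may differ slightly):

```python
import jiwer

references = ["मौसम आज बहुत अच्छा है"]
hypotheses = ["मौसम आज बहुत अच्छा हैं"]

# One substitution out of five reference words -> WER = 0.2
print(jiwer.wer(references, hypotheses))
```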
|
|
|
|
|
|
|
|
## 🔧 Try This Model |
|
|
|
|
|
You can easily run inference using the 🤗 `transformers` and `librosa` libraries. Here's a minimal example to get started: |
|
|
|
|
|
```python
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load model and processor
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = WhisperProcessor.from_pretrained("jiviai/audioX-north-v1")
model = WhisperForConditionalGeneration.from_pretrained("jiviai/audioX-north-v1").to(device)
model.config.forced_decoder_ids = None

# Load the audio and resample to 16 kHz, the rate Whisper-style models expect
audio_path = "sample.wav"
audio_np, sr = librosa.load(audio_path, sr=None)
if sr != 16000:
    audio_np = librosa.resample(audio_np, orig_sr=sr, target_sr=16000)

input_features = processor(audio_np, sampling_rate=16000, return_tensors="pt").input_features.to(device)

# Generate predictions
# Use ISO 639-1 language codes: "hi", "gu", "mr" for North; "ta", "te", "kn", "ml" for South
# Or omit the language argument for automatic language detection
predicted_ids = model.generate(input_features, task="transcribe", language="hi")

# Decode predictions
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
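
Alternatively, the higher-level `pipeline` API should also work with this checkpoint and handles long recordings via chunking (a minimal sketch, assuming the same model ID as above):

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="jiviai/audioX-north-v1",
    device=0,           # GPU index; use device=-1 for CPU
    chunk_length_s=30,  # split long recordings into 30 s windows
)

result = asr("sample.wav", generate_kwargs={"task": "transcribe", "language": "hi"})
print(result["text"])
```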