---
license: apache-2.0
tags:
- automatic-speech-recognition
- audio
- speech
- whisper
- multilingual
model-index:
- name: Jivi-AudioX-North
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Vistaar Benchmark Hindi
      type: vistaar
      config: hindi
      split: test
    metrics:
    - name: WER
      type: wer
      value: 12.14
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Vistaar Benchmark Gujarati
      type: vistaar
      config: gujarati
      split: test
    metrics:
    - name: WER
      type: wer
      value: 18.66
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Vistaar Benchmark Marathi
      type: vistaar
      config: marathi
      split: test
    metrics:
    - name: WER
      type: wer
      value: 18.68
language:
- hi
- gu
- mr
pipeline_tag: automatic-speech-recognition
---
# AudioX: Multilingual Speech-to-Text Model
AudioX is a state-of-the-art Indic multilingual automatic speech recognition (ASR) model family developed by Jivi AI. It comprises two specialized variants, AudioX-North and AudioX-South, each optimized for a distinct set of Indian languages for higher accuracy. AudioX-North supports **Hindi**, **Gujarati**, and **Marathi**, while AudioX-South covers **Tamil**, **Telugu**, **Kannada**, and **Malayalam**. Trained on a combination of open-source ASR datasets and proprietary audio, the AudioX models offer robust transcription across accents and acoustic conditions, delivering industry-leading performance in the supported languages.
<img src="https://d3axayv063q8rp.cloudfront.net/hf_resources/audiox.png" alt="AudioX" width="600" height="600">
## Purpose-Built for Indian Languages
AudioX is designed to handle diverse Indian language inputs, supporting real-world applications such as voice assistants, transcription tools, customer service automation, and multilingual content creation. It provides high accuracy across regional accents and varying audio quality.
## Training Process
AudioX is fine-tuned using **supervised learning** on top of an open-source speech recognition backbone. The training pipeline incorporates domain adaptation, language balancing, and noise augmentation for robustness across real-world scenarios.
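For illustration, the noise-augmentation step can be as simple as mixing noise into training waveforms at a controlled signal-to-noise ratio. The sketch below is a minimal example; the Gaussian noise source and default SNR are assumptions for the example, not the actual training recipe:
```python
import numpy as np

def add_noise(waveform: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
    """Mix Gaussian noise into a waveform at a target signal-to-noise ratio (dB).

    Illustrative only: the noise distribution and SNR value are assumptions.
    """
    signal_power = np.mean(waveform ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=waveform.shape)
    return waveform + noise

# Example: augment a 16 kHz clip at a random SNR between 10 and 30 dB
# augmented = add_noise(audio_np, snr_db=np.random.uniform(10, 30))
```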
## Data Preparation
The model is trained on:
- **Open-source multilingual ASR corpora**
- **Proprietary Indian language medical datasets**

This hybrid approach boosts the model's generalization across dialects and acoustic conditions.
## Benchmarks
AudioX achieves top performance across multiple Indian languages, outperforming both open and commercial ASR models.
We evaluated AudioX on the [Vistaar Benchmark](https://github.com/AI4Bharat/vistaar/tree/master?tab=readme-ov-file) using the official evaluation script provided by AI4Bharat’s Vistaar suite, ensuring rigorous, standardized comparison across diverse language scenarios.
| Provider | Model | Hindi | Gujarati | Marathi | Tamil | Telugu | Kannada | Malayalam | Avg WER |
|--------------|-------------------|--------|----------|---------|-------|--------|---------|------------|----------|
| **Jivi AI** | **AudioX** | **12.14** | 18.66 | 18.68 | **21.79** | **24.63** | **17.61** | **26.92** | **20.1** |
| ElevenLabs | Scribe-v1 | 13.64 | **17.96** | **16.51** | 24.84 | 24.89 | 17.65 | 28.88 | 20.6 |
| Sarvam | saarika:v2 | 14.28 | 19.47 | 18.34 | 25.73 | 26.80 | 18.95 | 32.64 | 22.3 |
| AI4Bharat | IndicWhisper | 13.59 | 22.84 | 18.25 | 25.27 | 28.82 | 18.33 | 32.34 | 22.8 |
| Microsoft | Azure STT | 20.03 | 31.62 | 27.36 | 31.53 | 31.38 | 26.45 | 41.84 | 30.0 |
| OpenAI | gpt-4o-transcribe | 18.65 | 31.32 | 25.21 | 39.10 | 33.94 | 32.88 | 46.11 | 32.5 |
| Google | Google STT | 23.89 | 36.48 | 26.48 | 33.62 | 42.42 | 31.48 | 47.90 | 34.6 |
| OpenAI | Whisper Large v3 | 32.00 | 53.75 | 78.28 | 52.44 | 179.58| 67.02 | 142.98 | 86.6 |
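All figures above are word error rates (WER, %; lower is better), with the per-model average in the last column. As a quick way to sanity-check transcripts outside the benchmark, WER can be computed with the open-source `jiwer` package; this is a minimal sketch, not the official Vistaar evaluation script, and the strings below are placeholders:
```python
# Minimal WER computation with jiwer (pip install jiwer).
# The reference and hypothesis are placeholder strings, not benchmark data.
import jiwer

reference = "यह एक परीक्षण वाक्य है"
hypothesis = "यह एक परीक्षा वाक्य है"

wer = jiwer.wer(reference, hypothesis)  # word-level edits / reference word count
print(f"WER: {wer:.2%}")
```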
## 🔧 Try This Model
You can easily run inference using the 🤗 `transformers` and `librosa` libraries. Here's a minimal example to get started:
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa
import torch

# Load model and processor
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = WhisperProcessor.from_pretrained("jiviai/audioX-north-v1")
model = WhisperForConditionalGeneration.from_pretrained("jiviai/audioX-north-v1").to(device)
model.config.forced_decoder_ids = None

# Load audio and resample to the 16 kHz rate the model expects
audio_path = "sample.wav"
audio_np, sr = librosa.load(audio_path, sr=None)
if sr != 16000:
    audio_np = librosa.resample(audio_np, orig_sr=sr, target_sr=16000)
input_features = processor(audio_np, sampling_rate=16000, return_tensors="pt").input_features.to(device)

# Generate predictions
# Use ISO 639-1 language codes: "hi", "mr", "gu" for North; "ta", "te", "kn", "ml" for South
# Or omit the language argument for automatic language detection
predicted_ids = model.generate(input_features, task="transcribe", language="hi")

# Decode predictions
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
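Whisper-style models transcribe audio in 30-second windows. For longer recordings, the 🤗 `pipeline` API can chunk the file automatically; the sketch below is a minimal example, with illustrative chunking parameters and a placeholder file name:
```python
from transformers import pipeline

# Long-form transcription via automatic chunking. chunk_length_s and device
# are illustrative; "long_sample.wav" is a placeholder path.
asr = pipeline(
    "automatic-speech-recognition",
    model="jiviai/audioX-north-v1",
    chunk_length_s=30,
    device=0,  # set to -1 (or omit) to run on CPU
)
result = asr("long_sample.wav", generate_kwargs={"task": "transcribe", "language": "hi"})
print(result["text"])
```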