---
license: apache-2.0
tags:
- automatic-speech-recognition
- audio
- speech
- whisper
- multilingual
model-index:
- name: Jivi-AudioX-North
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Vistaar Benchmark Hindi
      type: vistaar
      config: hindi
      split: test
    metrics:
    - name: WER
      type: wer
      value: 12.14
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Vistaar Benchmark Gujarati
      type: vistaar
      config: gujarati
      split: test
    metrics:
    - name: WER
      type: wer
      value: 18.66
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Vistaar Benchmark Marathi
      type: vistaar
      config: marathi
      split: test
    metrics:
    - name: WER
      type: wer
      value: 18.68
language:
- hi
- gu
- mr
pipeline_tag: automatic-speech-recognition
---
|
|
# AudioX: Multilingual Speech-to-Text Model |
|
|
|
|
|
AudioX is a state-of-the-art Indic multilingual automatic speech recognition (ASR) model family developed by Jivi AI. It comprises two specialized variants, AudioX-North and AudioX-South, each optimized for a distinct set of Indian languages for higher accuracy. AudioX-North supports **Hindi**, **Gujarati**, and **Marathi**, while AudioX-South covers **Tamil**, **Telugu**, **Kannada**, and **Malayalam**. Trained on a combination of open-source ASR datasets and proprietary audio, the AudioX models deliver robust, industry-leading transcription across accents and acoustic conditions in all supported languages.
|
|
<img src="https://d3axayv063q8rp.cloudfront.net/hf_resources/audiox.png" alt="AudioX" width="600" height="600"> |
|
|
|
|
|
## Purpose-Built for Indian Languages: |
|
|
AudioX is designed to handle diverse Indian-language input, supporting real-world applications such as voice assistants, transcription tools, customer-service automation, and multilingual content creation. It maintains high accuracy across regional accents and varying audio quality.
|
|
|
|
|
## Training Process: |
|
|
AudioX is fine-tuned using **supervised learning** on top of an open-source speech recognition backbone. The training pipeline incorporates domain adaptation, language balancing, and noise augmentation for robustness across real-world scenarios. |
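The exact augmentation recipe is not published. As a rough illustration of the noise-augmentation idea, the sketch below mixes a noise clip into a speech waveform at a target signal-to-noise ratio; the `add_noise_at_snr` helper and the SNR range are hypothetical, not Jivi AI's actual training code.

```python
import numpy as np

def add_noise_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `noise` into `speech` at a target signal-to-noise ratio in dB.

    Illustrative helper only; the real AudioX augmentation pipeline is not public.
    """
    # Tile or trim the noise clip so it matches the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    speech_power = np.mean(speech**2)
    noise_power = np.mean(noise**2) + 1e-10  # avoid division by zero on silent clips
    # Scale the noise so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Typical usage: draw a random SNR per example, e.g. between 5 and 20 dB.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000).astype(np.float32)  # stand-in for a 1 s clip at 16 kHz
noise = rng.standard_normal(8000).astype(np.float32)
augmented = add_noise_at_snr(speech, noise, snr_db=rng.uniform(5, 20))
```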
|
|
|
|
|
## Data Preparation: |
|
|
The model is trained on: |
|
|
- **Open-source multilingual ASR corpora** |
|
|
- **Proprietary Indian-language medical audio datasets**
|
|
|
|
|
This hybrid approach boosts the model’s generalization across dialects and acoustic conditions. |
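As an illustration of what such mixing can look like with the 🤗 `datasets` library, `interleave_datasets` can sample languages at fixed probabilities so that no single one dominates training. The dataset IDs and probabilities below are placeholders, not the actual training corpora:

```python
from datasets import load_dataset, interleave_datasets

# Placeholder dataset IDs; the real open-source and proprietary corpora are not listed here.
hi = load_dataset("org/hindi-asr-corpus", split="train", streaming=True)
gu = load_dataset("org/gujarati-asr-corpus", split="train", streaming=True)
mr = load_dataset("org/marathi-asr-corpus", split="train", streaming=True)

# Draw each example from a language with fixed probability to balance the mix.
balanced = interleave_datasets([hi, gu, mr], probabilities=[0.4, 0.3, 0.3], seed=42)
```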
|
|
|
|
|
## Benchmarks: |
|
|
AudioX achieves top performance across multiple Indian languages, outperforming both open-source and commercial ASR models.


We evaluated AudioX on the [Vistaar Benchmark](https://github.com/AI4Bharat/vistaar/tree/master?tab=readme-ov-file) using the official evaluation script from AI4Bharat's Vistaar suite, ensuring a standardized comparison across all seven supported languages.
|
|
|
|
|
| Provider | Model | Hindi | Gujarati | Marathi | Tamil | Telugu | Kannada | Malayalam | Avg WER | |
|
|
|--------------|-------------------|--------|----------|---------|-------|--------|---------|------------|----------| |
|
|
| **Jivi AI** | **AudioX** | **12.14** | 18.66 | 18.68 | **21.79** | **24.63** | **17.61** | **26.92** | **20.1** | |
|
|
| ElevenLabs | Scribe-v1 | 13.64 | **17.96** | **16.51** | 24.84 | 24.89 | 17.65 | 28.88 | 20.6 | |
|
|
| Sarvam | saarika:v2 | 14.28 | 19.47 | 18.34 | 25.73 | 26.80 | 18.95 | 32.64 | 22.3 | |
|
|
| AI4Bharat | IndicWhisper | 13.59 | 22.84 | 18.25 | 25.27 | 28.82 | 18.33 | 32.34 | 22.8 | |
|
|
| Microsoft | Azure STT | 20.03 | 31.62 | 27.36 | 31.53 | 31.38 | 26.45 | 41.84 | 30.0 | |
|
|
| OpenAI | gpt-4o-transcribe | 18.65 | 31.32 | 25.21 | 39.10 | 33.94 | 32.88 | 46.11 | 32.5 | |
|
|
| Google | Google STT | 23.89 | 36.48 | 26.48 | 33.62 | 42.42 | 31.48 | 47.90 | 34.6 | |
|
|
| OpenAI | Whisper Large v3 | 32.00 | 53.75 | 78.28 | 52.44 | 179.58 | 67.02 | 142.98 | 86.6 |
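
WER (word error rate) is the ratio of word-level substitutions, insertions, and deletions to the number of reference words, so lower is better. For a quick sanity check outside the official script, a corpus-level WER can be computed with the `jiwer` package (the strings below are illustrative; the Vistaar script applies its own text normalization, so raw `jiwer` numbers may differ slightly):

```python
import jiwer

references = ["मौसम आज बहुत अच्छा है"]
hypotheses = ["मौसम आज बहुत अच्छा हैं"]

# One substitution out of five reference words -> WER = 0.2
print(jiwer.wer(references, hypotheses))
```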
|
|
|
|
|
|
|
|
## 🔧 Try This Model |
|
|
|
|
|
You can easily run inference using the 🤗 `transformers` and `librosa` libraries. Here's a minimal example to get started: |
|
|
|
|
|
```python
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load model and processor
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = WhisperProcessor.from_pretrained("jiviai/audioX-north-v1")
model = WhisperForConditionalGeneration.from_pretrained("jiviai/audioX-north-v1").to(device)
model.config.forced_decoder_ids = None

# Load the audio and resample to 16 kHz, the rate Whisper-style models expect
audio_path = "sample.wav"
audio_np, sr = librosa.load(audio_path, sr=None)
if sr != 16000:
    audio_np = librosa.resample(audio_np, orig_sr=sr, target_sr=16000)

input_features = processor(audio_np, sampling_rate=16000, return_tensors="pt").input_features.to(device)

# Generate predictions
# Use ISO 639-1 language codes: "hi", "gu", "mr" for North; "ta", "te", "kn", "ml" for South
# Or omit the language argument for automatic language detection
predicted_ids = model.generate(input_features, task="transcribe", language="hi")

# Decode predictions
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
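
Alternatively, the higher-level `pipeline` API should also work with this checkpoint and handles long recordings via chunking (a minimal sketch, assuming the same model ID as above):

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="jiviai/audioX-north-v1",
    device=0,           # GPU index; use device=-1 for CPU
    chunk_length_s=30,  # split long recordings into 30 s windows
)

result = asr("sample.wav", generate_kwargs={"task": "transcribe", "language": "hi"})
print(result["text"])
```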