---
title: Arabic Tokenizer Arena
emoji: 🏟️
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: "5.9.1"
app_file: app.py
pinned: false
---

# 🏟️ Arabic Tokenizer Arena Pro

A research and production platform for analyzing and comparing how tokenizers handle Arabic text.

## Features

- 📊 **Comprehensive Metrics**: Fertility, compression, STRR, OOV rate, and more
- 🌍 **Arabic-Specific Analysis**: Dialect support, diacritic preservation
- ⚖️ **Side-by-Side Comparison**: Compare multiple tokenizers instantly
- 🎨 **Beautiful Visualization**: Token-by-token display with IDs
- 🏆 **Leaderboard**: Evaluate on real Hugging Face Arabic datasets
- 📖 **Multi-Variant Support**: MSA, dialectal, and Classical Arabic
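
One way to read "diacritic preservation" is to compare how many Arabic diacritics (harakat) survive an encode/decode round trip. The sketch below is an illustration under that assumption, not the app's actual code; `ARABIC_DIACRITICS`, `count_diacritics`, `strip_diacritics`, and `diacritic_retention` are hypothetical names.

```python
# Minimal sketch of a diacritic-preservation check (hypothetical helpers,
# not taken from analysis.py or utils.py).

# Assumption: "diacritics" means the harakat U+064B..U+0652 plus the
# superscript alef U+0670. The app may use a different set.
ARABIC_DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)} | {"\u0670"}


def count_diacritics(text: str) -> int:
    """Count Arabic diacritic code points in text."""
    return sum(1 for ch in text if ch in ARABIC_DIACRITICS)


def strip_diacritics(text: str) -> str:
    """Return text with all Arabic diacritics removed."""
    return "".join(ch for ch in text if ch not in ARABIC_DIACRITICS)


def diacritic_retention(original: str, decoded: str) -> float:
    """Fraction of the original diacritics still present after decoding.

    Returns 1.0 when the original has no diacritics (nothing to lose).
    """
    total = count_diacritics(original)
    if total == 0:
        return 1.0
    return min(count_diacritics(decoded), total) / total


# "kataba" written with three fatha marks, spelled with escapes for clarity.
vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"
bare = strip_diacritics(vocalized)
```

Here `decoded` would be whatever a tokenizer returns after encoding and decoding `original`; a tokenizer that normalizes diacritics away would score 0.0.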

## Project Structure

```
arabic_tokenizer_arena/
├── app.py                 # Main Gradio application
├── config.py              # Tokenizer registry & dataset configs
├── tokenizer_manager.py   # Tokenizer loading & caching
├── analysis.py            # Tokenization analysis functions
├── leaderboard.py         # Leaderboard with HF datasets
├── ui_components.py       # HTML generation
├── styles.py              # CSS styles
├── utils.py               # Arabic text utilities
├── requirements.txt       # Dependencies
└── README.md              # This file
```

## Installation

```bash
pip install -r requirements.txt
```

## Usage

### Local Development
```bash
python app.py
```

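### Hugging Face Spaces
1. Upload all `.py` files, `requirements.txt`, and this `README.md` to your Space (the YAML header at the top of the README configures the Space)
2. Add an `HF_TOKEN` secret in the Space settings if you use gated models
3. The app will start automatically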
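
Spaces expose secrets to the app as environment variables, so the token from step 2 can be read with `os.environ`. The following is a minimal sketch of how a loader might pick it up and cache tokenizers; `get_hf_token` and `load_tokenizer` are hypothetical names, and the real `tokenizer_manager.py` may work differently.

```python
import os
from functools import lru_cache
from typing import Optional


def get_hf_token() -> Optional[str]:
    """Return the HF_TOKEN secret, or None if it is unset or blank."""
    token = os.environ.get("HF_TOKEN", "").strip()
    return token or None


@lru_cache(maxsize=None)
def load_tokenizer(model_id: str):
    """Load a tokenizer once per model ID and reuse it afterwards.

    Assumption: a recent `transformers` release, where `from_pretrained`
    accepts a `token` argument for gated or private repositories.
    """
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_id, token=get_hf_token())
```

Passing `token=None` falls back to anonymous access, which is enough for public models.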

## Available Tokenizers

### Arabic BERT Models
- AraBERT v2 (AUB MIND Lab)
- CAMeLBERT Mix/MSA/DA/CA (CAMeL Lab)
- MARBERT & ARBERT (UBC NLP)

### Arabic LLMs
- Jais 13B/30B (Inception/MBZUAI)
- SILMA 9B (SILMA AI)
- Fanar 9B (QCRI)
- Yehia 7B (Navid AI)
- Atlas-Chat (MBZUAI Paris)

### Arabic Tokenizers
- Aranizer PBE/SP 32K/86K (RIOTU Lab)

### Multilingual Models
- Qwen 2.5 (Alibaba)
- Gemma 2 (Google)
- Mistral (Mistral AI)
- XLM-RoBERTa (Meta)

## Leaderboard Datasets

| Dataset | Source | Category |
|---------|--------|----------|
| ArabicMMLU | MBZUAI | MSA Benchmark |
| ArSenTD-LEV | ramybaly | Levantine Dialect |
| ATHAR | mohamed-khalil | Classical Arabic |
| ARCD | arcd | QA Dataset |
| Ashaar | arbml | Poetry |
| Hadith | gurgutan | Religious |
| Arabic Sentiment | arbml | Social Media |
| SANAD | arbml | News |

## Metrics

- **Fertility**: Tokens per word (lower = better, 1.0 ideal)
- **Compression**: Bytes per token (higher = better)
- **STRR**: Single Token Retention Rate, the share of words encoded as a single token (higher = better)
- **OOV Rate**: Out-of-vocabulary percentage (lower = better)
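
The definitions above can be sketched in a few lines. This is an illustrative version under stated assumptions, not the code in `analysis.py`: words are whitespace-separated, each word is tokenized on its own, compression counts the UTF-8 bytes of the words only, and OOV is the share of tokens equal to the unknown token. `tokenization_metrics`, `toy_tokenize`, and `TOY_VOCAB` are hypothetical names.

```python
from typing import Callable, Dict, List


def tokenization_metrics(
    text: str,
    tokenize: Callable[[str], List[str]],
    unk_token: str = "[UNK]",
) -> Dict[str, float]:
    """Compute fertility, compression, STRR, and OOV rate for one text."""
    words = text.split()
    if not words:
        raise ValueError("text must contain at least one word")

    per_word = [tokenize(word) for word in words]
    tokens = [tok for toks in per_word for tok in toks]
    if not tokens:
        raise ValueError("tokenizer produced no tokens")

    n_bytes = sum(len(word.encode("utf-8")) for word in words)
    return {
        "fertility": len(tokens) / len(words),          # tokens per word
        "compression": n_bytes / len(tokens),           # bytes per token
        "strr": sum(len(t) == 1 for t in per_word) / len(words),
        "oov_rate": sum(t == unk_token for t in tokens) / len(tokens),
    }


# Toy stand-in for a real tokenizer: known words stay whole, other Arabic
# words fall back to characters, anything else becomes the unknown token.
TOY_VOCAB = {"كتاب", "في"}


def toy_tokenize(word: str) -> List[str]:
    if word in TOY_VOCAB:
        return [word]
    if all("\u0600" <= ch <= "\u06FF" for ch in word):
        return list(word)
    return ["[UNK]"]


arabic_only = tokenization_metrics("كتاب في المكتبة", toy_tokenize)
mixed = tokenization_metrics("كتاب GPT", toy_tokenize)
```

With the toy tokenizer, two of the three words in the first sentence stay whole while the third splits into seven characters, giving a fertility of 3.0 and an STRR of 2/3. In the second sentence the Latin word maps to `[UNK]`, so half of the tokens are out of vocabulary.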

## License

MIT License

## Contributing

Contributions welcome! Please open an issue or PR.

---

Built with ❤️ for the Arabic NLP community