---
title: Arabic Tokenizer Arena
emoji: 🏟️
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: "5.9.1"
app_file: app.py
pinned: false
---

# 🏟️ Arabic Tokenizer Arena Pro

Advanced research & production platform for Arabic tokenization analysis.

## Features

- 📊 **Comprehensive Metrics**: Fertility, compression, STRR, OOV rate, and more
- 🌍 **Arabic-Specific Analysis**: Dialect support, diacritic preservation
- ⚖️ **Side-by-Side Comparison**: Compare multiple tokenizers instantly
- 🎨 **Beautiful Visualization**: Token-by-token display with IDs
- 🏆 **Leaderboard**: Evaluate on real HuggingFace Arabic datasets
- 📖 **Multi-Variant Support**: MSA, dialectal, and Classical Arabic

## Project Structure

```
arabic_tokenizer_arena/
├── app.py                # Main Gradio application
├── config.py             # Tokenizer registry & dataset configs
├── tokenizer_manager.py  # Tokenizer loading & caching
├── analysis.py           # Tokenization analysis functions
├── leaderboard.py        # Leaderboard with HF datasets
├── ui_components.py      # HTML generation
├── styles.py             # CSS styles
├── utils.py              # Arabic text utilities
├── requirements.txt      # Dependencies
└── README.md             # This file
```

## Installation

```bash
pip install -r requirements.txt
```

## Usage

### Local Development

```bash
python app.py
```

### HuggingFace Spaces

1. Upload all `.py` files to your Space
2. Add `HF_TOKEN` secret if using gated models
3. The app will start automatically

## Available Tokenizers

### Arabic BERT Models

- AraBERT v2 (AUB MIND Lab)
- CAMeLBERT Mix/MSA/DA/CA (CAMeL Lab)
- MARBERT & ARBERT (UBC NLP)

### Arabic LLMs

- Jais 13B/30B (Inception/MBZUAI)
- SILMA 9B (SILMA AI)
- Fanar 9B (QCRI)
- Yehia 7B (Navid AI)
- Atlas-Chat (MBZUAI Paris)

### Arabic Tokenizers

- Aranizer PBE/SP 32K/86K (RIOTU Lab)

### Multilingual Models

- Qwen 2.5 (Alibaba)
- Gemma 2 (Google)
- Mistral (Mistral AI)
- XLM-RoBERTa (Meta)

## Leaderboard Datasets

| Dataset | Source | Category |
|---------|--------|----------|
| ArabicMMLU | MBZUAI | MSA Benchmark |
| ArSenTD-LEV | ramybaly | Levantine Dialect |
| ATHAR | mohamed-khalil | Classical Arabic |
| ARCD | arcd | QA Dataset |
| Ashaar | arbml | Poetry |
| Hadith | gurgutan | Religious |
| Arabic Sentiment | arbml | Social Media |
| SANAD | arbml | News |

## Metrics

- **Fertility**: Tokens per word (lower = better, 1.0 ideal)
- **Compression**: Bytes per token (higher = better)
- **STRR**: Single Token Retention Rate (higher = better)
- **OOV Rate**: Out-of-vocabulary percentage (lower = better)

## License

MIT License

## Contributing

Contributions welcome! Please open an issue or PR.

---

Built with ❤️ for the Arabic NLP community
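
To make the Metrics definitions concrete, here is a minimal sketch of how fertility, compression, and STRR could be computed. It is not the code in `analysis.py`; the function names and `toy_tokenize` are hypothetical stand-ins, words are assumed to be whitespace-delimited, and STRR is assumed to mean the share of words encoded as exactly one token.

```python
from typing import Callable, List

# Any function that maps a string to a list of token strings.
Tokenize = Callable[[str], List[str]]


def fertility(text: str, tokenize: Tokenize) -> float:
    """Tokens per whitespace-delimited word (lower is better, 1.0 ideal)."""
    words = text.split()
    return len(tokenize(text)) / len(words) if words else 0.0


def compression(text: str, tokenize: Tokenize) -> float:
    """UTF-8 bytes per token (higher is better)."""
    tokens = tokenize(text)
    return len(text.encode("utf-8")) / len(tokens) if tokens else 0.0


def strr(text: str, tokenize: Tokenize) -> float:
    """Assumed definition: fraction of words kept as a single token."""
    words = text.split()
    if not words:
        return 0.0
    single = sum(1 for word in words if len(tokenize(word)) == 1)
    return single / len(words)


def toy_tokenize(text: str, max_len: int = 3) -> List[str]:
    """Hypothetical tokenizer: cut each word into chunks of at most max_len characters."""
    return [
        word[i:i + max_len]
        for word in text.split()
        for i in range(0, len(word), max_len)
    ]


sample = "السلام عليكم من"
print("fertility:", fertility(sample, toy_tokenize))
print("compression:", compression(sample, toy_tokenize))
print("STRR:", strr(sample, toy_tokenize))
```

To try the same functions with a real tokenizer, pass the `tokenize` method of a tokenizer loaded through `transformers.AutoTokenizer.from_pretrained` in place of `toy_tokenize`. Results will depend on how words are split, so treat the whitespace rule here as one possible choice.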