HeshamHaroon and Claude committed
Commit f32d4c7 · 1 Parent(s): f2a2081

Refactor: modularize codebase into separate modules


- Split monolithic app.py into logical modules (see the import sketch below):
  - config.py: tokenizer registry, datasets, sample texts
  - tokenizer_manager.py: tokenizer loading and caching
  - analysis.py: tokenization analysis functions
  - leaderboard.py: HF dataset evaluation
  - utils.py: Arabic text utilities
  - styles.py: CSS styles
  - ui_components.py: HTML generation
- Add .gitignore for Python/Gradio
- Add __init__.py for package structure

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
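
A minimal sketch of how the refactored modules are meant to fit together. The imports from `analysis` and `tokenizer_manager` match code visible in this diff; the exact Gradio wiring of the new 52-line `app.py` and the `CSS` export name from `styles.py` are assumptions, not part of the visible diff:

```python
# Sketch of the slimmed-down app.py after the split (illustrative, not verbatim).
import gradio as gr

from analysis import analyze_single_tokenizer   # shown in analysis.py below
from tokenizer_manager import tokenizer_manager  # loading/caching singleton
from styles import CSS                           # assumed export name

with gr.Blocks(css=CSS) as demo:
    tok = gr.Dropdown(choices=tokenizer_manager.get_tokenizer_choices(), label="Tokenizer")
    txt = gr.Textbox(label="Arabic text", rtl=True)
    info, metrics, tokens, decoded = gr.HTML(), gr.HTML(), gr.HTML(), gr.HTML()
    # analyze_single_tokenizer returns four HTML strings, matching these outputs
    gr.Button("Analyze").click(
        analyze_single_tokenizer,
        inputs=[tok, txt],
        outputs=[info, metrics, tokens, decoded],
    )

if __name__ == "__main__":
    demo.launch()
```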

Files changed (12)
  1. .gitignore +38 -0
  2. README.md +99 -10
  3. __init__.py +8 -0
  4. analysis.py +244 -0
  5. app.py +52 -1853
  6. config.py +551 -0
  7. leaderboard.py +449 -0
  8. requirements.txt +7 -1
  9. styles.py +526 -0
  10. tokenizer_manager.py +86 -0
  11. ui_components.py +280 -0
  12. utils.py +56 -0
.gitignore ADDED
@@ -0,0 +1,38 @@
+ # Python
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ *.so
+ .Python
+ env/
+ venv/
+ .venv/
+ *.egg-info/
+ dist/
+ build/
+
+ # IDE
+ .vscode/
+ .idea/
+ *.swp
+ *.swo
+
+ # Environment
+ .env
+ .env.local
+
+ # Logs
+ *.log
+ logs/
+
+ # Cache
+ .cache/
+ *.cache
+ .gradio/
+
+ # OS
+ .DS_Store
+ Thumbs.db
+
+ # HuggingFace
+ .huggingface/
README.md CHANGED
@@ -1,12 +1,101 @@
- ---
- title: Token
- emoji: 🐠
- colorFrom: purple
- colorTo: purple
- sdk: gradio
- sdk_version: 6.0.1
- app_file: app.py
- pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+ # 🏟️ Arabic Tokenizer Arena Pro
+
+ Advanced research & production platform for Arabic tokenization analysis.
+
+ ## Features
+
+ - 📊 **Comprehensive Metrics**: Fertility, compression, STRR, OOV rate, and more
+ - 🌍 **Arabic-Specific Analysis**: Dialect support, diacritic preservation
+ - ⚖️ **Side-by-Side Comparison**: Compare multiple tokenizers instantly
+ - 🎨 **Beautiful Visualization**: Token-by-token display with IDs
+ - 🏆 **Leaderboard**: Evaluate on real HuggingFace Arabic datasets
+ - 📖 **Multi-Variant Support**: MSA, dialectal, and Classical Arabic
+
+ ## Project Structure
+
+ ```
+ arabic_tokenizer_arena/
+ ├── app.py                 # Main Gradio application
+ ├── config.py              # Tokenizer registry & dataset configs
+ ├── tokenizer_manager.py   # Tokenizer loading & caching
+ ├── analysis.py            # Tokenization analysis functions
+ ├── leaderboard.py         # Leaderboard with HF datasets
+ ├── ui_components.py       # HTML generation
+ ├── styles.py              # CSS styles
+ ├── utils.py               # Arabic text utilities
+ ├── requirements.txt       # Dependencies
+ └── README.md              # This file
+ ```
+
+ ## Installation
+
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ ## Usage
+
+ ### Local Development
+ ```bash
+ python app.py
+ ```
+
+ ### HuggingFace Spaces
+ 1. Upload all `.py` files to your Space
+ 2. Add `HF_TOKEN` secret if using gated models
+ 3. The app will start automatically
+
+ ## Available Tokenizers
+
+ ### Arabic BERT Models
+ - AraBERT v2 (AUB MIND Lab)
+ - CAMeLBERT Mix/MSA/DA/CA (CAMeL Lab)
+ - MARBERT & ARBERT (UBC NLP)
+
+ ### Arabic LLMs
+ - Jais 13B/30B (Inception/MBZUAI)
+ - SILMA 9B (SILMA AI)
+ - Fanar 9B (QCRI)
+ - Yehia 7B (Navid AI)
+ - Atlas-Chat (MBZUAI Paris)
+
+ ### Arabic Tokenizers
+ - Aranizer PBE/SP 32K/86K (RIOTU Lab)
+
+ ### Multilingual Models
+ - Qwen 2.5 (Alibaba)
+ - Gemma 2 (Google)
+ - Mistral (Mistral AI)
+ - XLM-RoBERTa (Meta)
+
+ ## Leaderboard Datasets
+
+ | Dataset | Source | Category |
+ |---------|--------|----------|
+ | ArabicMMLU | MBZUAI | MSA Benchmark |
+ | ArSenTD-LEV | ramybaly | Levantine Dialect |
+ | ATHAR | mohamed-khalil | Classical Arabic |
+ | ARCD | arcd | QA Dataset |
+ | Ashaar | arbml | Poetry |
+ | Hadith | gurgutan | Religious |
+ | Arabic Sentiment | arbml | Social Media |
+ | SANAD | arbml | News |
+
+ ## Metrics
+
+ - **Fertility**: Tokens per word (lower = better, 1.0 ideal)
+ - **Compression**: Bytes per token (higher = better)
+ - **STRR**: Single Token Retention Rate (higher = better)
+ - **OOV Rate**: Out-of-vocabulary percentage (lower = better)
+
+ ## License
+
+ MIT License
+
+ ## Contributing
+
+ Contributions welcome! Please open an issue or PR.
+
  ---

+ Built with ❤️ for the Arabic NLP community
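
To make the metric definitions above concrete, here is a small worked example. The token split is hypothetical (real output depends on the tokenizer); only the arithmetic is the point:

```python
# Hypothetical split of a 4-word MSA sentence into 6 subword tokens.
text = "ذهب الولد إلى المدرسة"
tokens = ["ذهب", "الول", "##د", "إلى", "المدرس", "##ة"]  # illustrative, not real output

words = text.split()
fertility = len(tokens) / len(words)                   # 6 / 4 = 1.5 tokens/word
compression = len(text.encode("utf-8")) / len(tokens)  # 39 bytes / 6 tokens = 6.5
strr = 2 / len(words)                                  # "ذهب" and "إلى" stay whole: 0.5
oov_rate = 0.0                                         # no [UNK] tokens in this split
```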
__init__.py ADDED
@@ -0,0 +1,8 @@
+ """
+ Arabic Tokenizer Arena Pro
+ ==========================
+ A comprehensive platform for evaluating Arabic tokenizers
+ """
+
+ __version__ = "2.0.0"
+ __author__ = "Arabic NLP Community"
analysis.py ADDED
@@ -0,0 +1,244 @@
+ """
+ Tokenization Analysis
+ =====================
+ Core analysis functions for evaluating tokenizers
+ """
+
+ import time
+ from typing import Tuple
+ from config import TokenizerInfo, TokenizationMetrics
+ from utils import count_arabic_chars, get_arabic_words, has_diacritics, is_arabic_char
+ from tokenizer_manager import tokenizer_manager
+
+
+ def analyze_tokenization(
+     text: str,
+     model_id: str,
+     tokenizer_info: TokenizerInfo
+ ) -> TokenizationMetrics:
+     """Perform comprehensive tokenization analysis"""
+
+     tokenizer = tokenizer_manager.get_tokenizer(model_id)
+
+     # Time the tokenization
+     start_time = time.perf_counter()
+     tokens = tokenizer.tokenize(text)
+     token_ids = tokenizer.encode(text, add_special_tokens=False)
+     tokenization_time = (time.perf_counter() - start_time) * 1000
+
+     decoded = tokenizer.decode(token_ids, skip_special_tokens=True)
+
+     # Basic counts
+     words = text.split()
+     total_words = len(words)
+     total_tokens = len(tokens)
+     total_characters = len(text)
+     total_bytes = len(text.encode('utf-8'))
+
+     # Efficiency metrics
+     fertility = total_tokens / max(total_words, 1)
+     compression_ratio = total_bytes / max(total_tokens, 1)
+     char_per_token = total_characters / max(total_tokens, 1)
+
+     # OOV analysis
+     unk_token = tokenizer.unk_token if hasattr(tokenizer, 'unk_token') else '[UNK]'
+     oov_count = sum(1 for t in tokens if t == unk_token or '[UNK]' in str(t))
+     oov_percentage = (oov_count / max(total_tokens, 1)) * 100
+
+     # Single Token Retention Rate (STRR)
+     single_token_words = 0
+     subwords_per_word = []
+
+     for word in words:
+         word_tokens = tokenizer.tokenize(word)
+         subwords_per_word.append(len(word_tokens))
+         if len(word_tokens) == 1:
+             single_token_words += 1
+
+     strr = single_token_words / max(total_words, 1)
+     avg_subwords = sum(subwords_per_word) / max(len(subwords_per_word), 1)
+     max_subwords = max(subwords_per_word) if subwords_per_word else 0
+     continued_ratio = (total_words - single_token_words) / max(total_words, 1)
+
+     # Arabic-specific metrics
+     arabic_char_count = count_arabic_chars(text)
+     arabic_words = get_arabic_words(text)
+     arabic_tokens_count = 0
+
+     for token in tokens:
+         if any(is_arabic_char(c) for c in str(token)):
+             arabic_tokens_count += 1
+
+     arabic_fertility = arabic_tokens_count / max(len(arabic_words), 1) if arabic_words else 0
+     diacritic_preserved = has_diacritics(text) == has_diacritics(decoded)
+
+     return TokenizationMetrics(
+         total_tokens=total_tokens,
+         total_words=total_words,
+         total_characters=total_characters,
+         total_bytes=total_bytes,
+         fertility=fertility,
+         compression_ratio=compression_ratio,
+         char_per_token=char_per_token,
+         oov_count=oov_count,
+         oov_percentage=oov_percentage,
+         single_token_words=single_token_words,
+         single_token_retention_rate=strr,
+         avg_subwords_per_word=avg_subwords,
+         max_subwords_per_word=max_subwords,
+         continued_words_ratio=continued_ratio,
+         arabic_char_count=arabic_char_count,
+         arabic_token_count=arabic_tokens_count,
+         arabic_fertility=arabic_fertility,
+         diacritic_preservation=diacritic_preserved,
+         tokenization_time_ms=tokenization_time,
+         tokens=tokens,
+         token_ids=token_ids,
+         decoded_text=decoded
+     )
+
+
+ def analyze_single_tokenizer(tokenizer_choice: str, text: str) -> Tuple[str, str, str, str]:
+     """Analyze a single tokenizer - returns HTML outputs"""
+     from ui_components import (
+         generate_tokenizer_info_card,
+         generate_metrics_card,
+         generate_token_visualization,
+         generate_decoded_section
+     )
+
+     if not text or not text.strip():
+         return (
+             '<div class="warning">⚠️ Please enter some text to analyze</div>',
+             '', '', ''
+         )
+
+     if not tokenizer_choice:
+         return (
+             '<div class="warning">⚠️ Please select a tokenizer</div>',
+             '', '', ''
+         )
+
+     model_id = tokenizer_manager.get_model_id_from_choice(tokenizer_choice)
+     tokenizer_info = tokenizer_manager.get_available_tokenizers().get(model_id)
+
+     if not tokenizer_info:
+         return (
+             '<div class="error-card"><h4>Error</h4><p>Tokenizer not found</p></div>',
+             '', '', ''
+         )
+
+     try:
+         metrics = analyze_tokenization(text, model_id, tokenizer_info)
+
+         info_html = generate_tokenizer_info_card(tokenizer_info)
+         metrics_html = generate_metrics_card(metrics, tokenizer_info)
+         tokens_html = generate_token_visualization(metrics.tokens, metrics.token_ids)
+         decoded_html = generate_decoded_section(metrics)
+
+         return info_html, metrics_html, tokens_html, decoded_html
+
+     except Exception as e:
+         return (
+             f'<div class="error-card"><h4>Error</h4><p>{str(e)}</p></div>',
+             '', '', ''
+         )
+
+
+ def compare_tokenizers(tokenizer_choices: list, text: str) -> str:
+     """Compare multiple tokenizers - returns HTML table"""
+     from config import TokenizationMetrics
+
+     if not text or not text.strip():
+         return '<div class="warning">⚠️ Please enter some text to analyze</div>'
+
+     if not tokenizer_choices or len(tokenizer_choices) < 2:
+         return '<div class="warning">⚠️ Please select at least 2 tokenizers to compare</div>'
+
+     results = []
+
+     for choice in tokenizer_choices:
+         model_id = tokenizer_manager.get_model_id_from_choice(choice)
+         tokenizer_info = tokenizer_manager.get_available_tokenizers().get(model_id)
+
+         if tokenizer_info:
+             try:
+                 metrics = analyze_tokenization(text, model_id, tokenizer_info)
+                 results.append({
+                     'name': tokenizer_info.name,
+                     'org': tokenizer_info.organization,
+                     'type': tokenizer_info.type.value,
+                     'metrics': metrics
+                 })
+             except Exception as e:
+                 results.append({
+                     'name': tokenizer_info.name,
+                     'org': tokenizer_info.organization,
+                     'type': tokenizer_info.type.value,
+                     'error': str(e)
+                 })
+
+     # Sort by fertility (lower is better)
+     def get_fertility(x):
+         if 'error' in x:
+             return 999
+         return x['metrics'].fertility
+
+     results.sort(key=get_fertility)
+
+     # Generate comparison table
+     html = '''
+     <div class="comparison-container">
+     <table class="comparison-table">
+     <thead>
+     <tr>
+     <th>Rank</th>
+     <th>Tokenizer</th>
+     <th>Type</th>
+     <th>Tokens</th>
+     <th>Fertility ↓</th>
+     <th>Compression ↑</th>
+     <th>STRR ↑</th>
+     <th>OOV %</th>
+     </tr>
+     </thead>
+     <tbody>
+     '''
+
+     for i, result in enumerate(results):
+         rank = i + 1
+         rank_class = 'rank-1' if rank == 1 else 'rank-2' if rank == 2 else 'rank-3' if rank == 3 else ''
+
+         if 'error' in result:
+             html += f'''
+             <tr class="{rank_class}">
+             <td>#{rank}</td>
+             <td><strong>{result['name']}</strong><br><small>{result['org']}</small></td>
+             <td>{result['type']}</td>
+             <td colspan="5" class="error">Error: {result['error']}</td>
+             </tr>
+             '''
+         else:
+             m = result['metrics']
+             fertility_class = 'excellent' if m.fertility < 1.5 else 'good' if m.fertility < 2.5 else 'poor'
+
+             html += f'''
+             <tr class="{rank_class}">
+             <td><strong>#{rank}</strong></td>
+             <td><strong>{result['name']}</strong><br><small>{result['org']}</small></td>
+             <td>{result['type']}</td>
+             <td>{m.total_tokens}</td>
+             <td class="{fertility_class}">{m.fertility:.3f}</td>
+             <td>{m.compression_ratio:.2f}</td>
+             <td>{m.single_token_retention_rate:.1%}</td>
+             <td>{m.oov_percentage:.1f}%</td>
+             </tr>
+             '''
+
+     html += '''
+     </tbody>
+     </table>
+     </div>
+     '''
+
+     return html
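
A quick usage sketch for the module above. It assumes the Space's dependencies are installed, that at least one registry tokenizer loads, and that `tokenizer_manager.py` keeps the `get_tokenizer_choices()` helper from the old `app.py`; the sample sentence is illustrative:

```python
from tokenizer_manager import tokenizer_manager
from analysis import analyze_tokenization, compare_tokenizers

text = "اللغة العربية جميلة"  # illustrative sample

# Single-tokenizer metrics: pick the first available registry entry.
model_id, info = next(iter(tokenizer_manager.get_available_tokenizers().items()))
m = analyze_tokenization(text, model_id, info)
print(f"{info.name}: {m.total_tokens} tokens, fertility={m.fertility:.2f}, "
      f"STRR={m.single_token_retention_rate:.1%}")

# Comparison takes the dropdown display strings and needs at least two of them.
html_table = compare_tokenizers(tokenizer_manager.get_tokenizer_choices()[:2], text)
```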
app.py CHANGED
@@ -1,1819 +1,38 @@
  """
- Arabic Tokenizer Arena Pro - Advanced Arabic Tokenization Analysis Platform
- ============================================================================
- A comprehensive research and production-grade tool for evaluating Arabic tokenizers
- across multiple dimensions: efficiency, coverage, morphological awareness, and more.

- Now with LEADERBOARD - imports real Arabic datasets from HuggingFace!
-
- Supports:
- - Arabic-specific tokenizers (Aranizer, AraBERT, CAMeLBERT, MARBERT, etc.)
- - Major LLM tokenizers (Jais, AceGPT, Falcon-Arabic, ALLaM, Qwen, Llama, Mistral, GPT)
- - Comprehensive evaluation metrics based on latest research
- - Real dataset benchmarking from HuggingFace
  """

  import gradio as gr
- import json
- import re
- import time
- import unicodedata
- from typing import Dict, List, Tuple, Optional, Any
- from dataclasses import dataclass, field
- from enum import Enum
- from collections import defaultdict
- import statistics
- import os
-
- # Hugging Face authentication
- HF_TOKEN = os.getenv('HF_TOKEN')
- if HF_TOKEN:
-     HF_TOKEN = HF_TOKEN.strip()
-     from huggingface_hub import login
-     login(token=HF_TOKEN)
-
- from transformers import AutoTokenizer, logging
- logging.set_verbosity_error()
-
- # Import datasets library for leaderboard
- from datasets import load_dataset
-
- # ============================================================================
- # DATA CLASSES AND ENUMS
- # ============================================================================
-
- class TokenizerType(Enum):
-     ARABIC_SPECIFIC = "Arabic-Specific"
-     MULTILINGUAL_LLM = "Multilingual LLM"
-     ARABIC_LLM = "Arabic LLM"
-     ENCODER_ONLY = "Encoder-Only (BERT)"
-     DECODER_ONLY = "Decoder-Only (GPT)"
-
- class TokenizerAlgorithm(Enum):
-     BPE = "Byte-Pair Encoding (BPE)"
-     BBPE = "Byte-Level BPE"
-     WORDPIECE = "WordPiece"
-     SENTENCEPIECE = "SentencePiece"
-     UNIGRAM = "Unigram"
-     TIKTOKEN = "Tiktoken"
-
- @dataclass
- class TokenizerInfo:
-     """Metadata about a tokenizer"""
-     name: str
-     model_id: str
-     type: TokenizerType
-     algorithm: TokenizerAlgorithm
-     vocab_size: int
-     description: str
-     organization: str
-     arabic_support: str  # Native, Adapted, Limited
-     dialect_support: List[str] = field(default_factory=list)
-     special_features: List[str] = field(default_factory=list)
-
- @dataclass
- class TokenizationMetrics:
-     """Comprehensive tokenization evaluation metrics"""
-     # Basic counts
-     total_tokens: int
-     total_words: int
-     total_characters: int
-     total_bytes: int
-
-     # Efficiency metrics
-     fertility: float  # tokens per word (lower is better, 1.0 is ideal)
-     compression_ratio: float  # bytes per token (higher is better)
-     char_per_token: float  # characters per token
-
-     # Coverage metrics
-     oov_count: int  # out-of-vocabulary tokens (UNK)
-     oov_percentage: float
-     single_token_words: int  # words tokenized as single token
-     single_token_retention_rate: float  # STRR metric
-
-     # Morphological metrics
-     avg_subwords_per_word: float
-     max_subwords_per_word: int
-     continued_words_ratio: float  # words split into multiple tokens
-
-     # Arabic-specific metrics
-     arabic_char_count: int
-     arabic_token_count: int
-     arabic_fertility: float
-     diacritic_preservation: bool
-
-     # Performance metrics
-     tokenization_time_ms: float
-
-     # Token details
-     tokens: List[str] = field(default_factory=list)
-     token_ids: List[int] = field(default_factory=list)
-     decoded_text: str = ""
-
- # ============================================================================
- # TOKENIZER REGISTRY - Comprehensive list of Arabic tokenizers
- # ============================================================================

- TOKENIZER_REGISTRY: Dict[str, TokenizerInfo] = {
-     # ========== ARABIC-SPECIFIC BERT MODELS ==========
-     "aubmindlab/bert-base-arabertv2": TokenizerInfo(
-         name="AraBERT v2",
-         model_id="aubmindlab/bert-base-arabertv2",
-         type=TokenizerType.ENCODER_ONLY,
-         algorithm=TokenizerAlgorithm.WORDPIECE,
-         vocab_size=64000,
-         description="Arabic BERT with Farasa segmentation, optimized for MSA",
-         organization="AUB MIND Lab",
-         arabic_support="Native",
-         dialect_support=["MSA"],
-         special_features=["Farasa preprocessing", "Morphological segmentation"]
-     ),
-     "aubmindlab/bert-large-arabertv2": TokenizerInfo(
-         name="AraBERT v2 Large",
-         model_id="aubmindlab/bert-large-arabertv2",
-         type=TokenizerType.ENCODER_ONLY,
-         algorithm=TokenizerAlgorithm.WORDPIECE,
-         vocab_size=64000,
-         description="Large Arabic BERT with enhanced capacity",
-         organization="AUB MIND Lab",
-         arabic_support="Native",
-         dialect_support=["MSA"],
-         special_features=["Large model", "Farasa preprocessing"]
-     ),
-     "CAMeL-Lab/bert-base-arabic-camelbert-mix": TokenizerInfo(
-         name="CAMeLBERT Mix",
-         model_id="CAMeL-Lab/bert-base-arabic-camelbert-mix",
-         type=TokenizerType.ENCODER_ONLY,
-         algorithm=TokenizerAlgorithm.WORDPIECE,
-         vocab_size=30000,
-         description="Pre-trained on MSA, DA, and Classical Arabic mix",
-         organization="CAMeL Lab NYU Abu Dhabi",
-         arabic_support="Native",
-         dialect_support=["MSA", "DA", "CA"],
-         special_features=["Multi-variant Arabic", "Classical Arabic support"]
-     ),
-     "CAMeL-Lab/bert-base-arabic-camelbert-msa": TokenizerInfo(
-         name="CAMeLBERT MSA",
-         model_id="CAMeL-Lab/bert-base-arabic-camelbert-msa",
-         type=TokenizerType.ENCODER_ONLY,
-         algorithm=TokenizerAlgorithm.WORDPIECE,
-         vocab_size=30000,
-         description="Specialized for Modern Standard Arabic",
-         organization="CAMeL Lab NYU Abu Dhabi",
-         arabic_support="Native",
-         dialect_support=["MSA"],
-         special_features=["MSA optimized"]
-     ),
-     "CAMeL-Lab/bert-base-arabic-camelbert-da": TokenizerInfo(
-         name="CAMeLBERT DA",
-         model_id="CAMeL-Lab/bert-base-arabic-camelbert-da",
-         type=TokenizerType.ENCODER_ONLY,
-         algorithm=TokenizerAlgorithm.WORDPIECE,
-         vocab_size=30000,
-         description="Specialized for Dialectal Arabic",
-         organization="CAMeL Lab NYU Abu Dhabi",
-         arabic_support="Native",
-         dialect_support=["Egyptian", "Gulf", "Levantine", "Maghrebi"],
-         special_features=["Dialect optimized"]
-     ),
-     "CAMeL-Lab/bert-base-arabic-camelbert-ca": TokenizerInfo(
-         name="CAMeLBERT CA",
-         model_id="CAMeL-Lab/bert-base-arabic-camelbert-ca",
-         type=TokenizerType.ENCODER_ONLY,
-         algorithm=TokenizerAlgorithm.WORDPIECE,
-         vocab_size=30000,
-         description="Specialized for Classical Arabic",
-         organization="CAMeL Lab NYU Abu Dhabi",
-         arabic_support="Native",
-         dialect_support=["Classical"],
-         special_features=["Classical Arabic", "Religious texts"]
-     ),
-     "UBC-NLP/MARBERT": TokenizerInfo(
-         name="MARBERT",
-         model_id="UBC-NLP/MARBERT",
-         type=TokenizerType.ENCODER_ONLY,
-         algorithm=TokenizerAlgorithm.WORDPIECE,
-         vocab_size=100000,
-         description="Multi-dialectal Arabic BERT trained on Twitter data",
-         organization="UBC NLP",
-         arabic_support="Native",
-         dialect_support=["MSA", "Egyptian", "Gulf", "Levantine", "Maghrebi"],
-         special_features=["Twitter data", "100K vocabulary", "Multi-dialect"]
-     ),
-     "UBC-NLP/ARBERT": TokenizerInfo(
-         name="ARBERT",
-         model_id="UBC-NLP/ARBERT",
-         type=TokenizerType.ENCODER_ONLY,
-         algorithm=TokenizerAlgorithm.WORDPIECE,
-         vocab_size=100000,
-         description="Arabic BERT focused on MSA with large vocabulary",
-         organization="UBC NLP",
-         arabic_support="Native",
-         dialect_support=["MSA"],
-         special_features=["100K vocabulary", "MSA focused"]
-     ),
-
-     # ========== ARABIC-SPECIFIC TOKENIZERS ==========
-     "riotu-lab/Aranizer-PBE-86k": TokenizerInfo(
-         name="Aranizer PBE 86K",
-         model_id="riotu-lab/Aranizer-PBE-86k",
-         type=TokenizerType.ARABIC_SPECIFIC,
-         algorithm=TokenizerAlgorithm.BPE,
-         vocab_size=86000,
-         description="Pair Byte Encoding tokenizer optimized for Arabic LLMs",
-         organization="RIOTU Lab",
-         arabic_support="Native",
-         dialect_support=["MSA"],
-         special_features=["Low fertility", "LLM optimized", "86K vocab"]
-     ),
-     "riotu-lab/Aranizer-SP-86k": TokenizerInfo(
-         name="Aranizer SP 86K",
-         model_id="riotu-lab/Aranizer-SP-86k",
-         type=TokenizerType.ARABIC_SPECIFIC,
-         algorithm=TokenizerAlgorithm.SENTENCEPIECE,
-         vocab_size=86000,
-         description="SentencePiece tokenizer optimized for Arabic",
-         organization="RIOTU Lab",
-         arabic_support="Native",
-         dialect_support=["MSA"],
-         special_features=["Low fertility", "SentencePiece", "86K vocab"]
-     ),
-
-     # ========== ARABIC-SPECIFIC LLMs ==========
-     "ALLaM-AI/ALLaM-7B-Instruct-preview": TokenizerInfo(
-         name="ALLaM 7B Instruct",
-         model_id="ALLaM-AI/ALLaM-7B-Instruct-preview",
-         type=TokenizerType.ARABIC_LLM,
-         algorithm=TokenizerAlgorithm.BPE,
-         vocab_size=128000,
-         description="Saudi Arabia's flagship Arabic LLM by SDAIA, SOTA on Arabic MMLU",
-         organization="SDAIA (Saudi Arabia)",
-         arabic_support="Native",
-         dialect_support=["MSA", "Gulf", "Egyptian", "Levantine"],
-         special_features=["SOTA Arabic", "Islamic values aligned", "Vision 2030"]
-     ),
-     "inception-mbzuai/jais-13b": TokenizerInfo(
-         name="Jais 13B",
-         model_id="inception-mbzuai/jais-13b",
-         type=TokenizerType.ARABIC_LLM,
-         algorithm=TokenizerAlgorithm.SENTENCEPIECE,
-         vocab_size=84992,
-         description="World's most advanced Arabic LLM, trained from scratch",
-         organization="Inception/MBZUAI",
-         arabic_support="Native",
-         dialect_support=["MSA", "Gulf", "Egyptian", "Levantine"],
-         special_features=["Arabic-first", "Lowest fertility", "UAE-native"]
-     ),
-     "inceptionai/jais-family-30b-8k-chat": TokenizerInfo(
-         name="Jais 30B Chat",
-         model_id="inceptionai/jais-family-30b-8k-chat",
-         type=TokenizerType.ARABIC_LLM,
-         algorithm=TokenizerAlgorithm.SENTENCEPIECE,
-         vocab_size=84992,
-         description="Enhanced 30B version with chat capabilities",
-         organization="Inception AI",
-         arabic_support="Native",
-         dialect_support=["MSA", "Gulf", "Egyptian", "Levantine"],
-         special_features=["30B parameters", "Chat optimized", "8K context"]
-     ),
-     "FreedomIntelligence/AceGPT-13B-chat": TokenizerInfo(
-         name="AceGPT 13B Chat",
-         model_id="FreedomIntelligence/AceGPT-13B-chat",
-         type=TokenizerType.ARABIC_LLM,
-         algorithm=TokenizerAlgorithm.SENTENCEPIECE,
-         vocab_size=32000,
-         description="Arabic-enhanced LLaMA with cultural alignment and chat",
-         organization="Freedom Intelligence",
-         arabic_support="Adapted",
-         dialect_support=["MSA"],
-         special_features=["LLaMA-based", "Cultural alignment", "RLHF", "Chat"]
-     ),
-     "silma-ai/SILMA-9B-Instruct-v1.0": TokenizerInfo(
-         name="SILMA 9B Instruct",
-         model_id="silma-ai/SILMA-9B-Instruct-v1.0",
-         type=TokenizerType.ARABIC_LLM,
-         algorithm=TokenizerAlgorithm.SENTENCEPIECE,
-         vocab_size=256000,
-         description="Top-ranked Arabic LLM based on Gemma, outperforms larger models",
-         organization="SILMA AI",
-         arabic_support="Native",
-         dialect_support=["MSA", "Gulf", "Egyptian", "Levantine"],
-         special_features=["Gemma-based", "SOTA 9B class", "Efficient"]
-     ),
-     "QCRI/Fanar-1-9B-Instruct": TokenizerInfo(
-         name="Fanar 9B Instruct",
-         model_id="QCRI/Fanar-1-9B-Instruct",
-         type=TokenizerType.ARABIC_LLM,
-         algorithm=TokenizerAlgorithm.SENTENCEPIECE,
-         vocab_size=256000,
-         description="Qatar's Arabic LLM aligned with Islamic values and Arab culture",
-         organization="QCRI (Qatar)",
-         arabic_support="Native",
-         dialect_support=["MSA", "Gulf", "Egyptian", "Levantine"],
-         special_features=["Islamic RAG", "Cultural alignment", "Gemma-based"]
-     ),
-     "Navid-AI/Yehia-7B-preview": TokenizerInfo(
-         name="Yehia 7B Preview",
-         model_id="Navid-AI/Yehia-7B-preview",
-         type=TokenizerType.ARABIC_LLM,
-         algorithm=TokenizerAlgorithm.BPE,
-         vocab_size=128256,
-         description="Best Arabic model on AraGen-Leaderboard (0.5B-25B), GRPO trained",
-         organization="Navid AI",
-         arabic_support="Native",
-         dialect_support=["MSA", "Gulf", "Egyptian", "Levantine"],
-         special_features=["GRPO trained", "3C3H aligned", "SOTA AraGen"]
-     ),
-
-     # ========== DIALECT-SPECIFIC MODELS ==========
-     "MBZUAI-Paris/Atlas-Chat-9B": TokenizerInfo(
-         name="Atlas-Chat 9B (Darija)",
-         model_id="MBZUAI-Paris/Atlas-Chat-9B",
-         type=TokenizerType.ARABIC_LLM,
-         algorithm=TokenizerAlgorithm.SENTENCEPIECE,
-         vocab_size=256000,
-         description="First LLM for Moroccan Arabic (Darija), Gemma-based",
-         organization="MBZUAI Paris",
-         arabic_support="Native",
-         dialect_support=["Darija", "MSA"],
-         special_features=["Moroccan dialect", "Transliteration", "Cultural"]
-     ),
-
-     # ========== MULTILINGUAL LLMs WITH ARABIC SUPPORT ==========
-     "Qwen/Qwen2.5-7B": TokenizerInfo(
-         name="Qwen 2.5 7B",
-         model_id="Qwen/Qwen2.5-7B",
-         type=TokenizerType.MULTILINGUAL_LLM,
-         algorithm=TokenizerAlgorithm.BPE,
-         vocab_size=151936,
-         description="Alibaba's multilingual LLM with 30+ language support",
-         organization="Alibaba Qwen",
-         arabic_support="Supported",
-         dialect_support=["MSA"],
-         special_features=["152K vocab", "128K context", "30+ languages"]
-     ),
-     "google/gemma-2-9b": TokenizerInfo(
-         name="Gemma 2 9B",
-         model_id="google/gemma-2-9b",
-         type=TokenizerType.MULTILINGUAL_LLM,
-         algorithm=TokenizerAlgorithm.SENTENCEPIECE,
-         vocab_size=256000,
-         description="Google's efficient multilingual model",
-         organization="Google",
-         arabic_support="Supported",
-         dialect_support=["MSA"],
-         special_features=["256K vocab", "Efficient architecture"]
-     ),
-     "mistralai/Mistral-7B-v0.3": TokenizerInfo(
-         name="Mistral 7B v0.3",
-         model_id="mistralai/Mistral-7B-v0.3",
-         type=TokenizerType.MULTILINGUAL_LLM,
-         algorithm=TokenizerAlgorithm.SENTENCEPIECE,
-         vocab_size=32768,
-         description="Efficient multilingual model with sliding window attention",
-         organization="Mistral AI",
-         arabic_support="Limited",
-         dialect_support=["MSA"],
-         special_features=["Sliding window", "Efficient"]
-     ),
-     "mistralai/Mistral-Nemo-Base-2407": TokenizerInfo(
-         name="Mistral Nemo",
-         model_id="mistralai/Mistral-Nemo-Base-2407",
-         type=TokenizerType.MULTILINGUAL_LLM,
-         algorithm=TokenizerAlgorithm.TIKTOKEN,
-         vocab_size=131072,
-         description="Uses Tekken tokenizer, optimized for multilingual",
-         organization="Mistral AI + NVIDIA",
-         arabic_support="Supported",
-         dialect_support=["MSA"],
-         special_features=["Tekken tokenizer", "131K vocab", "Multilingual optimized"]
-     ),
-     "xlm-roberta-base": TokenizerInfo(
-         name="XLM-RoBERTa Base",
-         model_id="xlm-roberta-base",
-         type=TokenizerType.MULTILINGUAL_LLM,
-         algorithm=TokenizerAlgorithm.SENTENCEPIECE,
-         vocab_size=250002,
-         description="Cross-lingual model covering 100 languages",
-         organization="Facebook AI",
-         arabic_support="Supported",
-         dialect_support=["MSA"],
-         special_features=["250K vocab", "100 languages"]
-     ),
-     "bert-base-multilingual-cased": TokenizerInfo(
-         name="mBERT",
-         model_id="bert-base-multilingual-cased",
-         type=TokenizerType.MULTILINGUAL_LLM,
-         algorithm=TokenizerAlgorithm.WORDPIECE,
-         vocab_size=119547,
-         description="Original multilingual BERT, baseline for comparison",
-         organization="Google",
-         arabic_support="Limited",
-         dialect_support=["MSA"],
-         special_features=["Baseline model", "104 languages"]
-     ),
-     "tiiuae/falcon-7b": TokenizerInfo(
-         name="Falcon 7B",
-         model_id="tiiuae/falcon-7b",
-         type=TokenizerType.MULTILINGUAL_LLM,
-         algorithm=TokenizerAlgorithm.BPE,
-         vocab_size=65024,
-         description="TII's powerful open-source LLM",
-         organization="Technology Innovation Institute",
-         arabic_support="Limited",
-         dialect_support=["MSA"],
-         special_features=["65K vocab", "RefinedWeb trained"]
-     ),
- }

- # ============================================================================
- # LEADERBOARD DATASETS CONFIGURATION - Real HuggingFace Datasets
- # ============================================================================
-
- LEADERBOARD_DATASETS = {
-     # MSA Benchmarks
-     "arabic_mmlu": {
-         "hf_id": "MBZUAI/ArabicMMLU",
-         "name": "ArabicMMLU",
-         "category": "MSA Benchmark",
-         "text_column": "Question",
-         "split": "test",
-         "subset": None,
-         "samples": 500,
-         "description": "Multi-task benchmark from Arab school exams (14,575 MCQs)"
-     },
-
-     # Dialectal Arabic
-     "arsentd_lev": {
-         "hf_id": "ramybaly/arsentd_lev",
-         "name": "ArSenTD-LEV",
-         "category": "Levantine Dialect",
-         "text_column": "Tweet",
-         "split": "train",
-         "subset": None,
-         "samples": 500,
-         "description": "Levantine Arabic tweets (Jordan, Lebanon, Syria, Palestine)"
-     },
-
-     # Classical Arabic
-     "athar": {
-         "hf_id": "mohamed-khalil/ATHAR",
-         "name": "ATHAR Classical",
-         "category": "Classical Arabic",
-         "text_column": "arabic",
-         "split": "train",
-         "subset": None,
-         "samples": 500,
-         "description": "66K classical Arabic sentences with translations"
-     },
-
-     # Question Answering
-     "arcd": {
-         "hf_id": "arcd",
-         "name": "ARCD",
-         "category": "QA Dataset",
-         "text_column": "context",
-         "split": "train",
-         "subset": None,
-         "samples": 300,
-         "description": "Arabic Reading Comprehension Dataset (1,395 questions)"
-     },
-
-     # Poetry
-     "ashaar": {
-         "hf_id": "arbml/Ashaar_dataset",
-         "name": "Ashaar Poetry",
-         "category": "Poetry",
-         "text_column": "poem_text",
-         "split": "train",
-         "subset": None,
-         "samples": 500,
-         "description": "2M+ Arabic poetry verses with meter and theme labels"
-     },
-
-     # Religious - Hadith
-     "hadith": {
-         "hf_id": "gurgutan/sunnah_ar_en_dataset",
-         "name": "Hadith Collection",
-         "category": "Religious",
-         "text_column": "hadith_text_ar",
-         "split": "train",
-         "subset": None,
-         "samples": 400,
-         "description": "50,762 hadiths from 14 authentic books"
-     },
-
-     # Social Media
-     "arabic_sentiment": {
-         "hf_id": "arbml/Arabic_Sentiment_Twitter_Corpus",
-         "name": "Arabic Sentiment",
-         "category": "Social Media",
-         "text_column": "text",
-         "split": "train",
-         "subset": None,
-         "samples": 500,
-         "description": "Arabic Twitter sentiment corpus"
-     },
-
-     # News
-     "sanad": {
-         "hf_id": "arbml/SANAD",
-         "name": "SANAD News",
-         "category": "News",
-         "text_column": "text",
-         "split": "train",
-         "subset": "alarabiya",
-         "samples": 400,
-         "description": "Arabic news articles from Al Arabiya"
-     },
- }
-
- # ============================================================================
- # TOKENIZER LOADER AND CACHE
- # ============================================================================
-
- class TokenizerManager:
-     """Manages tokenizer loading and caching"""
-
-     def __init__(self):
-         self._cache: Dict[str, Any] = {}
-         self._available: Dict[str, TokenizerInfo] = {}
-         self._initialize_available_tokenizers()
-
-     def _initialize_available_tokenizers(self):
-         """Check which tokenizers are available and can be loaded"""
-         print("Initializing tokenizer registry...")
-
-         # Add all base tokenizers
-         for model_id, info in TOKENIZER_REGISTRY.items():
-             try:
-                 # Quick check if tokenizer can be loaded
-                 _ = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
-                 self._available[model_id] = info
-                 print(f"  ✓ {info.name}")
-             except Exception as e:
-                 print(f"  ✗ {info.name}: {str(e)[:50]}")
-
-         print(f"\nTotal available tokenizers: {len(self._available)}")
-
-     def get_tokenizer(self, model_id: str):
-         """Get tokenizer from cache or load it"""
-         if model_id not in self._cache:
-             self._cache[model_id] = AutoTokenizer.from_pretrained(
-                 model_id,
-                 trust_remote_code=True
-             )
-         return self._cache[model_id]
-
-     def get_available_tokenizers(self) -> Dict[str, TokenizerInfo]:
-         return self._available
-
-     def get_tokenizer_choices(self) -> List[str]:
-         """Get list of tokenizer display names for dropdown"""
-         return [f"{info.name} ({info.organization})" for info in self._available.values()]
-
-     def get_model_id_from_choice(self, choice: str) -> str:
-         """Convert display choice back to model ID"""
-         for model_id, info in self._available.items():
-             if f"{info.name} ({info.organization})" == choice:
-                 return model_id
-         return list(self._available.keys())[0]
-
- # Global tokenizer manager
- tokenizer_manager = TokenizerManager()
-
- # ============================================================================
- # ARABIC TEXT UTILITIES
- # ============================================================================
-
- def is_arabic_char(char: str) -> bool:
-     """Check if character is Arabic"""
-     if len(char) != 1:
-         return False
-     code = ord(char)
-     return (
-         (0x0600 <= code <= 0x06FF) or   # Arabic
-         (0x0750 <= code <= 0x077F) or   # Arabic Supplement
-         (0x08A0 <= code <= 0x08FF) or   # Arabic Extended-A
-         (0xFB50 <= code <= 0xFDFF) or   # Arabic Presentation Forms-A
-         (0xFE70 <= code <= 0xFEFF)      # Arabic Presentation Forms-B
-     )
-
- def count_arabic_chars(text: str) -> int:
-     """Count Arabic characters in text"""
-     return sum(1 for c in text if is_arabic_char(c))
-
- def has_diacritics(text: str) -> bool:
-     """Check if text contains Arabic diacritics (tashkeel)"""
-     diacritics = set('ًٌٍَُِّْٰ')
-     return any(c in diacritics for c in text)
-
- def get_arabic_words(text: str) -> List[str]:
-     """Extract Arabic words from text"""
-     words = text.split()
-     return [w for w in words if any(is_arabic_char(c) for c in w)]
-
- # ============================================================================
- # TOKENIZATION ANALYSIS ENGINE
- # ============================================================================
-
- def analyze_tokenization(
-     text: str,
-     model_id: str,
-     tokenizer_info: TokenizerInfo
- ) -> TokenizationMetrics:
-     """Perform comprehensive tokenization analysis"""
-
-     tokenizer = tokenizer_manager.get_tokenizer(model_id)
-
-     # Time the tokenization
-     start_time = time.perf_counter()
-     tokens = tokenizer.tokenize(text)
-     token_ids = tokenizer.encode(text, add_special_tokens=False)
-     tokenization_time = (time.perf_counter() - start_time) * 1000
-
-     decoded = tokenizer.decode(token_ids, skip_special_tokens=True)
-
-     # Basic counts
-     words = text.split()
-     total_words = len(words)
-     total_tokens = len(tokens)
-     total_characters = len(text)
-     total_bytes = len(text.encode('utf-8'))
-
-     # Efficiency metrics
-     fertility = total_tokens / max(total_words, 1)
-     compression_ratio = total_bytes / max(total_tokens, 1)
-     char_per_token = total_characters / max(total_tokens, 1)
-
-     # OOV analysis
-     unk_token = tokenizer.unk_token if hasattr(tokenizer, 'unk_token') else '[UNK]'
-     oov_count = sum(1 for t in tokens if t == unk_token or '[UNK]' in str(t))
-     oov_percentage = (oov_count / max(total_tokens, 1)) * 100
-
-     # Single Token Retention Rate (STRR)
-     single_token_words = 0
-     subwords_per_word = []
-
-     for word in words:
-         word_tokens = tokenizer.tokenize(word)
-         subwords_per_word.append(len(word_tokens))
-         if len(word_tokens) == 1:
-             single_token_words += 1
-
-     strr = single_token_words / max(total_words, 1)
-     avg_subwords = sum(subwords_per_word) / max(len(subwords_per_word), 1)
-     max_subwords = max(subwords_per_word) if subwords_per_word else 0
-     continued_ratio = (total_words - single_token_words) / max(total_words, 1)
-
-     # Arabic-specific metrics
-     arabic_char_count = count_arabic_chars(text)
-     arabic_words = get_arabic_words(text)
-     arabic_tokens_count = 0
-
-     for token in tokens:
-         if any(is_arabic_char(c) for c in str(token)):
-             arabic_tokens_count += 1
-
-     arabic_fertility = arabic_tokens_count / max(len(arabic_words), 1) if arabic_words else 0
-     diacritic_preserved = has_diacritics(text) == has_diacritics(decoded)
-
-     return TokenizationMetrics(
-         total_tokens=total_tokens,
-         total_words=total_words,
-         total_characters=total_characters,
-         total_bytes=total_bytes,
-         fertility=fertility,
-         compression_ratio=compression_ratio,
-         char_per_token=char_per_token,
-         oov_count=oov_count,
-         oov_percentage=oov_percentage,
-         single_token_words=single_token_words,
-         single_token_retention_rate=strr,
-         avg_subwords_per_word=avg_subwords,
-         max_subwords_per_word=max_subwords,
-         continued_words_ratio=continued_ratio,
-         arabic_char_count=arabic_char_count,
-         arabic_token_count=arabic_tokens_count,
-         arabic_fertility=arabic_fertility,
-         diacritic_preservation=diacritic_preserved,
-         tokenization_time_ms=tokenization_time,
-         tokens=tokens,
-         token_ids=token_ids,
-         decoded_text=decoded
-     )
-
- # ============================================================================
- # LEADERBOARD FUNCTIONS - Import Real Datasets from HuggingFace
- # ============================================================================
-
- class HFDatasetLoader:
-     """Load Arabic datasets from HuggingFace"""
-
-     def __init__(self):
-         self.cache = {}
-
-     def load_dataset_texts(self, dataset_key: str) -> Tuple[List[str], str]:
-         """Load texts from a HuggingFace dataset"""
-
-         if dataset_key in self.cache:
-             return self.cache[dataset_key], f"✅ Loaded {len(self.cache[dataset_key])} samples (cached)"
-
-         config = LEADERBOARD_DATASETS.get(dataset_key)
-         if not config:
-             return [], f"❌ Unknown dataset: {dataset_key}"
-
-         try:
-             # Load dataset from HuggingFace
-             if config.get("subset"):
-                 ds = load_dataset(
-                     config["hf_id"],
-                     config["subset"],
-                     split=config["split"],
-                     trust_remote_code=True
-                 )
-             else:
-                 ds = load_dataset(
-                     config["hf_id"],
-                     split=config["split"],
-                     trust_remote_code=True
-                 )
-
-             texts = []
-             text_col = config["text_column"]
-
-             # Try to find text column
-             if text_col not in ds.column_names:
-                 for col in ["text", "content", "sentence", "arabic", "context", "Tweet", "question", "poem_text", "hadith_text_ar"]:
-                     if col in ds.column_names:
-                         text_col = col
-                         break
-
-             # Extract texts
-             max_samples = config.get("samples", 500)
-             for i, item in enumerate(ds):
-                 if i >= max_samples:
-                     break
-                 text = item.get(text_col, "")
-                 if text and isinstance(text, str) and len(text.strip()) > 10:
-                     texts.append(text.strip())
-
-             self.cache[dataset_key] = texts
-             return texts, f"✅ Loaded {len(texts)} samples from HuggingFace"
-
-         except Exception as e:
-             return [], f"❌ Error loading {config['hf_id']}: {str(e)[:80]}"
-
- def evaluate_tokenizer_on_texts(tokenizer, texts: List[str]) -> Optional[Dict]:
-     """Evaluate a tokenizer on a list of texts"""
-
-     fertilities = []
-     compressions = []
-     unk_counts = 0
-     total_tokens = 0
-
-     for text in texts:
-         try:
-             tokens = tokenizer.encode(text, add_special_tokens=False)
-             decoded = tokenizer.convert_ids_to_tokens(tokens)
-
-             num_tokens = len(tokens)
-             num_words = len(text.split()) or 1
-             num_bytes = len(text.encode('utf-8'))
-
-             fertility = num_tokens / num_words
-             compression = num_bytes / num_tokens if num_tokens > 0 else 0
-
-             # Count UNKs
-             unk_token = getattr(tokenizer, 'unk_token', '[UNK]')
-             unks = sum(1 for t in decoded if t and (t == unk_token or '<unk>' in str(t).lower() or '[unk]' in str(t).lower()))
-
-             fertilities.append(fertility)
-             compressions.append(compression)
-             unk_counts += unks
-             total_tokens += num_tokens
-
-         except Exception:
-             continue
-
-     if not fertilities:
-         return None
-
-     return {
-         "avg_fertility": statistics.mean(fertilities),
-         "std_fertility": statistics.stdev(fertilities) if len(fertilities) > 1 else 0,
-         "avg_compression": statistics.mean(compressions),
-         "unk_ratio": unk_counts / total_tokens if total_tokens > 0 else 0,
-         "samples": len(fertilities)
-     }
-
- def calculate_leaderboard_score(fertility: float, compression: float, unk_ratio: float) -> float:
-     """Calculate overall score (0-100, higher is better)"""
-     # Lower fertility is better (ideal ~1.0 for Arabic)
-     fertility_score = max(0, min(1, 2.0 / fertility)) if fertility > 0 else 0
-     # Higher compression is better
-     compression_score = min(1, compression / 6)
-     # Lower UNK is better
-     unk_score = 1 - min(1, unk_ratio * 20)
-
-     # Weighted combination
-     score = (fertility_score * 0.45 + compression_score * 0.35 + unk_score * 0.20) * 100
-     return round(score, 1)
-
- def run_leaderboard_evaluation(
-     selected_datasets: List[str],
-     selected_tokenizers: List[str],
-     progress=gr.Progress()
- ) -> Tuple[str, str, str]:
-     """
-     Run the full leaderboard evaluation with real HF datasets
-     Returns: (leaderboard_html, per_dataset_html, status_message)
-     """
-
-     if not selected_datasets:
-         return "", "", "⚠️ Please select at least one dataset"
-
-     if not selected_tokenizers:
-         return "", "", "⚠️ Please select at least one tokenizer"
-
-     loader = HFDatasetLoader()
-     results = defaultdict(dict)
-
-     # Status tracking
-     status_lines = []
-
-     # Load datasets from HuggingFace
-     status_lines.append("📚 **Loading Datasets from HuggingFace:**\n")
-     loaded_datasets = {}
-
-     for i, ds_key in enumerate(selected_datasets):
-         progress((i + 1) / len(selected_datasets) * 0.3, f"Loading {ds_key}...")
-         texts, msg = loader.load_dataset_texts(ds_key)
-         ds_name = LEADERBOARD_DATASETS[ds_key]["name"]
-         status_lines.append(f"  • {ds_name}: {msg}")
-         if texts:
-             loaded_datasets[ds_key] = texts
-
-     if not loaded_datasets:
-         return "", "", "\n".join(status_lines) + "\n\n❌ No datasets loaded successfully"
-
-     # Evaluate tokenizers
-     status_lines.append("\n🔄 **Evaluating Tokenizers:**\n")
-
-     tokenizer_cache = {}
-     total_steps = len(selected_tokenizers) * len(loaded_datasets)
-     current_step = 0
-
-     for tok_choice in selected_tokenizers:
-         # Get model ID from choice
-         tok_id = tokenizer_manager.get_model_id_from_choice(tok_choice)
-         tok_info = tokenizer_manager.get_available_tokenizers().get(tok_id)
-         tok_name = tok_info.name if tok_info else tok_choice
-
-         # Load tokenizer
-         try:
-             if tok_id not in tokenizer_cache:
-                 tokenizer_cache[tok_id] = AutoTokenizer.from_pretrained(
-                     tok_id, trust_remote_code=True
-                 )
-             tokenizer = tokenizer_cache[tok_id]
-             status_lines.append(f"  • {tok_name}: ✅ Loaded")
-         except Exception as e:
-             status_lines.append(f"  • {tok_name}: ❌ Failed ({str(e)[:30]})")
-             continue
-
-         # Evaluate on each dataset
-         for ds_key, texts in loaded_datasets.items():
-             current_step += 1
-             progress(0.3 + (current_step / total_steps) * 0.6, f"Evaluating {tok_name} on {ds_key}...")
-
-             metrics = evaluate_tokenizer_on_texts(tokenizer, texts)
-             if metrics:
-                 results[tok_choice][ds_key] = metrics
-
-     # Generate leaderboard
-     progress(0.95, "Generating leaderboard...")
-
-     leaderboard_data = []
-     per_dataset_data = []
-
-     for tok_choice, ds_results in results.items():
-         if not ds_results:
-             continue
-
-         tok_id = tokenizer_manager.get_model_id_from_choice(tok_choice)
-         tok_info = tokenizer_manager.get_available_tokenizers().get(tok_id)
-
-         # Aggregate across datasets
-         all_fertility = [m["avg_fertility"] for m in ds_results.values()]
-         all_compression = [m["avg_compression"] for m in ds_results.values()]
-         all_unk = [m["unk_ratio"] for m in ds_results.values()]
-
-         avg_fertility = statistics.mean(all_fertility)
-         avg_compression = statistics.mean(all_compression)
-         avg_unk = statistics.mean(all_unk)
-
-         score = calculate_leaderboard_score(avg_fertility, avg_compression, avg_unk)
-
-         leaderboard_data.append({
-             "name": tok_info.name if tok_info else tok_choice,
-             "type": tok_info.type.value if tok_info else "Unknown",
-             "org": tok_info.organization if tok_info else "Unknown",
-             "score": score,
-             "fertility": avg_fertility,
-             "compression": avg_compression,
-             "unk_ratio": avg_unk,
-             "num_datasets": len(ds_results)
-         })
-
-         # Per-dataset row
-         per_ds_row = {"Tokenizer": tok_info.name if tok_info else tok_choice}
-         for ds_key in selected_datasets:
-             ds_name = LEADERBOARD_DATASETS[ds_key]["name"]
-             if ds_key in ds_results:
-                 per_ds_row[ds_name] = round(ds_results[ds_key]["avg_fertility"], 2)
-             else:
-                 per_ds_row[ds_name] = "-"
-         per_dataset_data.append(per_ds_row)
-
-     # Sort by score
-     leaderboard_data.sort(key=lambda x: x["score"], reverse=True)
-
-     # Create HTML tables
-     leaderboard_html = generate_leaderboard_html(leaderboard_data)
-     per_dataset_html = generate_per_dataset_html(per_dataset_data, selected_datasets)
-
-     status_lines.append(f"\n✅ **Evaluation Complete!** Evaluated {len(results)} tokenizers on {len(loaded_datasets)} datasets.")
-
-     return leaderboard_html, per_dataset_html, "\n".join(status_lines)
-
- def generate_leaderboard_html(data: List[Dict]) -> str:
-     """Generate HTML for main leaderboard"""
-
-     if not data:
-         return "<p>No results to display</p>"
-
-     html = """
-     <style>
-     .leaderboard-table {
-         width: 100%;
-         border-collapse: collapse;
-         font-family: system-ui, -apple-system, sans-serif;
-         margin: 20px 0;
-     }
-     .leaderboard-table th {
-         background: linear-gradient(135deg, #1a5f2a 0%, #2d8f4e 100%);
-         color: white;
-         padding: 12px 8px;
-         text-align: left;
-         font-weight: 600;
-     }
-     .leaderboard-table td {
-         padding: 10px 8px;
-         border-bottom: 1px solid #e0e0e0;
-     }
-     .leaderboard-table tr:nth-child(even) {
-         background-color: #f8f9fa;
-     }
-     .leaderboard-table tr:hover {
-         background-color: #e8f5e9;
-     }
-     .rank-1 { background: linear-gradient(90deg, #ffd700 0%, #fff8dc 100%) !important; }
-     .rank-2 { background: linear-gradient(90deg, #c0c0c0 0%, #f5f5f5 100%) !important; }
-     .rank-3 { background: linear-gradient(90deg, #cd7f32 0%, #ffe4c4 100%) !important; }
-     .score-badge {
-         background: #2d8f4e;
-         color: white;
-         padding: 4px 8px;
-         border-radius: 12px;
-         font-weight: bold;
-     }
-     .type-badge {
-         background: #e3f2fd;
-         color: #1565c0;
-         padding: 2px 6px;
-         border-radius: 4px;
-         font-size: 0.85em;
-     }
-     .metric-good { color: #2e7d32; font-weight: 600; }
-     .metric-bad { color: #c62828; }
-     </style>
-
-     <table class="leaderboard-table">
-     <thead>
-     <tr>
-     <th>Rank</th>
-     <th>Tokenizer</th>
-     <th>Type</th>
-     <th>Organization</th>
-     <th>Score ↑</th>
-     <th>Fertility ↓</th>
-     <th>Compression ↑</th>
-     <th>UNK Rate ↓</th>
-     <th>Datasets</th>
-     </tr>
-     </thead>
-     <tbody>
-     """
-
-     for i, entry in enumerate(data):
-         rank = i + 1
-         rank_class = f"rank-{rank}" if rank <= 3 else ""
-
-         # Color coding for metrics
-         fert_class = "metric-good" if entry["fertility"] < 2.0 else "metric-bad" if entry["fertility"] > 3.0 else ""
-         comp_class = "metric-good" if entry["compression"] > 3.5 else ""
-         unk_class = "metric-good" if entry["unk_ratio"] < 0.01 else "metric-bad" if entry["unk_ratio"] > 0.05 else ""
-
-         html += f"""
-         <tr class="{rank_class}">
-         <td><strong>#{rank}</strong></td>
-         <td><strong>{entry["name"]}</strong></td>
-         <td><span class="type-badge">{entry["type"]}</span></td>
-         <td>{entry["org"]}</td>
-         <td><span class="score-badge">{entry["score"]}</span></td>
-         <td class="{fert_class}">{entry["fertility"]:.3f}</td>
-         <td class="{comp_class}">{entry["compression"]:.2f}</td>
-         <td class="{unk_class}">{entry["unk_ratio"]:.2%}</td>
-         <td>{entry["num_datasets"]}</td>
-         </tr>
-         """
-
-     html += """
-     </tbody>
-     </table>
-
-     <div style="margin-top: 15px; padding: 10px; background: #f5f5f5; border-radius: 8px; font-size: 0.9em;">
-     <strong>📊 Metric Guide:</strong><br>
-     • <strong>Score:</strong> Overall ranking (0-100, higher = better)<br>
-     • <strong>Fertility:</strong> Tokens per word (lower = better, 1.0 ideal for Arabic)<br>
-     • <strong>Compression:</strong> Bytes per token (higher = more efficient)<br>
-     • <strong>UNK Rate:</strong> Unknown token percentage (lower = better)
-     </div>
-     """
-
-     return html
-
- def generate_per_dataset_html(data: List[Dict], dataset_keys: List[str]) -> str:
-     """Generate HTML for per-dataset fertility table"""
-
-     if not data:
-         return "<p>No per-dataset results</p>"
-
-     ds_names = [LEADERBOARD_DATASETS[k]["name"] for k in dataset_keys]
-
-     html = """
-     <style>
-     .dataset-table {
-         width: 100%;
-         border-collapse: collapse;
-         font-family: system-ui, -apple-system, sans-serif;
-         margin: 20px 0;
-         font-size: 0.9em;
-     }
-     .dataset-table th {
-         background: #37474f;
-         color: white;
-         padding: 10px 6px;
-         text-align: center;
-     }
-     .dataset-table th:first-child {
-         text-align: left;
-     }
-     .dataset-table td {
-         padding: 8px 6px;
-         text-align: center;
-         border-bottom: 1px solid #e0e0e0;
-     }
-     .dataset-table td:first-child {
-         text-align: left;
-         font-weight: 500;
-     }
-     .dataset-table tr:nth-child(even) {
-         background-color: #fafafa;
-     }
-     .fert-excellent { background: #c8e6c9; color: #1b5e20; font-weight: 600; }
-     .fert-good { background: #fff9c4; color: #f57f17; }
-     .fert-poor { background: #ffcdd2; color: #b71c1c; }
-     </style>
-
-     <h4>📈 Fertility per Dataset (tokens/word - lower is better)</h4>
-     <table class="dataset-table">
-     <thead>
-     <tr>
-     <th>Tokenizer</th>
-     """
-
-     for ds_name in ds_names:
-         html += f"<th>{ds_name}</th>"
-
-     html += """
-     </tr>
-     </thead>
-     <tbody>
-     """
-
-     for row in data:
-         html += f"<tr><td>{row['Tokenizer']}</td>"
-         for ds_name in ds_names:
-             val = row.get(ds_name, "-")
-             if val != "-":
-                 if val < 1.8:
-                     cls = "fert-excellent"
-                 elif val < 2.5:
-                     cls = "fert-good"
-                 else:
-                     cls = "fert-poor"
-                 html += f'<td class="{cls}">{val}</td>'
-             else:
-                 html += '<td>-</td>'
-         html += "</tr>"
-
-     html += """
-     </tbody>
-     </table>
-     """
-
-     return html
-
- # ============================================================================
- # UI GENERATION FUNCTIONS
- # ============================================================================
-
- def generate_token_visualization(tokens: List[str], token_ids: List[int]) -> str:
-     """Generate beautiful HTML visualization of tokens"""
-
-     colors = [
-         ('#1a1a2e', '#eaeaea'),
-         ('#16213e', '#f0f0f0'),
-         ('#0f3460', '#ffffff'),
-         ('#533483', '#f5f5f5'),
-         ('#e94560', '#ffffff'),
-         ('#0f4c75', '#f0f0f0'),
-         ('#3282b8', '#ffffff'),
-         ('#bbe1fa', '#1a1a2e'),
-     ]
-
-     html_parts = []
-     for i, (token, tid) in enumerate(zip(tokens, token_ids)):
-         bg, fg = colors[i % len(colors)]
-         display_token = token.replace('<', '&lt;').replace('>', '&gt;')
-         is_arabic = any(is_arabic_char(c) for c in token)
-         direction = 'rtl' if is_arabic else 'ltr'
-
-         html_parts.append(f'''
-         <span class="token" style="
-             background: {bg};
-             color: {fg};
-             direction: {direction};
-         " title="ID: {tid}">
-             {display_token}
-             <span class="token-id">{tid}</span>
-         </span>
-         ''')
-
-     return f'''
-     <div class="token-container">
-         {''.join(html_parts)}
-     </div>
-     '''
-
- def generate_metrics_card(metrics: TokenizationMetrics, info: TokenizerInfo) -> str:
-     """Generate metrics visualization card"""
-
-     fertility_quality = "excellent" if metrics.fertility < 1.5 else "good" if metrics.fertility < 2.5 else "poor"
-     strr_quality = "excellent" if metrics.single_token_retention_rate > 0.5 else "good" if metrics.single_token_retention_rate > 0.3 else "poor"
-     compression_quality = "excellent" if metrics.compression_ratio > 4 else "good" if metrics.compression_ratio > 2.5 else "poor"
-
-     return f'''
-     <div class="metrics-grid">
-         <div class="metric-card primary">
-             <div class="metric-icon">📊</div>
-             <div class="metric-value">{metrics.total_tokens}</div>
-             <div class="metric-label">Total Tokens</div>
-         </div>
-
-         <div class="metric-card {fertility_quality}">
-             <div class="metric-icon">🎯</div>
-             <div class="metric-value">{metrics.fertility:.3f}</div>
-             <div class="metric-label">Fertility (tokens/word)</div>
-             <div class="metric-hint">Lower is better (1.0 ideal)</div>
-         </div>
-
-         <div class="metric-card {compression_quality}">
-             <div class="metric-icon">📦</div>
-             <div class="metric-value">{metrics.compression_ratio:.2f}</div>
-             <div class="metric-label">Compression (bytes/token)</div>
-             <div class="metric-hint">Higher is better</div>
-         </div>
-
-         <div class="metric-card {strr_quality}">
-             <div class="metric-icon">✨</div>
-             <div class="metric-value">{metrics.single_token_retention_rate:.1%}</div>
-             <div class="metric-label">STRR (Single Token Retention)</div>
-             <div class="metric-hint">Higher is better</div>
-         </div>
-
-         <div class="metric-card">
-             <div class="metric-icon">🔤</div>
-             <div class="metric-value">{metrics.char_per_token:.2f}</div>
-             <div class="metric-label">Characters/Token</div>
-         </div>
-
-         <div class="metric-card {'excellent' if metrics.oov_percentage == 0 else 'poor' if metrics.oov_percentage > 5 else 'good'}">
-             <div class="metric-icon">❓</div>
-             <div class="metric-value">{metrics.oov_percentage:.1f}%</div>
-             <div class="metric-label">OOV Rate</div>
-             <div class="metric-hint">Lower is better (0% ideal)</div>
-         </div>
-
-         <div class="metric-card">
-             <div class="metric-icon">🌍</div>
-             <div class="metric-value">{metrics.arabic_fertility:.3f}</div>
-             <div class="metric-label">Arabic Fertility</div>
-         </div>
-
-         <div class="metric-card">
-             <div class="metric-icon">⚡</div>
-             <div class="metric-value">{metrics.tokenization_time_ms:.2f}ms</div>
-             <div class="metric-label">Processing Time</div>
-         </div>
-     </div>
-     '''
-
- def generate_tokenizer_info_card(info: TokenizerInfo) -> str:
-     """Generate tokenizer information card"""
-
-     dialect_badges = ''.join([f'<span class="badge dialect">{d}</span>' for d in info.dialect_support])
-     feature_badges = ''.join([f'<span class="badge feature">{f}</span>' for f in info.special_features])
-
-     support_class = "native" if info.arabic_support == "Native" else "supported" if info.arabic_support == "Supported" else "limited"
-
-     return f'''
-     <div class="info-card">
-         <div class="info-header">
-             <h3>{info.name}</h3>
-             <span class="org-badge">{info.organization}</span>
-         </div>
-
-         <p class="description">{info.description}</p>
-
-         <div class="info-grid">
-             <div class="info-item">
-                 <span class="info-label">Type:</span>
-                 <span class="info-value">{info.type.value}</span>
-             </div>
-             <div class="info-item">
-                 <span class="info-label">Algorithm:</span>
-                 <span class="info-value">{info.algorithm.value}</span>
-             </div>
-             <div class="info-item">
-                 <span class="info-label">Vocab Size:</span>
-                 <span class="info-value">{info.vocab_size:,}</span>
-             </div>
-             <div class="info-item">
-                 <span class="info-label">Arabic Support:</span>
-                 <span class="info-value support-{support_class}">{info.arabic_support}</span>
-             </div>
-         </div>
-
-         <div class="badge-container">
-             <div class="badge-group">
-                 <span class="badge-label">Dialects:</span>
-                 {dialect_badges}
-             </div>
-             <div class="badge-group">
-                 <span class="badge-label">Features:</span>
-                 {feature_badges}
-             </div>
- </div>
1290
- </div>
1291
- '''
1292
-
1293
- def analyze_single_tokenizer(tokenizer_choice: str, text: str) -> Tuple[str, str, str, str]:
1294
- """Analyze a single tokenizer"""
1295
-
1296
- if not text or not text.strip():
1297
- return (
1298
- '<div class="warning">⚠️ Please enter some text to analyze</div>',
1299
- '', '', ''
1300
- )
1301
-
1302
- if not tokenizer_choice:
1303
- return (
1304
- '<div class="warning">⚠️ Please select a tokenizer</div>',
1305
- '', '', ''
1306
- )
1307
-
1308
- model_id = tokenizer_manager.get_model_id_from_choice(tokenizer_choice)
1309
- tokenizer_info = tokenizer_manager.get_available_tokenizers().get(model_id)
1310
-
1311
- if not tokenizer_info:
1312
- return (
1313
- '<div class="error-card"><h4>Error</h4><p>Tokenizer not found</p></div>',
1314
- '', '', ''
1315
- )
1316
-
1317
- try:
1318
- metrics = analyze_tokenization(text, model_id, tokenizer_info)
1319
-
1320
- info_html = generate_tokenizer_info_card(tokenizer_info)
1321
- metrics_html = generate_metrics_card(metrics, tokenizer_info)
1322
- tokens_html = generate_token_visualization(metrics.tokens, metrics.token_ids)
1323
-
1324
- decoded_html = f'''
1325
- <div class="decoded-section">
1326
- <h4>Decoded Output</h4>
1327
- <div class="decoded-text" dir="auto">{metrics.decoded_text}</div>
1328
- <div class="decoded-meta">
1329
- Diacritics preserved: {'✅ Yes' if metrics.diacritic_preservation else '❌ No'}
1330
- </div>
1331
- </div>
1332
- '''
1333
-
1334
- return info_html, metrics_html, tokens_html, decoded_html
1335
-
1336
- except Exception as e:
1337
- return (
1338
- f'<div class="error-card"><h4>Error</h4><p>{str(e)}</p></div>',
1339
- '', '', ''
1340
- )
1341
-
1342
- def compare_tokenizers(tokenizer_choices: List[str], text: str) -> str:
1343
- """Compare multiple tokenizers"""
1344
-
1345
- if not text or not text.strip():
1346
- return '<div class="warning">⚠️ Please enter some text to analyze</div>'
1347
-
1348
- if not tokenizer_choices or len(tokenizer_choices) < 2:
1349
- return '<div class="warning">⚠️ Please select at least 2 tokenizers to compare</div>'
1350
-
1351
- results = []
1352
-
1353
- for choice in tokenizer_choices:
1354
- model_id = tokenizer_manager.get_model_id_from_choice(choice)
1355
- tokenizer_info = tokenizer_manager.get_available_tokenizers().get(model_id)
1356
-
1357
- if tokenizer_info:
1358
- try:
1359
- metrics = analyze_tokenization(text, model_id, tokenizer_info)
1360
- results.append({
1361
- 'name': tokenizer_info.name,
1362
- 'org': tokenizer_info.organization,
1363
- 'type': tokenizer_info.type.value,
1364
- 'metrics': metrics
1365
- })
1366
- except Exception as e:
1367
- results.append({
1368
- 'name': tokenizer_info.name,
1369
- 'org': tokenizer_info.organization,
1370
- 'type': tokenizer_info.type.value,
1371
- 'error': str(e)
1372
- })
1373
-
1374
- # Sort by fertility (lower is better)
1375
- results.sort(key=lambda x: x.get('metrics', TokenizationMetrics(
1376
- total_tokens=0, total_words=0, total_characters=0, total_bytes=0,
1377
- fertility=999, compression_ratio=0, char_per_token=0,
1378
- oov_count=0, oov_percentage=0, single_token_words=0,
1379
- single_token_retention_rate=0, avg_subwords_per_word=0,
1380
- max_subwords_per_word=0, continued_words_ratio=0,
1381
- arabic_char_count=0, arabic_token_count=0, arabic_fertility=0,
1382
- diacritic_preservation=False, tokenization_time_ms=0
1383
- )).fertility)
1384
-
1385
- # Generate comparison table
1386
- html = '''
1387
- <div class="comparison-container">
1388
- <table class="comparison-table">
1389
- <thead>
1390
- <tr>
1391
- <th>Rank</th>
1392
- <th>Tokenizer</th>
1393
- <th>Type</th>
1394
- <th>Tokens</th>
1395
- <th>Fertility ↓</th>
1396
- <th>Compression ↑</th>
1397
- <th>STRR ↑</th>
1398
- <th>OOV %</th>
1399
- </tr>
1400
- </thead>
1401
- <tbody>
1402
- '''
1403
-
1404
- for i, result in enumerate(results):
1405
- rank = i + 1
1406
- rank_class = 'rank-1' if rank == 1 else 'rank-2' if rank == 2 else 'rank-3' if rank == 3 else ''
1407
-
1408
- if 'error' in result:
1409
- html += f'''
1410
- <tr class="{rank_class}">
1411
- <td>#{rank}</td>
1412
- <td><strong>{result['name']}</strong><br><small>{result['org']}</small></td>
1413
- <td>{result['type']}</td>
1414
- <td colspan="5" class="error">Error: {result['error']}</td>
1415
- </tr>
1416
- '''
1417
- else:
1418
- m = result['metrics']
1419
- fertility_class = 'excellent' if m.fertility < 1.5 else 'good' if m.fertility < 2.5 else 'poor'
1420
-
1421
- html += f'''
1422
- <tr class="{rank_class}">
1423
- <td><strong>#{rank}</strong></td>
1424
- <td><strong>{result['name']}</strong><br><small>{result['org']}</small></td>
1425
- <td>{result['type']}</td>
1426
- <td>{m.total_tokens}</td>
1427
- <td class="{fertility_class}">{m.fertility:.3f}</td>
1428
- <td>{m.compression_ratio:.2f}</td>
1429
- <td>{m.single_token_retention_rate:.1%}</td>
1430
- <td>{m.oov_percentage:.1f}%</td>
1431
- </tr>
1432
- '''
1433
-
1434
- html += '''
1435
- </tbody>
1436
- </table>
1437
- </div>
1438
- '''
1439
-
1440
- return html
1441
-
1442
- # ============================================================================
1443
- # CUSTOM CSS
1444
- # ============================================================================
1445
-
1446
- CUSTOM_CSS = """
1447
- /* ===== ROOT VARIABLES ===== */
1448
- :root {
1449
- --primary: #1a5f2a;
1450
- --primary-light: #2d8f4e;
1451
- --secondary: #4a90d9;
1452
- --accent: #f59e0b;
1453
- --success: #10b981;
1454
- --warning: #f57c00;
1455
- --error: #c62828;
1456
- --bg-primary: #0f1419;
1457
- --bg-secondary: #1c2128;
1458
- --bg-card: #22272e;
1459
- --text-primary: #e6edf3;
1460
- --text-secondary: #8b949e;
1461
- --border: #30363d;
1462
- }
1463
-
1464
- /* ===== HEADER ===== */
1465
- .header-section {
1466
- text-align: center;
1467
- padding: 2rem 1rem;
1468
- background: linear-gradient(135deg, var(--primary) 0%, var(--primary-light) 100%);
1469
- border-radius: 16px;
1470
- margin-bottom: 1.5rem;
1471
- }
1472
-
1473
- .header-section h1 {
1474
- font-size: 2.5rem;
1475
- color: white;
1476
- margin-bottom: 0.5rem;
1477
- }
1478
-
1479
- .header-section p {
1480
- color: rgba(255,255,255,0.9);
1481
- font-size: 1.1rem;
1482
- }
1483
-
1484
- /* ===== INFO CARD ===== */
1485
- .info-card {
1486
- background: var(--bg-card);
1487
- border-radius: 12px;
1488
- padding: 1.5rem;
1489
- border: 1px solid var(--border);
1490
- }
1491
-
1492
- .info-header {
1493
- display: flex;
1494
- justify-content: space-between;
1495
- align-items: center;
1496
- margin-bottom: 1rem;
1497
- }
1498
-
1499
- .info-header h3 {
1500
- color: var(--text-primary);
1501
- margin: 0;
1502
- }
1503
-
1504
- .org-badge {
1505
- background: var(--primary);
1506
- color: white;
1507
- padding: 0.25rem 0.75rem;
1508
- border-radius: 20px;
1509
- font-size: 0.85rem;
1510
- }
1511
-
1512
- .description {
1513
- color: var(--text-secondary);
1514
- line-height: 1.6;
1515
- }
1516
-
1517
- .info-grid {
1518
- display: grid;
1519
- grid-template-columns: repeat(2, 1fr);
1520
- gap: 1rem;
1521
- margin: 1rem 0;
1522
- }
1523
-
1524
- .info-item {
1525
- display: flex;
1526
- flex-direction: column;
1527
- }
1528
-
1529
- .info-label {
1530
- color: var(--text-secondary);
1531
- font-size: 0.85rem;
1532
- }
1533
-
1534
- .info-value {
1535
- color: var(--text-primary);
1536
- font-weight: 600;
1537
- }
1538
-
1539
- .support-native { color: var(--success); }
1540
- .support-supported { color: var(--secondary); }
1541
- .support-limited { color: var(--warning); }
1542
-
1543
- /* ===== BADGES ===== */
1544
- .badge-container {
1545
- margin-top: 1rem;
1546
- }
1547
-
1548
- .badge-group {
1549
- margin-bottom: 0.5rem;
1550
- }
1551
-
1552
- .badge-label {
1553
- color: var(--text-secondary);
1554
- font-size: 0.85rem;
1555
- margin-right: 0.5rem;
1556
- }
1557
-
1558
- .badge {
1559
- display: inline-block;
1560
- padding: 0.2rem 0.5rem;
1561
- border-radius: 4px;
1562
- font-size: 0.75rem;
1563
- margin-right: 0.25rem;
1564
- margin-bottom: 0.25rem;
1565
- }
1566
-
1567
- .badge.dialect {
1568
- background: rgba(74, 144, 217, 0.2);
1569
- color: var(--secondary);
1570
- }
1571
-
1572
- .badge.feature {
1573
- background: rgba(245, 158, 11, 0.2);
1574
- color: var(--accent);
1575
- }
1576
-
1577
- /* ===== METRICS GRID ===== */
1578
- .metrics-grid {
1579
- display: grid;
1580
- grid-template-columns: repeat(auto-fit, minmax(150px, 1fr));
1581
- gap: 1rem;
1582
- margin: 1rem 0;
1583
- }
1584
-
1585
- .metric-card {
1586
- background: var(--bg-card);
1587
- border-radius: 12px;
1588
- padding: 1rem;
1589
- text-align: center;
1590
- border: 1px solid var(--border);
1591
- transition: transform 0.2s;
1592
- }
1593
-
1594
- .metric-card:hover {
1595
- transform: translateY(-2px);
1596
- }
1597
-
1598
- .metric-card.excellent {
1599
- border-color: var(--success);
1600
- background: linear-gradient(to bottom, rgba(16, 185, 129, 0.1), transparent);
1601
- }
1602
-
1603
- .metric-card.good {
1604
- border-color: var(--secondary);
1605
- background: linear-gradient(to bottom, rgba(74, 144, 217, 0.1), transparent);
1606
- }
1607
-
1608
- .metric-card.poor {
1609
- border-color: var(--error);
1610
- background: linear-gradient(to bottom, rgba(198, 40, 40, 0.1), transparent);
1611
- }
1612
-
1613
- .metric-card.primary {
1614
- border-color: var(--primary);
1615
- background: linear-gradient(to bottom, rgba(26, 95, 42, 0.1), transparent);
1616
- }
1617
-
1618
- .metric-icon {
1619
- font-size: 1.5rem;
1620
- margin-bottom: 0.5rem;
1621
- }
1622
-
1623
- .metric-value {
1624
- font-size: 1.5rem;
1625
- font-weight: 700;
1626
- color: var(--text-primary);
1627
- }
1628
-
1629
- .metric-label {
1630
- font-size: 0.8rem;
1631
- color: var(--text-secondary);
1632
- margin-top: 0.25rem;
1633
- }
1634
-
1635
- .metric-hint {
1636
- font-size: 0.7rem;
1637
- color: var(--text-secondary);
1638
- opacity: 0.7;
1639
- }
1640
-
1641
- /* ===== TOKEN VISUALIZATION ===== */
1642
- .token-container {
1643
- display: flex;
1644
- flex-wrap: wrap;
1645
- gap: 0.5rem;
1646
- padding: 1rem;
1647
- background: var(--bg-secondary);
1648
- border-radius: 12px;
1649
- direction: rtl;
1650
- }
1651
-
1652
- .token {
1653
- display: inline-flex;
1654
- flex-direction: column;
1655
- align-items: center;
1656
- padding: 0.5rem 0.75rem;
1657
- border-radius: 8px;
1658
- font-family: 'IBM Plex Sans Arabic', monospace;
1659
- font-size: 1rem;
1660
- transition: transform 0.2s;
1661
- cursor: default;
1662
- }
1663
-
1664
- .token:hover {
1665
- transform: scale(1.05);
1666
- }
1667
-
1668
- .token-id {
1669
- font-size: 0.65rem;
1670
- opacity: 0.7;
1671
- margin-top: 0.25rem;
1672
- }
1673
-
1674
- /* ===== DECODED SECTION ===== */
1675
- .decoded-section {
1676
- background: var(--bg-card);
1677
- border-radius: 12px;
1678
- padding: 1.5rem;
1679
- border: 1px solid var(--border);
1680
- }
1681
-
1682
- .decoded-section h4 {
1683
- color: var(--text-primary);
1684
- margin-bottom: 1rem;
1685
- }
1686
-
1687
- .decoded-text {
1688
- font-family: 'IBM Plex Sans Arabic', serif;
1689
- font-size: 1.1rem;
1690
- line-height: 1.8;
1691
- color: var(--text-primary);
1692
- }
1693
-
1694
- .decoded-meta {
1695
- margin-top: 1rem;
1696
- font-size: 0.85rem;
1697
- color: var(--text-secondary);
1698
- }
1699
-
1700
- /* ===== COMPARISON TABLE ===== */
1701
- .comparison-container {
1702
- overflow-x: auto;
1703
- }
1704
-
1705
- .comparison-table {
1706
- width: 100%;
1707
- border-collapse: collapse;
1708
- margin: 1rem 0;
1709
- }
1710
-
1711
- .comparison-table th {
1712
- background: var(--primary);
1713
- color: white;
1714
- padding: 0.75rem;
1715
- text-align: left;
1716
- font-weight: 600;
1717
- }
1718
-
1719
- .comparison-table td {
1720
- padding: 0.75rem;
1721
- border-bottom: 1px solid var(--border);
1722
- color: var(--text-primary);
1723
- }
1724
-
1725
- .comparison-table tr:hover {
1726
- background: rgba(74, 144, 217, 0.1);
1727
- }
1728
-
1729
- .comparison-table .rank-1 {
1730
- background: linear-gradient(90deg, rgba(255, 215, 0, 0.2), transparent);
1731
- }
1732
-
1733
- .comparison-table .rank-2 {
1734
- background: linear-gradient(90deg, rgba(192, 192, 192, 0.2), transparent);
1735
- }
1736
-
1737
- .comparison-table .rank-3 {
1738
- background: linear-gradient(90deg, rgba(205, 127, 50, 0.2), transparent);
1739
- }
1740
-
1741
- .comparison-table .excellent {
1742
- color: var(--success);
1743
- font-weight: 600;
1744
- }
1745
-
1746
- .comparison-table .good {
1747
- color: var(--secondary);
1748
- }
1749
-
1750
- .comparison-table .poor {
1751
- color: var(--error);
1752
- }
1753
-
1754
- /* ===== UTILITY CLASSES ===== */
1755
- .warning {
1756
- background: linear-gradient(to right, rgba(245, 124, 0, 0.1), transparent);
1757
- border-left: 4px solid var(--warning);
1758
- padding: 1rem;
1759
- border-radius: 0 8px 8px 0;
1760
- color: var(--text-primary);
1761
- }
1762
-
1763
- .error-card {
1764
- background: linear-gradient(to right, rgba(198, 40, 40, 0.1), transparent);
1765
- border-left: 4px solid var(--error);
1766
- padding: 1rem;
1767
- border-radius: 0 8px 8px 0;
1768
- }
1769
-
1770
- .error-card h4 {
1771
- color: var(--error);
1772
- margin-bottom: 0.5rem;
1773
- }
1774
-
1775
- .error-card p {
1776
- color: var(--text-secondary);
1777
- }
1778
- """
1779
-
1780
- # ============================================================================
1781
- # SAMPLE TEXTS FOR TESTING
1782
- # ============================================================================
1783
-
1784
- SAMPLE_TEXTS = {
1785
- "MSA News": "أعلنت وزارة التربية والتعليم عن بدء العام الدراسي الجديد في الأول من سبتمبر، حيث ستعود المدارس لاستقبال الطلاب بعد العطلة الصيفية الطويلة.",
1786
- "MSA Formal": "إن تطوير تقنيات الذكاء الاصطناعي يمثل نقلة نوعية في مجال معالجة اللغات الطبيعية، وخاصة فيما يتعلق باللغة العربية ذات الخصائص المورفولوجية الغنية.",
1787
- "Egyptian Dialect": "ازيك يا صاحبي؟ إيه أخبارك؟ عامل إيه النهارده؟ قولي هنروح فين بكره؟",
1788
- "Gulf Dialect": "شلونك؟ شخبارك؟ وش تسوي الحين؟ ودك تروح وياي للسوق؟",
1789
- "Levantine Dialect": "كيفك؟ شو أخبارك؟ شو عم تعمل هلق؟ بدك تيجي معي على السوق؟",
1790
- "Classical Arabic (Quran)": "بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ ۝ الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ",
1791
- "Poetry": "وما من كاتبٍ إلا سيفنى ويُبقي الدهرُ ما كتبت يداهُ",
1792
- "Technical": "يستخدم نموذج المحولات آلية الانتباه الذاتي لمعالجة تسلسلات النصوص بشكل متوازي.",
1793
- "Mixed Arabic-English": "The Arabic language العربية is a Semitic language with over 400 million speakers worldwide.",
1794
- "With Diacritics": "إِنَّ اللَّهَ وَمَلَائِكَتَهُ يُصَلُّونَ عَلَى النَّبِيِّ",
1795
- }
1796
-
1797
- # ============================================================================
1798
- # GRADIO INTERFACE
1799
- # ============================================================================
1800
 
1801
 """
+Arabic Tokenizer Arena Pro - Main Application
+==============================================
+Advanced research & production platform for Arabic tokenization analysis
+
+Run with: python app.py
 """
 
 import gradio as gr
 
+# Import modules
+from config import SAMPLE_TEXTS, LEADERBOARD_DATASETS
+from styles import CUSTOM_CSS
+from tokenizer_manager import tokenizer_manager
+from analysis import analyze_single_tokenizer, compare_tokenizers
+from leaderboard import run_leaderboard_evaluation
+from ui_components import generate_about_html
+
 
 def create_interface():
     """Create the Gradio interface"""
 
     available_tokenizers = tokenizer_manager.get_tokenizer_choices()
-
-    # Group tokenizers by type
-    arabic_specific = [t for t in available_tokenizers if any(x in t for x in ['AraBERT', 'CAMeL', 'MARBERT', 'ARBERT', 'Aranizer'])]
-    arabic_llms = [t for t in available_tokenizers if any(x in t for x in ['Jais', 'AceGPT', 'ALLaM', 'SILMA', 'Fanar', 'Yehia', 'Atlas'])]
-    multilingual = [t for t in available_tokenizers if t not in arabic_specific and t not in arabic_llms]
-
-    with gr.Blocks(css=CUSTOM_CSS, title="Arabic Tokenizer Arena Pro", theme=gr.themes.Base(
-        primary_hue="green",
-        secondary_hue="blue",
-        neutral_hue="slate",
-        font=["IBM Plex Sans Arabic", "system-ui", "sans-serif"]
-    )) as demo:
+    tokenizers_by_type = tokenizer_manager.get_tokenizers_by_type()
+
+    with gr.Blocks(
+        css=CUSTOM_CSS,
+        title="Arabic Tokenizer Arena Pro",
+        theme=gr.themes.Base(
+            primary_hue="green",
+            secondary_hue="blue",
+            neutral_hue="slate",
+            font=["IBM Plex Sans Arabic", "system-ui", "sans-serif"]
+        )
+    ) as demo:
 
         # Header
         gr.HTML("""
@@ -1909,7 +128,7 @@ def create_interface():
             outputs=[comparison_output]
         )
 
-        # ===== TAB 3: LEADERBOARD - Real HF Datasets =====
+        # ===== TAB 3: LEADERBOARD =====
         with gr.TabItem("🏆 Leaderboard", id="leaderboard"):
             gr.Markdown("""
             ## 🏆 Arabic Tokenizer Leaderboard
@@ -1960,16 +179,16 @@
             ---
             ### 📖 Dataset Sources (from HuggingFace)
 
-            | Dataset | HuggingFace ID | Category | Description |
-            |---------|----------------|----------|-------------|
-            | ArabicMMLU | `MBZUAI/ArabicMMLU` | Benchmark | Multi-task exam questions (14,575 MCQs) |
-            | ArSenTD-LEV | `ramybaly/arsentd_lev` | Dialectal | Levantine tweets |
-            | ATHAR | `mohamed-khalil/ATHAR` | Classical | 66K classical Arabic sentences |
-            | ARCD | `arcd` | QA | Arabic Reading Comprehension |
-            | Ashaar | `arbml/Ashaar_dataset` | Poetry | 2M+ Arabic poetry verses |
-            | Hadith | `gurgutan/sunnah_ar_en_dataset` | Religious | 50,762 hadiths |
-            | Arabic Sentiment | `arbml/Arabic_Sentiment_Twitter_Corpus` | Social Media | Twitter sentiment |
-            | SANAD | `arbml/SANAD` | News | Arabic news articles |
+            | Dataset | HuggingFace ID | Category | Samples |
+            |---------|----------------|----------|---------|
+            | ArabicMMLU | `MBZUAI/ArabicMMLU` | MSA Benchmark | 500 |
+            | ArSenTD-LEV | `ramybaly/arsentd_lev` | Levantine Dialect | 500 |
+            | ATHAR | `mohamed-khalil/ATHAR` | Classical Arabic | 500 |
+            | ARCD | `arcd` | QA Dataset | 300 |
+            | Ashaar | `arbml/Ashaar_dataset` | Poetry | 500 |
+            | Hadith | `gurgutan/sunnah_ar_en_dataset` | Religious | 400 |
+            | Arabic Sentiment | `arbml/Arabic_Sentiment_Twitter_Corpus` | Social Media | 500 |
+            | SANAD | `arbml/SANAD` | News | 400 |
             """)
 
         # ===== TAB 4: Metrics Reference =====
@@ -2000,6 +219,17 @@
            | **Arabic Fertility** | Tokens per Arabic word | Arabic-specific efficiency measure |
            | **Diacritic Preservation** | Whether tashkeel is preserved | Important for religious & educational texts |
 
+            ### Scoring Formula (Leaderboard)
+
+            ```
+            Score = ((Fertility Score × 0.45) + (Compression Score × 0.35) + (UNK Score × 0.20)) × 100
+            ```
+
+            Where:
+            - **Fertility Score** = 2.0 / fertility (capped to 0-1; lower fertility gives a higher score)
+            - **Compression Score** = compression / 6 (capped to 0-1)
+            - **UNK Score** = 1 - (unk_ratio × 20) (capped to 0-1; a lower UNK ratio gives a higher score)
+
            ### Research Background
 
            These metrics are based on recent research including:
@@ -2011,50 +241,19 @@
 
        # ===== TAB 5: About =====
        with gr.TabItem("ℹ️ About", id="about"):
-            gr.Markdown(f"""
-            ## Arabic Tokenizer Arena Pro
-
-            A comprehensive platform for evaluating Arabic tokenizers across multiple dimensions.
-
-            ### Available Tokenizers: {len(available_tokenizers)}
-
-            **Arabic-Specific Models:**
-            {chr(10).join(['- ' + t for t in arabic_specific[:10]])}
-
-            **Arabic LLMs:**
-            {chr(10).join(['- ' + t for t in arabic_llms[:10]])}
-
-            **Multilingual LLMs:**
-            {chr(10).join(['- ' + t for t in multilingual[:10]])}
-
-            ### Features
-
-            ✅ Comprehensive efficiency metrics (fertility, compression, STRR)
-            ✅ Arabic-specific analysis (dialect support, diacritic preservation)
-            ✅ Side-by-side tokenizer comparison
-            ✅ Beautiful token visualization
-            ✅ **NEW: Leaderboard with real HuggingFace datasets**
-            ✅ Support for MSA, dialectal Arabic, and Classical Arabic
-            ✅ Research-backed evaluation methodology
-
-            ### Use Cases
-
-            - **Research**: Compare tokenizers for Arabic NLP experiments
-            - **Production**: Select optimal tokenizer for deployment
-            - **Education**: Understand how different algorithms handle Arabic
-            - **Optimization**: Identify cost-efficient tokenizers for API usage
-
-            ---
-
-            Built with ❤️ for the Arabic NLP community
-            """)
+            about_html = generate_about_html(
+                tokenizers_by_type,
+                len(available_tokenizers)
+            )
+            gr.HTML(about_html)
 
     return demo
 
+
 # ============================================================================
 # MAIN
 # ============================================================================
 
 if __name__ == "__main__":
     demo = create_interface()
-    demo.launch()
+    demo.launch()
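
As a quick sanity check of the Scoring Formula documented in the Metrics Reference tab above, here is a worked example (a minimal sketch that mirrors `calculate_leaderboard_score` in `leaderboard.py` below; the input numbers are illustrative, not measured results):

```python
# Illustrative inputs: fertility in tokens/word, compression in bytes/token.
fertility, compression, unk_ratio = 1.6, 4.2, 0.001

fertility_score = max(0, min(1, 2.0 / fertility))   # 2.0 / 1.6 = 1.25 -> capped at 1.0
compression_score = min(1, compression / 6)         # 4.2 / 6 = 0.70
unk_score = 1 - min(1, unk_ratio * 20)              # 1 - 0.02 = 0.98

score = (fertility_score * 0.45 + compression_score * 0.35 + unk_score * 0.20) * 100
print(round(score, 1))  # 89.1
```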
config.py ADDED
@@ -0,0 +1,551 @@
+"""
+Configuration for Arabic Tokenizer Arena
+=========================================
+Tokenizer registry, dataset configs, and sample texts
+"""
+
+from dataclasses import dataclass, field
+from typing import List, Dict
+from enum import Enum
+
+
+class TokenizerType(Enum):
+    ARABIC_SPECIFIC = "Arabic-Specific"
+    MULTILINGUAL_LLM = "Multilingual LLM"
+    ARABIC_LLM = "Arabic LLM"
+    ENCODER_ONLY = "Encoder-Only (BERT)"
+    DECODER_ONLY = "Decoder-Only (GPT)"
+
+
+class TokenizerAlgorithm(Enum):
+    BPE = "Byte-Pair Encoding (BPE)"
+    BBPE = "Byte-Level BPE"
+    WORDPIECE = "WordPiece"
+    SENTENCEPIECE = "SentencePiece"
+    UNIGRAM = "Unigram"
+    TIKTOKEN = "Tiktoken"
+
+
+@dataclass
+class TokenizerInfo:
+    """Metadata about a tokenizer"""
+    name: str
+    model_id: str
+    type: TokenizerType
+    algorithm: TokenizerAlgorithm
+    vocab_size: int
+    description: str
+    organization: str
+    arabic_support: str  # Native, Adapted, Supported, or Limited
+    dialect_support: List[str] = field(default_factory=list)
+    special_features: List[str] = field(default_factory=list)
+
+
+@dataclass
+class TokenizationMetrics:
+    """Comprehensive tokenization evaluation metrics"""
+    total_tokens: int
+    total_words: int
+    total_characters: int
+    total_bytes: int
+    fertility: float
+    compression_ratio: float
+    char_per_token: float
+    oov_count: int
+    oov_percentage: float
+    single_token_words: int
+    single_token_retention_rate: float
+    avg_subwords_per_word: float
+    max_subwords_per_word: int
+    continued_words_ratio: float
+    arabic_char_count: int
+    arabic_token_count: int
+    arabic_fertility: float
+    diacritic_preservation: bool
+    tokenization_time_ms: float
+    tokens: List[str] = field(default_factory=list)
+    token_ids: List[int] = field(default_factory=list)
+    decoded_text: str = ""
+
+
+# ============================================================================
+# TOKENIZER REGISTRY
+# ============================================================================
+
+TOKENIZER_REGISTRY: Dict[str, TokenizerInfo] = {
+    # ========== ARABIC-SPECIFIC BERT MODELS ==========
+    "aubmindlab/bert-base-arabertv2": TokenizerInfo(
+        name="AraBERT v2",
+        model_id="aubmindlab/bert-base-arabertv2",
+        type=TokenizerType.ENCODER_ONLY,
+        algorithm=TokenizerAlgorithm.WORDPIECE,
+        vocab_size=64000,
+        description="Arabic BERT with Farasa segmentation, optimized for MSA",
+        organization="AUB MIND Lab",
+        arabic_support="Native",
+        dialect_support=["MSA"],
+        special_features=["Farasa preprocessing", "Morphological segmentation"]
+    ),
+    "aubmindlab/bert-large-arabertv2": TokenizerInfo(
+        name="AraBERT v2 Large",
+        model_id="aubmindlab/bert-large-arabertv2",
+        type=TokenizerType.ENCODER_ONLY,
+        algorithm=TokenizerAlgorithm.WORDPIECE,
+        vocab_size=64000,
+        description="Large Arabic BERT with enhanced capacity",
+        organization="AUB MIND Lab",
+        arabic_support="Native",
+        dialect_support=["MSA"],
+        special_features=["Large model", "Farasa preprocessing"]
+    ),
+    "CAMeL-Lab/bert-base-arabic-camelbert-mix": TokenizerInfo(
+        name="CAMeLBERT Mix",
+        model_id="CAMeL-Lab/bert-base-arabic-camelbert-mix",
+        type=TokenizerType.ENCODER_ONLY,
+        algorithm=TokenizerAlgorithm.WORDPIECE,
+        vocab_size=30000,
+        description="Pre-trained on MSA, DA, and Classical Arabic mix",
+        organization="CAMeL Lab NYU Abu Dhabi",
+        arabic_support="Native",
+        dialect_support=["MSA", "DA", "CA"],
+        special_features=["Multi-variant Arabic", "Classical Arabic support"]
+    ),
+    "CAMeL-Lab/bert-base-arabic-camelbert-msa": TokenizerInfo(
+        name="CAMeLBERT MSA",
+        model_id="CAMeL-Lab/bert-base-arabic-camelbert-msa",
+        type=TokenizerType.ENCODER_ONLY,
+        algorithm=TokenizerAlgorithm.WORDPIECE,
+        vocab_size=30000,
+        description="Specialized for Modern Standard Arabic",
+        organization="CAMeL Lab NYU Abu Dhabi",
+        arabic_support="Native",
+        dialect_support=["MSA"],
+        special_features=["MSA optimized"]
+    ),
+    "CAMeL-Lab/bert-base-arabic-camelbert-da": TokenizerInfo(
+        name="CAMeLBERT DA",
+        model_id="CAMeL-Lab/bert-base-arabic-camelbert-da",
+        type=TokenizerType.ENCODER_ONLY,
+        algorithm=TokenizerAlgorithm.WORDPIECE,
+        vocab_size=30000,
+        description="Specialized for Dialectal Arabic",
+        organization="CAMeL Lab NYU Abu Dhabi",
+        arabic_support="Native",
+        dialect_support=["Egyptian", "Gulf", "Levantine", "Maghrebi"],
+        special_features=["Dialect optimized"]
+    ),
+    "CAMeL-Lab/bert-base-arabic-camelbert-ca": TokenizerInfo(
+        name="CAMeLBERT CA",
+        model_id="CAMeL-Lab/bert-base-arabic-camelbert-ca",
+        type=TokenizerType.ENCODER_ONLY,
+        algorithm=TokenizerAlgorithm.WORDPIECE,
+        vocab_size=30000,
+        description="Specialized for Classical Arabic",
+        organization="CAMeL Lab NYU Abu Dhabi",
+        arabic_support="Native",
+        dialect_support=["Classical"],
+        special_features=["Classical Arabic", "Religious texts"]
+    ),
+    "UBC-NLP/MARBERT": TokenizerInfo(
+        name="MARBERT",
+        model_id="UBC-NLP/MARBERT",
+        type=TokenizerType.ENCODER_ONLY,
+        algorithm=TokenizerAlgorithm.WORDPIECE,
+        vocab_size=100000,
+        description="Multi-dialectal Arabic BERT trained on Twitter data",
+        organization="UBC NLP",
+        arabic_support="Native",
+        dialect_support=["MSA", "Egyptian", "Gulf", "Levantine", "Maghrebi"],
+        special_features=["Twitter data", "100K vocabulary", "Multi-dialect"]
+    ),
+    "UBC-NLP/ARBERT": TokenizerInfo(
+        name="ARBERT",
+        model_id="UBC-NLP/ARBERT",
+        type=TokenizerType.ENCODER_ONLY,
+        algorithm=TokenizerAlgorithm.WORDPIECE,
+        vocab_size=100000,
+        description="Arabic BERT focused on MSA with large vocabulary",
+        organization="UBC NLP",
+        arabic_support="Native",
+        dialect_support=["MSA"],
+        special_features=["100K vocabulary", "MSA focused"]
+    ),
+    "asafaya/bert-base-arabic": TokenizerInfo(
+        name="Arabic BERT (Safaya)",
+        model_id="asafaya/bert-base-arabic",
+        type=TokenizerType.ENCODER_ONLY,
+        algorithm=TokenizerAlgorithm.WORDPIECE,
+        vocab_size=32000,
+        description="Arabic BERT trained on MSA and dialectal Arabic",
+        organization="Safaya",
+        arabic_support="Native",
+        dialect_support=["MSA", "DA"],
+        special_features=["TPU trained", "Dialect support"]
+    ),
+
+    # ========== ARABIC-SPECIFIC TOKENIZERS ==========
+    "riotu-lab/Aranizer-PBE-86k": TokenizerInfo(
+        name="Aranizer PBE 86K",
+        model_id="riotu-lab/Aranizer-PBE-86k",
+        type=TokenizerType.ARABIC_SPECIFIC,
+        algorithm=TokenizerAlgorithm.BPE,
+        vocab_size=86000,
+        description="Pair Byte Encoding tokenizer optimized for Arabic LLMs",
+        organization="RIOTU Lab",
+        arabic_support="Native",
+        dialect_support=["MSA"],
+        special_features=["Low fertility", "LLM optimized", "86K vocab"]
+    ),
+    "riotu-lab/Aranizer-SP-86k": TokenizerInfo(
+        name="Aranizer SP 86K",
+        model_id="riotu-lab/Aranizer-SP-86k",
+        type=TokenizerType.ARABIC_SPECIFIC,
+        algorithm=TokenizerAlgorithm.SENTENCEPIECE,
+        vocab_size=86000,
+        description="SentencePiece tokenizer optimized for Arabic",
+        organization="RIOTU Lab",
+        arabic_support="Native",
+        dialect_support=["MSA"],
+        special_features=["Low fertility", "SentencePiece", "86K vocab"]
+    ),
+    "riotu-lab/Aranizer-PBE-32k": TokenizerInfo(
+        name="Aranizer PBE 32K",
+        model_id="riotu-lab/Aranizer-PBE-32k",
+        type=TokenizerType.ARABIC_SPECIFIC,
+        algorithm=TokenizerAlgorithm.BPE,
+        vocab_size=32000,
+        description="Compact PBE tokenizer for Arabic",
+        organization="RIOTU Lab",
+        arabic_support="Native",
+        dialect_support=["MSA"],
+        special_features=["Compact", "LLM compatible"]
+    ),
+    "riotu-lab/Aranizer-SP-32k": TokenizerInfo(
+        name="Aranizer SP 32K",
+        model_id="riotu-lab/Aranizer-SP-32k",
+        type=TokenizerType.ARABIC_SPECIFIC,
+        algorithm=TokenizerAlgorithm.SENTENCEPIECE,
+        vocab_size=32000,
+        description="Compact SentencePiece tokenizer for Arabic",
+        organization="RIOTU Lab",
+        arabic_support="Native",
+        dialect_support=["MSA"],
+        special_features=["Compact", "Efficient"]
+    ),
+
+    # ========== ARABIC LLMs ==========
+    "inception-mbzuai/jais-13b": TokenizerInfo(
+        name="Jais 13B",
+        model_id="inception-mbzuai/jais-13b",
+        type=TokenizerType.ARABIC_LLM,
+        algorithm=TokenizerAlgorithm.SENTENCEPIECE,
+        vocab_size=84992,
+        description="World's most advanced Arabic LLM, trained from scratch",
+        organization="Inception/MBZUAI",
+        arabic_support="Native",
+        dialect_support=["MSA", "Gulf", "Egyptian", "Levantine"],
+        special_features=["Arabic-first", "Lowest fertility", "UAE-native"]
+    ),
+    "inceptionai/jais-family-30b-8k-chat": TokenizerInfo(
+        name="Jais 30B Chat",
+        model_id="inceptionai/jais-family-30b-8k-chat",
+        type=TokenizerType.ARABIC_LLM,
+        algorithm=TokenizerAlgorithm.SENTENCEPIECE,
+        vocab_size=84992,
+        description="Enhanced 30B version with chat capabilities",
+        organization="Inception AI",
+        arabic_support="Native",
+        dialect_support=["MSA", "Gulf", "Egyptian", "Levantine"],
+        special_features=["30B parameters", "Chat optimized", "8K context"]
+    ),
+    "FreedomIntelligence/AceGPT-13B-chat": TokenizerInfo(
+        name="AceGPT 13B Chat",
+        model_id="FreedomIntelligence/AceGPT-13B-chat",
+        type=TokenizerType.ARABIC_LLM,
+        algorithm=TokenizerAlgorithm.SENTENCEPIECE,
+        vocab_size=32000,
+        description="Arabic-enhanced LLaMA with cultural alignment and chat",
+        organization="Freedom Intelligence",
+        arabic_support="Adapted",
+        dialect_support=["MSA"],
+        special_features=["LLaMA-based", "Cultural alignment", "RLHF", "Chat"]
+    ),
+    "silma-ai/SILMA-9B-Instruct-v1.0": TokenizerInfo(
+        name="SILMA 9B Instruct",
+        model_id="silma-ai/SILMA-9B-Instruct-v1.0",
+        type=TokenizerType.ARABIC_LLM,
+        algorithm=TokenizerAlgorithm.SENTENCEPIECE,
+        vocab_size=256000,
+        description="Top-ranked Arabic LLM based on Gemma, outperforms larger models",
+        organization="SILMA AI",
+        arabic_support="Native",
+        dialect_support=["MSA", "Gulf", "Egyptian", "Levantine"],
+        special_features=["Gemma-based", "SOTA 9B class", "Efficient"]
+    ),
+    "silma-ai/SILMA-Kashif-2B-Instruct-v1.0": TokenizerInfo(
+        name="SILMA Kashif 2B (RAG)",
+        model_id="silma-ai/SILMA-Kashif-2B-Instruct-v1.0",
+        type=TokenizerType.ARABIC_LLM,
+        algorithm=TokenizerAlgorithm.SENTENCEPIECE,
+        vocab_size=256000,
+        description="RAG-optimized Arabic model, excellent for context-based QA",
+        organization="SILMA AI",
+        arabic_support="Native",
+        dialect_support=["MSA"],
+        special_features=["RAG optimized", "12K context", "Compact"]
+    ),
+    "QCRI/Fanar-1-9B-Instruct": TokenizerInfo(
+        name="Fanar 9B Instruct",
+        model_id="QCRI/Fanar-1-9B-Instruct",
+        type=TokenizerType.ARABIC_LLM,
+        algorithm=TokenizerAlgorithm.SENTENCEPIECE,
+        vocab_size=256000,
+        description="Qatar's Arabic LLM aligned with Islamic values and Arab culture",
+        organization="QCRI (Qatar)",
+        arabic_support="Native",
+        dialect_support=["MSA", "Gulf", "Egyptian", "Levantine"],
+        special_features=["Islamic RAG", "Cultural alignment", "Gemma-based"]
+    ),
+    "stabilityai/ar-stablelm-2-chat": TokenizerInfo(
+        name="Arabic StableLM 2 Chat",
+        model_id="stabilityai/ar-stablelm-2-chat",
+        type=TokenizerType.ARABIC_LLM,
+        algorithm=TokenizerAlgorithm.BPE,
+        vocab_size=100289,
+        description="Stability AI's Arabic instruction-tuned 1.6B model",
+        organization="Stability AI",
+        arabic_support="Native",
+        dialect_support=["MSA"],
+        special_features=["Compact 1.6B", "Chat optimized", "Efficient"]
+    ),
+    "Navid-AI/Yehia-7B-preview": TokenizerInfo(
+        name="Yehia 7B Preview",
+        model_id="Navid-AI/Yehia-7B-preview",
+        type=TokenizerType.ARABIC_LLM,
+        algorithm=TokenizerAlgorithm.BPE,
+        vocab_size=128256,
+        description="Best Arabic model on AraGen-Leaderboard (0.5B-25B), GRPO trained",
+        organization="Navid AI",
+        arabic_support="Native",
+        dialect_support=["MSA", "Gulf", "Egyptian", "Levantine"],
+        special_features=["GRPO trained", "3C3H aligned", "SOTA AraGen"]
+    ),
+
+    # ========== DIALECT-SPECIFIC MODELS ==========
+    "MBZUAI-Paris/Atlas-Chat-9B": TokenizerInfo(
+        name="Atlas-Chat 9B (Darija)",
+        model_id="MBZUAI-Paris/Atlas-Chat-9B",
+        type=TokenizerType.ARABIC_LLM,
+        algorithm=TokenizerAlgorithm.SENTENCEPIECE,
+        vocab_size=256000,
+        description="First LLM for Moroccan Arabic (Darija), Gemma-based",
+        organization="MBZUAI Paris",
+        arabic_support="Native",
+        dialect_support=["Darija", "MSA"],
+        special_features=["Moroccan dialect", "Transliteration", "Cultural"]
+    ),
+    "MBZUAI-Paris/Atlas-Chat-2B": TokenizerInfo(
+        name="Atlas-Chat 2B (Darija)",
+        model_id="MBZUAI-Paris/Atlas-Chat-2B",
+        type=TokenizerType.ARABIC_LLM,
+        algorithm=TokenizerAlgorithm.SENTENCEPIECE,
+        vocab_size=256000,
+        description="Compact Moroccan Arabic model for edge deployment",
+        organization="MBZUAI Paris",
+        arabic_support="Native",
+        dialect_support=["Darija", "MSA"],
+        special_features=["Compact", "Moroccan dialect", "Edge-ready"]
+    ),
+
+    # ========== MULTILINGUAL LLMs ==========
+    "Qwen/Qwen2.5-7B": TokenizerInfo(
+        name="Qwen 2.5 7B",
+        model_id="Qwen/Qwen2.5-7B",
+        type=TokenizerType.MULTILINGUAL_LLM,
+        algorithm=TokenizerAlgorithm.BPE,
+        vocab_size=151936,
+        description="Alibaba's multilingual LLM with 30+ language support",
+        organization="Alibaba Qwen",
+        arabic_support="Supported",
+        dialect_support=["MSA"],
+        special_features=["152K vocab", "128K context", "30+ languages"]
+    ),
+    "google/gemma-2-9b": TokenizerInfo(
+        name="Gemma 2 9B",
+        model_id="google/gemma-2-9b",
+        type=TokenizerType.MULTILINGUAL_LLM,
+        algorithm=TokenizerAlgorithm.SENTENCEPIECE,
+        vocab_size=256000,
+        description="Google's efficient multilingual model",
+        organization="Google",
+        arabic_support="Supported",
+        dialect_support=["MSA"],
+        special_features=["256K vocab", "Efficient architecture"]
+    ),
+    "mistralai/Mistral-7B-v0.3": TokenizerInfo(
+        name="Mistral 7B v0.3",
+        model_id="mistralai/Mistral-7B-v0.3",
+        type=TokenizerType.MULTILINGUAL_LLM,
+        algorithm=TokenizerAlgorithm.SENTENCEPIECE,
+        vocab_size=32768,
+        description="Efficient multilingual model with sliding window attention",
+        organization="Mistral AI",
+        arabic_support="Limited",
+        dialect_support=["MSA"],
+        special_features=["Sliding window", "Efficient"]
+    ),
+    "mistralai/Mistral-Nemo-Base-2407": TokenizerInfo(
+        name="Mistral Nemo",
+        model_id="mistralai/Mistral-Nemo-Base-2407",
+        type=TokenizerType.MULTILINGUAL_LLM,
+        algorithm=TokenizerAlgorithm.TIKTOKEN,
+        vocab_size=131072,
+        description="Uses Tekken tokenizer, optimized for multilingual",
+        organization="Mistral AI + NVIDIA",
+        arabic_support="Supported",
+        dialect_support=["MSA"],
+        special_features=["Tekken tokenizer", "131K vocab", "Multilingual optimized"]
+    ),
+    "xlm-roberta-base": TokenizerInfo(
+        name="XLM-RoBERTa Base",
+        model_id="xlm-roberta-base",
+        type=TokenizerType.MULTILINGUAL_LLM,
+        algorithm=TokenizerAlgorithm.SENTENCEPIECE,
+        vocab_size=250002,
+        description="Cross-lingual model covering 100 languages",
+        organization="Facebook AI",
+        arabic_support="Supported",
+        dialect_support=["MSA"],
+        special_features=["250K vocab", "100 languages"]
+    ),
+    "bert-base-multilingual-cased": TokenizerInfo(
+        name="mBERT",
+        model_id="bert-base-multilingual-cased",
+        type=TokenizerType.MULTILINGUAL_LLM,
+        algorithm=TokenizerAlgorithm.WORDPIECE,
+        vocab_size=119547,
+        description="Original multilingual BERT, baseline for comparison",
+        organization="Google",
+        arabic_support="Limited",
+        dialect_support=["MSA"],
+        special_features=["Baseline model", "104 languages"]
+    ),
+    "tiiuae/falcon-7b": TokenizerInfo(
+        name="Falcon 7B",
+        model_id="tiiuae/falcon-7b",
+        type=TokenizerType.MULTILINGUAL_LLM,
+        algorithm=TokenizerAlgorithm.BPE,
+        vocab_size=65024,
+        description="TII's powerful open-source LLM",
+        organization="Technology Innovation Institute",
+        arabic_support="Limited",
+        dialect_support=["MSA"],
+        special_features=["65K vocab", "RefinedWeb trained"]
+    ),
+}
+
+
+# ============================================================================
+# LEADERBOARD DATASETS - Real HuggingFace Datasets
+# ============================================================================
+
+LEADERBOARD_DATASETS = {
+    "arabic_mmlu": {
+        "hf_id": "MBZUAI/ArabicMMLU",
+        "name": "ArabicMMLU",
+        "category": "MSA Benchmark",
+        "text_column": "Question",
+        "split": "test",
+        "subset": None,
+        "samples": 500,
+        "description": "Multi-task benchmark from Arab school exams (14,575 MCQs)"
+    },
+    "arsentd_lev": {
+        "hf_id": "ramybaly/arsentd_lev",
+        "name": "ArSenTD-LEV",
+        "category": "Levantine Dialect",
+        "text_column": "Tweet",
+        "split": "train",
+        "subset": None,
+        "samples": 500,
+        "description": "Levantine Arabic tweets (Jordan, Lebanon, Syria, Palestine)"
+    },
+    "athar": {
+        "hf_id": "mohamed-khalil/ATHAR",
+        "name": "ATHAR Classical",
+        "category": "Classical Arabic",
+        "text_column": "arabic",
+        "split": "train",
+        "subset": None,
+        "samples": 500,
+        "description": "66K classical Arabic sentences with translations"
+    },
+    "arcd": {
+        "hf_id": "arcd",
+        "name": "ARCD",
+        "category": "QA Dataset",
+        "text_column": "context",
+        "split": "train",
+        "subset": None,
+        "samples": 300,
+        "description": "Arabic Reading Comprehension Dataset (1,395 questions)"
+    },
+    "ashaar": {
+        "hf_id": "arbml/Ashaar_dataset",
+        "name": "Ashaar Poetry",
+        "category": "Poetry",
+        "text_column": "poem_text",
+        "split": "train",
+        "subset": None,
+        "samples": 500,
+        "description": "2M+ Arabic poetry verses with meter and theme labels"
+    },
+    "hadith": {
+        "hf_id": "gurgutan/sunnah_ar_en_dataset",
+        "name": "Hadith Collection",
+        "category": "Religious",
+        "text_column": "hadith_text_ar",
+        "split": "train",
+        "subset": None,
+        "samples": 400,
+        "description": "50,762 hadiths from 14 authentic books"
+    },
+    "arabic_sentiment": {
+        "hf_id": "arbml/Arabic_Sentiment_Twitter_Corpus",
+        "name": "Arabic Sentiment",
+        "category": "Social Media",
+        "text_column": "text",
+        "split": "train",
+        "subset": None,
+        "samples": 500,
+        "description": "Arabic Twitter sentiment corpus"
+    },
+    "sanad": {
+        "hf_id": "arbml/SANAD",
+        "name": "SANAD News",
+        "category": "News",
+        "text_column": "text",
+        "split": "train",
+        "subset": "alarabiya",
+        "samples": 400,
+        "description": "Arabic news articles from Al Arabiya"
+    },
+}
+
+
+# ============================================================================
+# SAMPLE TEXTS
+# ============================================================================
+
+SAMPLE_TEXTS = {
+    "MSA News": "أعلنت وزارة التربية والتعليم عن بدء العام الدراسي الجديد في الأول من سبتمبر، حيث ستعود المدارس لاستقبال الطلاب بعد العطلة الصيفية الطويلة.",
+    "MSA Formal": "إن تطوير تقنيات الذكاء الاصطناعي يمثل نقلة نوعية في مجال معالجة اللغات الطبيعية، وخاصة فيما يتعلق باللغة العربية ذات الخصائص المورفولوجية الغنية.",
+    "Egyptian Dialect": "ازيك يا صاحبي؟ إيه أخبارك؟ عامل إيه النهارده؟ قولي هنروح فين بكره؟",
+    "Gulf Dialect": "شلونك؟ شخبارك؟ وش تسوي الحين؟ ودك تروح وياي للسوق؟",
+    "Levantine Dialect": "كيفك؟ شو أخبارك؟ شو عم تعمل هلق؟ بدك تيجي معي على السوق؟",
+    "Classical Arabic (Quran)": "بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ ۝ الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ",
+    "Poetry": "وما من كاتبٍ إلا سيفنى ويُبقي الدهرُ ما كتبت يداهُ",
+    "Technical": "يستخدم نموذج المحولات آلية الانتباه الذاتي لمعالجة تسلسلات النصوص بشكل متوازي.",
+    "Mixed Arabic-English": "The Arabic language العربية is a Semitic language with over 400 million speakers worldwide.",
+    "With Diacritics": "إِنَّ اللَّهَ وَمَلَائِكَتَهُ يُصَلُّونَ عَلَى النَّبِيِّ",
+}
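
The registry and sample texts above are enough to reproduce the core fertility/compression measurements by hand. A minimal sketch (assumes `transformers` is installed and the model files are downloadable; AraBERT v2 is just one example registry entry):

```python
from transformers import AutoTokenizer
from config import SAMPLE_TEXTS, TOKENIZER_REGISTRY

info = TOKENIZER_REGISTRY["aubmindlab/bert-base-arabertv2"]
tok = AutoTokenizer.from_pretrained(info.model_id)

text = SAMPLE_TEXTS["MSA News"]
ids = tok.encode(text, add_special_tokens=False)

fertility = len(ids) / len(text.split())            # tokens per word, lower is better
compression = len(text.encode("utf-8")) / len(ids)  # bytes per token, higher is better
print(f"{info.name}: fertility={fertility:.3f}, compression={compression:.2f}")
```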
leaderboard.py ADDED
@@ -0,0 +1,449 @@
+"""
+Leaderboard Module
+==================
+Evaluate tokenizers on real HuggingFace Arabic datasets
+"""
+
+import statistics
+from typing import Dict, List, Tuple, Optional
+from collections import defaultdict
+import gradio as gr
+from datasets import load_dataset
+from transformers import AutoTokenizer
+
+from config import LEADERBOARD_DATASETS
+from tokenizer_manager import tokenizer_manager
+
+
+class HFDatasetLoader:
+    """Load Arabic datasets from HuggingFace"""
+
+    def __init__(self):
+        self.cache = {}
+
+    def load_dataset_texts(self, dataset_key: str) -> Tuple[List[str], str]:
+        """Load texts from a HuggingFace dataset"""
+
+        if dataset_key in self.cache:
+            return self.cache[dataset_key], f"✅ Loaded {len(self.cache[dataset_key])} samples (cached)"
+
+        config = LEADERBOARD_DATASETS.get(dataset_key)
+        if not config:
+            return [], f"❌ Unknown dataset: {dataset_key}"
+
+        try:
+            # Load dataset from HuggingFace
+            if config.get("subset"):
+                ds = load_dataset(
+                    config["hf_id"],
+                    config["subset"],
+                    split=config["split"],
+                    trust_remote_code=True
+                )
+            else:
+                ds = load_dataset(
+                    config["hf_id"],
+                    split=config["split"],
+                    trust_remote_code=True
+                )
+
+            texts = []
+            text_col = config["text_column"]
+
+            # Try to find text column
+            if text_col not in ds.column_names:
+                for col in ["text", "content", "sentence", "arabic", "context", "Tweet", "question", "poem_text", "hadith_text_ar"]:
+                    if col in ds.column_names:
+                        text_col = col
+                        break
+
+            # Extract texts
+            max_samples = config.get("samples", 500)
+            for i, item in enumerate(ds):
+                if i >= max_samples:
+                    break
+                text = item.get(text_col, "")
+                if text and isinstance(text, str) and len(text.strip()) > 10:
+                    texts.append(text.strip())
+
+            self.cache[dataset_key] = texts
+            return texts, f"✅ Loaded {len(texts)} samples from HuggingFace"
+
+        except Exception as e:
+            return [], f"❌ Error loading {config['hf_id']}: {str(e)[:80]}"
+
+
+def evaluate_tokenizer_on_texts(tokenizer, texts: List[str]) -> Optional[Dict]:
+    """Evaluate a tokenizer on a list of texts"""
+
+    fertilities = []
+    compressions = []
+    unk_counts = 0
+    total_tokens = 0
+
+    for text in texts:
+        try:
+            tokens = tokenizer.encode(text, add_special_tokens=False)
+            decoded = tokenizer.convert_ids_to_tokens(tokens)
+
+            num_tokens = len(tokens)
+            num_words = len(text.split()) or 1
+            num_bytes = len(text.encode('utf-8'))
+
+            fertility = num_tokens / num_words
+            compression = num_bytes / num_tokens if num_tokens > 0 else 0
+
+            # Count UNKs
+            unk_token = getattr(tokenizer, 'unk_token', '[UNK]')
+            unks = sum(1 for t in decoded if t and (t == unk_token or '<unk>' in str(t).lower() or '[unk]' in str(t).lower()))
+
+            fertilities.append(fertility)
+            compressions.append(compression)
+            unk_counts += unks
+            total_tokens += num_tokens
+
+        except Exception:
+            continue
+
+    if not fertilities:
+        return None
+
+    return {
+        "avg_fertility": statistics.mean(fertilities),
+        "std_fertility": statistics.stdev(fertilities) if len(fertilities) > 1 else 0,
+        "avg_compression": statistics.mean(compressions),
+        "unk_ratio": unk_counts / total_tokens if total_tokens > 0 else 0,
+        "samples": len(fertilities)
+    }
+
+
+def calculate_leaderboard_score(fertility: float, compression: float, unk_ratio: float) -> float:
+    """Calculate overall score (0-100, higher is better)"""
+    # Lower fertility is better (ideal ~1.0 for Arabic)
+    fertility_score = max(0, min(1, 2.0 / fertility)) if fertility > 0 else 0
+    # Higher compression is better
+    compression_score = min(1, compression / 6)
+    # Lower UNK is better
+    unk_score = 1 - min(1, unk_ratio * 20)
+
+    # Weighted combination
+    score = (fertility_score * 0.45 + compression_score * 0.35 + unk_score * 0.20) * 100
+    return round(score, 1)
+
+
+def run_leaderboard_evaluation(
+    selected_datasets: List[str],
+    selected_tokenizers: List[str],
+    progress=gr.Progress()
+) -> Tuple[str, str, str]:
+    """
+    Run the full leaderboard evaluation with real HF datasets
+    Returns: (leaderboard_html, per_dataset_html, status_message)
+    """
+
+    if not selected_datasets:
+        return "", "", "⚠️ Please select at least one dataset"
+
+    if not selected_tokenizers:
+        return "", "", "⚠️ Please select at least one tokenizer"
+
+    loader = HFDatasetLoader()
+    results = defaultdict(dict)
+
+    # Status tracking
+    status_lines = []
+
+    # Load datasets from HuggingFace
+    status_lines.append("📚 **Loading Datasets from HuggingFace:**\n")
+    loaded_datasets = {}
+
+    for i, ds_key in enumerate(selected_datasets):
+        progress((i + 1) / len(selected_datasets) * 0.3, f"Loading {ds_key}...")
+        texts, msg = loader.load_dataset_texts(ds_key)
+        ds_name = LEADERBOARD_DATASETS[ds_key]["name"]
+        status_lines.append(f"  • {ds_name}: {msg}")
+        if texts:
+            loaded_datasets[ds_key] = texts
+
+    if not loaded_datasets:
+        return "", "", "\n".join(status_lines) + "\n\n❌ No datasets loaded successfully"
+
+    # Evaluate tokenizers
+    status_lines.append("\n🔄 **Evaluating Tokenizers:**\n")
+
+    tokenizer_cache = {}
+    total_steps = len(selected_tokenizers) * len(loaded_datasets)
+    current_step = 0
+
+    for tok_choice in selected_tokenizers:
+        # Get model ID from choice
+        tok_id = tokenizer_manager.get_model_id_from_choice(tok_choice)
+        tok_info = tokenizer_manager.get_available_tokenizers().get(tok_id)
+        tok_name = tok_info.name if tok_info else tok_choice
+
+        # Load tokenizer
+        try:
+            if tok_id not in tokenizer_cache:
+                tokenizer_cache[tok_id] = AutoTokenizer.from_pretrained(
+                    tok_id, trust_remote_code=True
+                )
+            tokenizer = tokenizer_cache[tok_id]
+            status_lines.append(f"  • {tok_name}: ✅ Loaded")
+        except Exception as e:
+            status_lines.append(f"  • {tok_name}: ❌ Failed ({str(e)[:30]})")
+            continue
+
+        # Evaluate on each dataset
+        for ds_key, texts in loaded_datasets.items():
+            current_step += 1
+            progress(0.3 + (current_step / total_steps) * 0.6, f"Evaluating {tok_name} on {ds_key}...")
+
+            metrics = evaluate_tokenizer_on_texts(tokenizer, texts)
+            if metrics:
+                results[tok_choice][ds_key] = metrics
+
+    # Generate leaderboard
+    progress(0.95, "Generating leaderboard...")
+
+    leaderboard_data = []
+    per_dataset_data = []
+
+    for tok_choice, ds_results in results.items():
+        if not ds_results:
+            continue
+
+        tok_id = tokenizer_manager.get_model_id_from_choice(tok_choice)
+        tok_info = tokenizer_manager.get_available_tokenizers().get(tok_id)
+
+        # Aggregate across datasets
+        all_fertility = [m["avg_fertility"] for m in ds_results.values()]
+        all_compression = [m["avg_compression"] for m in ds_results.values()]
+        all_unk = [m["unk_ratio"] for m in ds_results.values()]
+
+        avg_fertility = statistics.mean(all_fertility)
+        avg_compression = statistics.mean(all_compression)
+ return self.cache[dataset_key], f"✅ Loaded {len(self.cache[dataset_key])} samples (cached)"
29
+
30
+ config = LEADERBOARD_DATASETS.get(dataset_key)
31
+ if not config:
32
+ return [], f"❌ Unknown dataset: {dataset_key}"
33
+
34
+ try:
35
+ # Load dataset from HuggingFace
36
+ if config.get("subset"):
37
+ ds = load_dataset(
38
+ config["hf_id"],
39
+ config["subset"],
40
+ split=config["split"],
41
+ trust_remote_code=True
42
+ )
43
+ else:
44
+ ds = load_dataset(
45
+ config["hf_id"],
46
+ split=config["split"],
47
+ trust_remote_code=True
48
+ )
49
+
50
+ texts = []
51
+ text_col = config["text_column"]
52
+
53
+ # Try to find text column
54
+ if text_col not in ds.column_names:
55
+ for col in ["text", "content", "sentence", "arabic", "context", "Tweet", "question", "poem_text", "hadith_text_ar"]:
56
+ if col in ds.column_names:
57
+ text_col = col
58
+ break
59
+
60
+ # Extract texts
61
+ max_samples = config.get("samples", 500)
62
+ for i, item in enumerate(ds):
63
+ if i >= max_samples:
64
+ break
65
+ text = item.get(text_col, "")
66
+ if text and isinstance(text, str) and len(text.strip()) > 10:
67
+ texts.append(text.strip())
68
+
69
+ self.cache[dataset_key] = texts
70
+ return texts, f"✅ Loaded {len(texts)} samples from HuggingFace"
71
+
72
+ except Exception as e:
73
+ return [], f"❌ Error loading {config['hf_id']}: {str(e)[:80]}"
74
+
75
+
76
+ def evaluate_tokenizer_on_texts(tokenizer, texts: List[str]) -> Optional[Dict]:
77
+ """Evaluate a tokenizer on a list of texts"""
78
+
79
+ fertilities = []
80
+ compressions = []
81
+ unk_counts = 0
82
+ total_tokens = 0
83
+
84
+ for text in texts:
85
+ try:
86
+ tokens = tokenizer.encode(text, add_special_tokens=False)
87
+ decoded = tokenizer.convert_ids_to_tokens(tokens)
88
+
89
+ num_tokens = len(tokens)
90
+ num_words = len(text.split()) or 1
91
+ num_bytes = len(text.encode('utf-8'))
92
+
93
+ fertility = num_tokens / num_words
94
+ compression = num_bytes / num_tokens if num_tokens > 0 else 0
95
+
96
+ # Count UNKs
97
+ unk_token = getattr(tokenizer, 'unk_token', '[UNK]')
98
+ unks = sum(1 for t in decoded if t and (t == unk_token or '<unk>' in str(t).lower() or '[unk]' in str(t).lower()))
99
+
100
+ fertilities.append(fertility)
101
+ compressions.append(compression)
102
+ unk_counts += unks
103
+ total_tokens += num_tokens
104
+
105
+ except Exception:
106
+ continue
107
+
108
+ if not fertilities:
109
+ return None
110
+
111
+ return {
112
+ "avg_fertility": statistics.mean(fertilities),
113
+ "std_fertility": statistics.stdev(fertilities) if len(fertilities) > 1 else 0,
114
+ "avg_compression": statistics.mean(compressions),
115
+ "unk_ratio": unk_counts / total_tokens if total_tokens > 0 else 0,
116
+ "samples": len(fertilities)
117
+ }
118
+
119
+
120
+ def calculate_leaderboard_score(fertility: float, compression: float, unk_ratio: float) -> float:
121
+ """Calculate overall score (0-100, higher is better)"""
122
+ # Lower fertility is better (ideal ~1.0 for Arabic)
123
+ fertility_score = max(0, min(1, 2.0 / fertility)) if fertility > 0 else 0
124
+ # Higher compression is better
125
+ compression_score = min(1, compression / 6)
126
+ # Lower UNK is better
127
+ unk_score = 1 - min(1, unk_ratio * 20)
128
+
129
+ # Weighted combination
130
+ score = (fertility_score * 0.45 + compression_score * 0.35 + unk_score * 0.20) * 100
131
+ return round(score, 1)
132
+
133
+
134
+ def run_leaderboard_evaluation(
135
+ selected_datasets: List[str],
136
+ selected_tokenizers: List[str],
137
+ progress=gr.Progress()
138
+ ) -> Tuple[str, str, str]:
139
+ """
140
+ Run the full leaderboard evaluation with real HF datasets
141
+ Returns: (leaderboard_html, per_dataset_html, status_message)
142
+ """
143
+
144
+ if not selected_datasets:
145
+ return "", "", "⚠️ Please select at least one dataset"
146
+
147
+ if not selected_tokenizers:
148
+ return "", "", "⚠️ Please select at least one tokenizer"
149
+
150
+ loader = HFDatasetLoader()
151
+ results = defaultdict(dict)
152
+
153
+ # Status tracking
154
+ status_lines = []
155
+
156
+ # Load datasets from HuggingFace
157
+ status_lines.append("📚 **Loading Datasets from HuggingFace:**\n")
158
+ loaded_datasets = {}
159
+
160
+ for i, ds_key in enumerate(selected_datasets):
161
+ progress((i + 1) / len(selected_datasets) * 0.3, f"Loading {ds_key}...")
162
+ texts, msg = loader.load_dataset_texts(ds_key)
163
+ ds_name = LEADERBOARD_DATASETS[ds_key]["name"]
164
+ status_lines.append(f" • {ds_name}: {msg}")
165
+ if texts:
166
+ loaded_datasets[ds_key] = texts
167
+
168
+ if not loaded_datasets:
169
+ return "", "", "\n".join(status_lines) + "\n\n❌ No datasets loaded successfully"
170
+
171
+ # Evaluate tokenizers
172
+ status_lines.append("\n🔄 **Evaluating Tokenizers:**\n")
173
+
174
+ tokenizer_cache = {}
175
+ total_steps = len(selected_tokenizers) * len(loaded_datasets)
176
+ current_step = 0
177
+
178
+ for tok_choice in selected_tokenizers:
179
+ # Get model ID from choice
180
+ tok_id = tokenizer_manager.get_model_id_from_choice(tok_choice)
181
+ tok_info = tokenizer_manager.get_available_tokenizers().get(tok_id)
182
+ tok_name = tok_info.name if tok_info else tok_choice
183
+
184
+ # Load tokenizer
185
+ try:
186
+ if tok_id not in tokenizer_cache:
187
+ tokenizer_cache[tok_id] = AutoTokenizer.from_pretrained(
188
+ tok_id, trust_remote_code=True
189
+ )
190
+ tokenizer = tokenizer_cache[tok_id]
191
+ status_lines.append(f" • {tok_name}: ✅ Loaded")
192
+ except Exception as e:
193
+ status_lines.append(f" • {tok_name}: ❌ Failed ({str(e)[:30]})")
194
+ continue
195
+
196
+ # Evaluate on each dataset
197
+ for ds_key, texts in loaded_datasets.items():
198
+ current_step += 1
199
+ progress(0.3 + (current_step / total_steps) * 0.6, f"Evaluating {tok_name} on {ds_key}...")
200
+
201
+ metrics = evaluate_tokenizer_on_texts(tokenizer, texts)
202
+ if metrics:
203
+ results[tok_choice][ds_key] = metrics
204
+
205
+ # Generate leaderboard
206
+ progress(0.95, "Generating leaderboard...")
207
+
208
+ leaderboard_data = []
209
+ per_dataset_data = []
210
+
211
+ for tok_choice, ds_results in results.items():
212
+ if not ds_results:
213
+ continue
214
+
215
+ tok_id = tokenizer_manager.get_model_id_from_choice(tok_choice)
216
+ tok_info = tokenizer_manager.get_available_tokenizers().get(tok_id)
217
+
218
+ # Aggregate across datasets
219
+ all_fertility = [m["avg_fertility"] for m in ds_results.values()]
220
+ all_compression = [m["avg_compression"] for m in ds_results.values()]
221
+ all_unk = [m["unk_ratio"] for m in ds_results.values()]
222
+
223
+ avg_fertility = statistics.mean(all_fertility)
224
+ avg_compression = statistics.mean(all_compression)
225
+ avg_unk = statistics.mean(all_unk)
226
+
227
+ score = calculate_leaderboard_score(avg_fertility, avg_compression, avg_unk)
228
+
229
+ leaderboard_data.append({
230
+ "name": tok_info.name if tok_info else tok_choice,
231
+ "type": tok_info.type.value if tok_info else "Unknown",
232
+ "org": tok_info.organization if tok_info else "Unknown",
233
+ "score": score,
234
+ "fertility": avg_fertility,
235
+ "compression": avg_compression,
236
+ "unk_ratio": avg_unk,
237
+ "num_datasets": len(ds_results)
238
+ })
239
+
240
+ # Per-dataset row
241
+ per_ds_row = {"Tokenizer": tok_info.name if tok_info else tok_choice}
242
+ for ds_key in selected_datasets:
243
+ ds_name = LEADERBOARD_DATASETS[ds_key]["name"]
244
+ if ds_key in ds_results:
245
+ per_ds_row[ds_name] = round(ds_results[ds_key]["avg_fertility"], 2)
246
+ else:
247
+ per_ds_row[ds_name] = "-"
248
+ per_dataset_data.append(per_ds_row)
249
+
250
+ # Sort by score
251
+ leaderboard_data.sort(key=lambda x: x["score"], reverse=True)
252
+
253
+ # Create HTML tables
254
+ leaderboard_html = generate_leaderboard_html(leaderboard_data)
255
+ per_dataset_html = generate_per_dataset_html(per_dataset_data, selected_datasets)
256
+
257
+ status_lines.append(f"\n✅ **Evaluation Complete!** Evaluated {len(results)} tokenizers on {len(loaded_datasets)} datasets.")
258
+
259
+ return leaderboard_html, per_dataset_html, "\n".join(status_lines)
260
+
261
+
262
+ def generate_leaderboard_html(data: List[Dict]) -> str:
263
+ """Generate HTML for main leaderboard"""
264
+
265
+ if not data:
266
+ return "<p>No results to display</p>"
267
+
268
+ html = """
269
+ <style>
270
+ .leaderboard-table {
271
+ width: 100%;
272
+ border-collapse: collapse;
273
+ font-family: system-ui, -apple-system, sans-serif;
274
+ margin: 20px 0;
275
+ }
276
+ .leaderboard-table th {
277
+ background: linear-gradient(135deg, #1a5f2a 0%, #2d8f4e 100%);
278
+ color: white;
279
+ padding: 12px 8px;
280
+ text-align: left;
281
+ font-weight: 600;
282
+ }
283
+ .leaderboard-table td {
284
+ padding: 10px 8px;
285
+ border-bottom: 1px solid #e0e0e0;
286
+ }
287
+ .leaderboard-table tr:nth-child(even) {
288
+ background-color: #f8f9fa;
289
+ }
290
+ .leaderboard-table tr:hover {
291
+ background-color: #e8f5e9;
292
+ }
293
+ .rank-1 { background: linear-gradient(90deg, #ffd700 0%, #fff8dc 100%) !important; }
294
+ .rank-2 { background: linear-gradient(90deg, #c0c0c0 0%, #f5f5f5 100%) !important; }
295
+ .rank-3 { background: linear-gradient(90deg, #cd7f32 0%, #ffe4c4 100%) !important; }
296
+ .score-badge {
297
+ background: #2d8f4e;
298
+ color: white;
299
+ padding: 4px 8px;
300
+ border-radius: 12px;
301
+ font-weight: bold;
302
+ }
303
+ .type-badge {
304
+ background: #e3f2fd;
305
+ color: #1565c0;
306
+ padding: 2px 6px;
307
+ border-radius: 4px;
308
+ font-size: 0.85em;
309
+ }
310
+ .metric-good { color: #2e7d32; font-weight: 600; }
311
+ .metric-bad { color: #c62828; }
312
+ </style>
313
+
314
+ <table class="leaderboard-table">
315
+ <thead>
316
+ <tr>
317
+ <th>Rank</th>
318
+ <th>Tokenizer</th>
319
+ <th>Type</th>
320
+ <th>Organization</th>
321
+ <th>Score ↑</th>
322
+ <th>Fertility ↓</th>
323
+ <th>Compression ↑</th>
324
+ <th>UNK Rate ↓</th>
325
+ <th>Datasets</th>
326
+ </tr>
327
+ </thead>
328
+ <tbody>
329
+ """
330
+
331
+ for i, entry in enumerate(data):
332
+ rank = i + 1
333
+ rank_class = f"rank-{rank}" if rank <= 3 else ""
334
+
335
+ fert_class = "metric-good" if entry["fertility"] < 2.0 else "metric-bad" if entry["fertility"] > 3.0 else ""
336
+ comp_class = "metric-good" if entry["compression"] > 3.5 else ""
337
+ unk_class = "metric-good" if entry["unk_ratio"] < 0.01 else "metric-bad" if entry["unk_ratio"] > 0.05 else ""
338
+
339
+ html += f"""
340
+ <tr class="{rank_class}">
341
+ <td><strong>#{rank}</strong></td>
342
+ <td><strong>{entry["name"]}</strong></td>
343
+ <td><span class="type-badge">{entry["type"]}</span></td>
344
+ <td>{entry["org"]}</td>
345
+ <td><span class="score-badge">{entry["score"]}</span></td>
346
+ <td class="{fert_class}">{entry["fertility"]:.3f}</td>
347
+ <td class="{comp_class}">{entry["compression"]:.2f}</td>
348
+ <td class="{unk_class}">{entry["unk_ratio"]:.2%}</td>
349
+ <td>{entry["num_datasets"]}</td>
350
+ </tr>
351
+ """
352
+
353
+ html += """
354
+ </tbody>
355
+ </table>
356
+
357
+ <div style="margin-top: 15px; padding: 10px; background: #f5f5f5; border-radius: 8px; font-size: 0.9em;">
358
+ <strong>📊 Metric Guide:</strong><br>
359
+ • <strong>Score:</strong> Overall ranking (0-100, higher = better)<br>
360
+ • <strong>Fertility:</strong> Tokens per word (lower = better, 1.0 ideal for Arabic)<br>
361
+ • <strong>Compression:</strong> Bytes per token (higher = more efficient)<br>
362
+ • <strong>UNK Rate:</strong> Unknown token percentage (lower = better)
363
+ </div>
364
+ """
365
+
366
+ return html
367
+
368
+
369
+ def generate_per_dataset_html(data: List[Dict], dataset_keys: List[str]) -> str:
370
+ """Generate HTML for per-dataset fertility table"""
371
+
372
+ if not data:
373
+ return "<p>No per-dataset results</p>"
374
+
375
+ ds_names = [LEADERBOARD_DATASETS[k]["name"] for k in dataset_keys]
376
+
377
+ html = """
378
+ <style>
379
+ .dataset-table {
380
+ width: 100%;
381
+ border-collapse: collapse;
382
+ font-family: system-ui, -apple-system, sans-serif;
383
+ margin: 20px 0;
384
+ font-size: 0.9em;
385
+ }
386
+ .dataset-table th {
387
+ background: #37474f;
388
+ color: white;
389
+ padding: 10px 6px;
390
+ text-align: center;
391
+ }
392
+ .dataset-table th:first-child {
393
+ text-align: left;
394
+ }
395
+ .dataset-table td {
396
+ padding: 8px 6px;
397
+ text-align: center;
398
+ border-bottom: 1px solid #e0e0e0;
399
+ }
400
+ .dataset-table td:first-child {
401
+ text-align: left;
402
+ font-weight: 500;
403
+ }
404
+ .dataset-table tr:nth-child(even) {
405
+ background-color: #fafafa;
406
+ }
407
+ .fert-excellent { background: #c8e6c9; color: #1b5e20; font-weight: 600; }
408
+ .fert-good { background: #fff9c4; color: #f57f17; }
409
+ .fert-poor { background: #ffcdd2; color: #b71c1c; }
410
+ </style>
411
+
412
+ <h4>📈 Fertility per Dataset (tokens/word - lower is better)</h4>
413
+ <table class="dataset-table">
414
+ <thead>
415
+ <tr>
416
+ <th>Tokenizer</th>
417
+ """
418
+
419
+ for ds_name in ds_names:
420
+ html += f"<th>{ds_name}</th>"
421
+
422
+ html += """
423
+ </tr>
424
+ </thead>
425
+ <tbody>
426
+ """
427
+
428
+ for row in data:
429
+ html += f"<tr><td>{row['Tokenizer']}</td>"
430
+ for ds_name in ds_names:
431
+ val = row.get(ds_name, "-")
432
+ if val != "-":
433
+ if val < 1.8:
434
+ cls = "fert-excellent"
435
+ elif val < 2.5:
436
+ cls = "fert-good"
437
+ else:
438
+ cls = "fert-poor"
439
+ html += f'<td class="{cls}">{val}</td>'
440
+ else:
441
+ html += '<td>-</td>'
442
+ html += "</tr>"
443
+
444
+ html += """
445
+ </tbody>
446
+ </table>
447
+ """
448
+
449
+ return html
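
To make the weighting concrete, here is a small worked example of the scoring formula above. The metric values are illustrative, not from a real run, and assume `leaderboard.py` is importable:

```python
from leaderboard import calculate_leaderboard_score

# fertility=1.6   -> fertility_score   = min(1, 2.0 / 1.6) = 1.00
# compression=4.2 -> compression_score = min(1, 4.2 / 6)   = 0.70
# unk_ratio=0.002 -> unk_score         = 1 - min(1, 0.04)  = 0.96
# score = (1.00*0.45 + 0.70*0.35 + 0.96*0.20) * 100 = 88.7
print(calculate_leaderboard_score(1.6, 4.2, 0.002))  # 88.7
```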
requirements.txt CHANGED
@@ -1 +1,7 @@
- aranizer
+ gradio>=4.0.0
+ transformers>=4.35.0
+ huggingface_hub>=0.19.0
+ datasets>=2.14.0
+ torch
+ sentencepiece
+ protobuf
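
With these dependencies installed, the leaderboard can also be driven headlessly. A minimal sketch; the dataset key here is hypothetical (real keys live in `LEADERBOARD_DATASETS` in config.py), and a no-op callable stands in for `gr.Progress` outside a Gradio event:

```python
from leaderboard import run_leaderboard_evaluation
from tokenizer_manager import tokenizer_manager

tokenizers = tokenizer_manager.get_tokenizer_choices()[:2]  # any two available
datasets = ["arabic_wiki"]  # hypothetical key; pick one from LEADERBOARD_DATASETS

lb_html, per_ds_html, status = run_leaderboard_evaluation(
    datasets, tokenizers, progress=lambda *args, **kwargs: None
)
print(status)
```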
styles.py ADDED
@@ -0,0 +1,526 @@
+ """
+ CSS Styles
+ ==========
+ All custom CSS for the Arabic Tokenizer Arena
+ """
+ 
+ CUSTOM_CSS = """
+ /* ===== ROOT VARIABLES ===== */
+ :root {
+     --primary: #1a5f2a;
+     --primary-light: #2d8f4e;
+     --secondary: #4a90d9;
+     --accent: #f59e0b;
+     --success: #10b981;
+     --warning: #f57c00;
+     --error: #c62828;
+     --bg-primary: #0f1419;
+     --bg-secondary: #1c2128;
+     --bg-card: #22272e;
+     --text-primary: #e6edf3;
+     --text-secondary: #8b949e;
+     --border: #30363d;
+ }
+ 
+ /* ===== HEADER ===== */
+ .header-section {
+     text-align: center;
+     padding: 2rem 1rem;
+     background: linear-gradient(135deg, var(--primary) 0%, var(--primary-light) 100%);
+     border-radius: 16px;
+     margin-bottom: 1.5rem;
+ }
+ 
+ .header-section h1 {
+     font-size: 2.5rem;
+     color: white;
+     margin-bottom: 0.5rem;
+ }
+ 
+ .header-section p {
+     color: rgba(255,255,255,0.9);
+     font-size: 1.1rem;
+ }
+ 
+ /* ===== INFO CARD ===== */
+ .info-card {
+     background: var(--bg-card);
+     border-radius: 12px;
+     padding: 1.5rem;
+     border: 1px solid var(--border);
+ }
+ 
+ .info-header {
+     display: flex;
+     justify-content: space-between;
+     align-items: center;
+     margin-bottom: 1rem;
+     flex-wrap: wrap;
+     gap: 0.5rem;
+ }
+ 
+ .info-header h3 {
+     color: var(--text-primary);
+     margin: 0;
+ }
+ 
+ .org-badge {
+     background: var(--primary);
+     color: white;
+     padding: 0.25rem 0.75rem;
+     border-radius: 20px;
+     font-size: 0.85rem;
+ }
+ 
+ .description {
+     color: var(--text-secondary);
+     line-height: 1.6;
+ }
+ 
+ .info-grid {
+     display: grid;
+     grid-template-columns: repeat(2, 1fr);
+     gap: 1rem;
+     margin: 1rem 0;
+ }
+ 
+ .info-item {
+     display: flex;
+     flex-direction: column;
+ }
+ 
+ .info-label {
+     color: var(--text-secondary);
+     font-size: 0.85rem;
+ }
+ 
+ .info-value {
+     color: var(--text-primary);
+     font-weight: 600;
+ }
+ 
+ .support-native { color: var(--success); }
+ .support-supported { color: var(--secondary); }
+ .support-limited { color: var(--warning); }
+ 
+ /* ===== BADGES ===== */
+ .badge-container {
+     margin-top: 1rem;
+ }
+ 
+ .badge-group {
+     margin-bottom: 0.5rem;
+ }
+ 
+ .badge-label {
+     color: var(--text-secondary);
+     font-size: 0.85rem;
+     margin-right: 0.5rem;
+ }
+ 
+ .badge {
+     display: inline-block;
+     padding: 0.2rem 0.5rem;
+     border-radius: 4px;
+     font-size: 0.75rem;
+     margin-right: 0.25rem;
+     margin-bottom: 0.25rem;
+ }
+ 
+ .badge.dialect {
+     background: rgba(74, 144, 217, 0.2);
+     color: var(--secondary);
+ }
+ 
+ .badge.feature {
+     background: rgba(245, 158, 11, 0.2);
+     color: var(--accent);
+ }
+ 
+ /* ===== METRICS GRID ===== */
+ .metrics-grid {
+     display: grid;
+     grid-template-columns: repeat(auto-fit, minmax(150px, 1fr));
+     gap: 1rem;
+     margin: 1rem 0;
+ }
+ 
+ .metric-card {
+     background: var(--bg-card);
+     border-radius: 12px;
+     padding: 1rem;
+     text-align: center;
+     border: 1px solid var(--border);
+     transition: transform 0.2s;
+ }
+ 
+ .metric-card:hover {
+     transform: translateY(-2px);
+ }
+ 
+ .metric-card.excellent {
+     border-color: var(--success);
+     background: linear-gradient(to bottom, rgba(16, 185, 129, 0.1), transparent);
+ }
+ 
+ .metric-card.good {
+     border-color: var(--secondary);
+     background: linear-gradient(to bottom, rgba(74, 144, 217, 0.1), transparent);
+ }
+ 
+ .metric-card.poor {
+     border-color: var(--error);
+     background: linear-gradient(to bottom, rgba(198, 40, 40, 0.1), transparent);
+ }
+ 
+ .metric-card.primary {
+     border-color: var(--primary);
+     background: linear-gradient(to bottom, rgba(26, 95, 42, 0.1), transparent);
+ }
+ 
+ .metric-icon {
+     font-size: 1.5rem;
+     margin-bottom: 0.5rem;
+ }
+ 
+ .metric-value {
+     font-size: 1.5rem;
+     font-weight: 700;
+     color: var(--text-primary);
+ }
+ 
+ .metric-label {
+     font-size: 0.8rem;
+     color: var(--text-secondary);
+     margin-top: 0.25rem;
+ }
+ 
+ .metric-hint {
+     font-size: 0.7rem;
+     color: var(--text-secondary);
+     opacity: 0.7;
+ }
+ 
+ /* ===== TOKEN VISUALIZATION ===== */
+ .token-container {
+     display: flex;
+     flex-wrap: wrap;
+     gap: 0.5rem;
+     padding: 1rem;
+     background: var(--bg-secondary);
+     border-radius: 12px;
+     direction: rtl;
+ }
+ 
+ .token {
+     display: inline-flex;
+     flex-direction: column;
+     align-items: center;
+     padding: 0.5rem 0.75rem;
+     border-radius: 8px;
+     font-family: 'IBM Plex Sans Arabic', monospace;
+     font-size: 1rem;
+     transition: transform 0.2s;
+     cursor: default;
+ }
+ 
+ .token:hover {
+     transform: scale(1.05);
+ }
+ 
+ .token-id {
+     font-size: 0.65rem;
+     opacity: 0.7;
+     margin-top: 0.25rem;
+ }
+ 
+ /* ===== DECODED SECTION ===== */
+ .decoded-section {
+     background: var(--bg-card);
+     border-radius: 12px;
+     padding: 1.5rem;
+     border: 1px solid var(--border);
+ }
+ 
+ .decoded-section h4 {
+     color: var(--text-primary);
+     margin-bottom: 1rem;
+ }
+ 
+ .decoded-text {
+     font-family: 'IBM Plex Sans Arabic', serif;
+     font-size: 1.1rem;
+     line-height: 1.8;
+     color: var(--text-primary);
+ }
+ 
+ .decoded-meta {
+     margin-top: 1rem;
+     font-size: 0.85rem;
+     color: var(--text-secondary);
+ }
+ 
+ /* ===== COMPARISON TABLE ===== */
+ .comparison-container {
+     overflow-x: auto;
+ }
+ 
+ .comparison-table {
+     width: 100%;
+     border-collapse: collapse;
+     margin: 1rem 0;
+ }
+ 
+ .comparison-table th {
+     background: var(--primary);
+     color: white;
+     padding: 0.75rem;
+     text-align: left;
+     font-weight: 600;
+ }
+ 
+ .comparison-table td {
+     padding: 0.75rem;
+     border-bottom: 1px solid var(--border);
+     color: var(--text-primary);
+ }
+ 
+ .comparison-table tr:hover {
+     background: rgba(74, 144, 217, 0.1);
+ }
+ 
+ .comparison-table .rank-1 {
+     background: linear-gradient(90deg, rgba(255, 215, 0, 0.2), transparent);
+ }
+ 
+ .comparison-table .rank-2 {
+     background: linear-gradient(90deg, rgba(192, 192, 192, 0.2), transparent);
+ }
+ 
+ .comparison-table .rank-3 {
+     background: linear-gradient(90deg, rgba(205, 127, 50, 0.2), transparent);
+ }
+ 
+ .comparison-table .excellent {
+     color: var(--success);
+     font-weight: 600;
+ }
+ 
+ .comparison-table .good {
+     color: var(--secondary);
+ }
+ 
+ .comparison-table .poor {
+     color: var(--error);
+ }
+ 
+ /* ===== ABOUT PAGE ===== */
+ .about-container {
+     padding: 1rem;
+ }
+ 
+ .about-header {
+     text-align: center;
+     margin-bottom: 2rem;
+ }
+ 
+ .about-header h2 {
+     color: var(--text-primary);
+     font-size: 2rem;
+     margin-bottom: 0.5rem;
+ }
+ 
+ .about-subtitle {
+     color: var(--text-secondary);
+     font-size: 1.1rem;
+ }
+ 
+ .about-stats {
+     display: flex;
+     justify-content: center;
+     gap: 2rem;
+     margin: 2rem 0;
+     flex-wrap: wrap;
+ }
+ 
+ .stat-card {
+     background: var(--bg-card);
+     border: 1px solid var(--border);
+     border-radius: 12px;
+     padding: 1.5rem 2rem;
+     text-align: center;
+ }
+ 
+ .stat-value {
+     font-size: 2.5rem;
+     font-weight: 700;
+     color: var(--primary-light);
+ }
+ 
+ .stat-label {
+     color: var(--text-secondary);
+     font-size: 0.9rem;
+     margin-top: 0.25rem;
+ }
+ 
+ .about-tokenizers {
+     margin: 2rem 0;
+ }
+ 
+ .about-tokenizers h3 {
+     color: var(--text-primary);
+     margin-bottom: 1rem;
+ }
+ 
+ .tokenizer-grid {
+     display: grid;
+     grid-template-columns: repeat(auto-fit, minmax(280px, 1fr));
+     gap: 1.5rem;
+ }
+ 
+ .about-category {
+     background: var(--bg-card);
+     border: 1px solid var(--border);
+     border-radius: 12px;
+     padding: 1rem 1.5rem;
+ }
+ 
+ .about-category h4 {
+     color: var(--primary-light);
+     margin-bottom: 0.75rem;
+     font-size: 1rem;
+ }
+ 
+ .about-category ul {
+     list-style: none;
+     padding: 0;
+     margin: 0;
+ }
+ 
+ .about-category li {
+     color: var(--text-secondary);
+     font-size: 0.9rem;
+     padding: 0.25rem 0;
+     border-bottom: 1px solid var(--border);
+ }
+ 
+ .about-category li:last-child {
+     border-bottom: none;
+ }
+ 
+ .about-features {
+     margin: 2rem 0;
+ }
+ 
+ .about-features h3 {
+     color: var(--text-primary);
+     margin-bottom: 1rem;
+ }
+ 
+ .feature-grid {
+     display: grid;
+     grid-template-columns: repeat(auto-fit, minmax(300px, 1fr));
+     gap: 1rem;
+ }
+ 
+ .feature-item {
+     display: flex;
+     align-items: center;
+     gap: 0.75rem;
+     padding: 0.75rem 1rem;
+     background: var(--bg-card);
+     border: 1px solid var(--border);
+     border-radius: 8px;
+     color: var(--text-secondary);
+ }
+ 
+ .feature-icon {
+     font-size: 1.25rem;
+ }
+ 
+ .about-usecases {
+     margin: 2rem 0;
+ }
+ 
+ .about-usecases h3 {
+     color: var(--text-primary);
+     margin-bottom: 1rem;
+ }
+ 
+ .usecase-grid {
+     display: grid;
+     grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
+     gap: 1rem;
+ }
+ 
+ .usecase-card {
+     background: var(--bg-card);
+     border: 1px solid var(--border);
+     border-radius: 12px;
+     padding: 1.25rem;
+ }
+ 
+ .usecase-card h4 {
+     color: var(--primary-light);
+     margin-bottom: 0.5rem;
+ }
+ 
+ .usecase-card p {
+     color: var(--text-secondary);
+     font-size: 0.9rem;
+     margin: 0;
+ }
+ 
+ .about-footer {
+     text-align: center;
+     margin-top: 2rem;
+     padding-top: 1.5rem;
+     border-top: 1px solid var(--border);
+     color: var(--text-secondary);
+ }
+ 
+ /* ===== UTILITY CLASSES ===== */
+ .warning {
+     background: linear-gradient(to right, rgba(245, 124, 0, 0.1), transparent);
+     border-left: 4px solid var(--warning);
+     padding: 1rem;
+     border-radius: 0 8px 8px 0;
+     color: var(--text-primary);
+ }
+ 
+ .error-card {
+     background: linear-gradient(to right, rgba(198, 40, 40, 0.1), transparent);
+     border-left: 4px solid var(--error);
+     padding: 1rem;
+     border-radius: 0 8px 8px 0;
+ }
+ 
+ .error-card h4 {
+     color: var(--error);
+     margin-bottom: 0.5rem;
+ }
+ 
+ .error-card p {
+     color: var(--text-secondary);
+ }
+ 
+ /* ===== RESPONSIVE ===== */
+ @media (max-width: 768px) {
+     .header-section h1 {
+         font-size: 1.75rem;
+     }
+ 
+     .info-grid {
+         grid-template-columns: 1fr;
+     }
+ 
+     .metrics-grid {
+         grid-template-columns: repeat(2, 1fr);
+     }
+ 
+     .about-stats {
+         flex-direction: column;
+         align-items: center;
+     }
+ }
+ """
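
Since all styling lives in one module-level string, attaching it is a one-liner; a minimal sketch of how app.py presumably wires it up (Gradio's `Blocks` accepts custom CSS via its `css` argument):

```python
import gradio as gr
from styles import CUSTOM_CSS

# Any HTML emitted by ui_components can now use the classes defined above
# (.metric-card, .token, .leaderboard-table, ...).
with gr.Blocks(css=CUSTOM_CSS) as demo:
    gr.HTML('<div class="header-section"><h1>🏟️ Arabic Tokenizer Arena Pro</h1></div>')

if __name__ == "__main__":
    demo.launch()
```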
tokenizer_manager.py ADDED
@@ -0,0 +1,86 @@
+ """
+ Tokenizer Manager
+ =================
+ Handles tokenizer loading, caching, and availability checking
+ """
+ 
+ import os
+ from typing import Dict, List, Any
+ from transformers import AutoTokenizer, logging
+ from config import TOKENIZER_REGISTRY, TokenizerInfo
+ 
+ logging.set_verbosity_error()
+ 
+ # HuggingFace authentication
+ HF_TOKEN = os.getenv('HF_TOKEN')
+ if HF_TOKEN:
+     HF_TOKEN = HF_TOKEN.strip()
+     from huggingface_hub import login
+     login(token=HF_TOKEN)
+ 
+ 
+ class TokenizerManager:
+     """Manages tokenizer loading and caching"""
+ 
+     def __init__(self):
+         self._cache: Dict[str, Any] = {}
+         self._available: Dict[str, TokenizerInfo] = {}
+         self._initialize_available_tokenizers()
+ 
+     def _initialize_available_tokenizers(self):
+         """Check which tokenizers are available and can be loaded"""
+         print("🔄 Initializing tokenizer registry...")
+ 
+         for model_id, info in TOKENIZER_REGISTRY.items():
+             try:
+                 _ = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+                 self._available[model_id] = info
+                 print(f" ✓ {info.name}")
+             except Exception as e:
+                 print(f" ✗ {info.name}: {str(e)[:50]}")
+ 
+         print(f"\n✅ Total available tokenizers: {len(self._available)}")
+ 
+     def get_tokenizer(self, model_id: str):
+         """Get tokenizer from cache or load it"""
+         if model_id not in self._cache:
+             self._cache[model_id] = AutoTokenizer.from_pretrained(
+                 model_id,
+                 trust_remote_code=True
+             )
+         return self._cache[model_id]
+ 
+     def get_available_tokenizers(self) -> Dict[str, TokenizerInfo]:
+         """Get all available tokenizers"""
+         return self._available
+ 
+     def get_tokenizer_choices(self) -> List[str]:
+         """Get list of tokenizer display names for dropdown"""
+         return [f"{info.name} ({info.organization})" for info in self._available.values()]
+ 
+     def get_model_id_from_choice(self, choice: str) -> str:
+         """Convert display choice back to model ID"""
+         for model_id, info in self._available.items():
+             if f"{info.name} ({info.organization})" == choice:
+                 return model_id
+         return list(self._available.keys())[0] if self._available else ""
+ 
+     def get_tokenizers_by_type(self) -> Dict[str, List[str]]:
+         """Group available tokenizers by type"""
+         choices = self.get_tokenizer_choices()
+ 
+         arabic_bert = [t for t in choices if any(x in t for x in ['AraBERT', 'CAMeL', 'MARBERT', 'ARBERT', 'Safaya'])]
+         arabic_specific = [t for t in choices if any(x in t for x in ['Aranizer'])]
+         arabic_llms = [t for t in choices if any(x in t for x in ['Jais', 'AceGPT', 'SILMA', 'Fanar', 'StableLM', 'Yehia', 'Atlas'])]
+         multilingual = [t for t in choices if t not in arabic_bert and t not in arabic_specific and t not in arabic_llms]
+ 
+         return {
+             "Arabic BERT Models": arabic_bert,
+             "Arabic Tokenizers": arabic_specific,
+             "Arabic LLMs": arabic_llms,
+             "Multilingual Models": multilingual
+         }
+ 
+ 
+ # Global tokenizer manager instance
+ tokenizer_manager = TokenizerManager()
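
A quick usage sketch of the manager's round trip between dropdown labels and model IDs; the actual display names depend on which registry entries loaded successfully:

```python
from tokenizer_manager import tokenizer_manager

choices = tokenizer_manager.get_tokenizer_choices()        # display names for the UI
model_id = tokenizer_manager.get_model_id_from_choice(choices[0])
tok = tokenizer_manager.get_tokenizer(model_id)            # cached after first load

print(tok.tokenize("مرحبا بالعالم"))  # "Hello, world" in Arabic
```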
ui_components.py ADDED
@@ -0,0 +1,280 @@
+ """
+ UI Components
+ =============
+ HTML generation functions for the Gradio interface
+ """
+ 
+ from typing import List
+ from config import TokenizerInfo, TokenizationMetrics
+ from utils import is_arabic_char
+ 
+ 
+ def generate_token_visualization(tokens: List[str], token_ids: List[int]) -> str:
+     """Generate beautiful HTML visualization of tokens"""
+ 
+     colors = [
+         ('#1a1a2e', '#eaeaea'),
+         ('#16213e', '#f0f0f0'),
+         ('#0f3460', '#ffffff'),
+         ('#533483', '#f5f5f5'),
+         ('#e94560', '#ffffff'),
+         ('#0f4c75', '#f0f0f0'),
+         ('#3282b8', '#ffffff'),
+         ('#bbe1fa', '#1a1a2e'),
+     ]
+ 
+     html_parts = []
+     for i, (token, tid) in enumerate(zip(tokens, token_ids)):
+         bg, fg = colors[i % len(colors)]
+         # Escape & before < and > so tokens containing markup render literally
+         display_token = token.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;')
+         is_arabic = any(is_arabic_char(c) for c in token)
+         direction = 'rtl' if is_arabic else 'ltr'
+ 
+         html_parts.append(f'''
+         <span class="token" style="
+             background: {bg};
+             color: {fg};
+             direction: {direction};
+         " title="ID: {tid}">
+             {display_token}
+             <span class="token-id">{tid}</span>
+         </span>
+         ''')
+ 
+     return f'''
+     <div class="token-container">
+         {''.join(html_parts)}
+     </div>
+     '''
+ 
+ 
+ def generate_metrics_card(metrics: TokenizationMetrics, info: TokenizerInfo) -> str:
+     """Generate metrics visualization card"""
+ 
+     fertility_quality = "excellent" if metrics.fertility < 1.5 else "good" if metrics.fertility < 2.5 else "poor"
+     strr_quality = "excellent" if metrics.single_token_retention_rate > 0.5 else "good" if metrics.single_token_retention_rate > 0.3 else "poor"
+     compression_quality = "excellent" if metrics.compression_ratio > 4 else "good" if metrics.compression_ratio > 2.5 else "poor"
+ 
+     return f'''
+     <div class="metrics-grid">
+         <div class="metric-card primary">
+             <div class="metric-icon">📊</div>
+             <div class="metric-value">{metrics.total_tokens}</div>
+             <div class="metric-label">Total Tokens</div>
+         </div>
+ 
+         <div class="metric-card {fertility_quality}">
+             <div class="metric-icon">🎯</div>
+             <div class="metric-value">{metrics.fertility:.3f}</div>
+             <div class="metric-label">Fertility (tokens/word)</div>
+             <div class="metric-hint">Lower is better (1.0 ideal)</div>
+         </div>
+ 
+         <div class="metric-card {compression_quality}">
+             <div class="metric-icon">📦</div>
+             <div class="metric-value">{metrics.compression_ratio:.2f}</div>
+             <div class="metric-label">Compression (bytes/token)</div>
+             <div class="metric-hint">Higher is better</div>
+         </div>
+ 
+         <div class="metric-card {strr_quality}">
+             <div class="metric-icon">✨</div>
+             <div class="metric-value">{metrics.single_token_retention_rate:.1%}</div>
+             <div class="metric-label">STRR (Single Token Retention)</div>
+             <div class="metric-hint">Higher is better</div>
+         </div>
+ 
+         <div class="metric-card">
+             <div class="metric-icon">🔤</div>
+             <div class="metric-value">{metrics.char_per_token:.2f}</div>
+             <div class="metric-label">Characters/Token</div>
+         </div>
+ 
+         <div class="metric-card {'excellent' if metrics.oov_percentage == 0 else 'poor' if metrics.oov_percentage > 5 else 'good'}">
+             <div class="metric-icon">❓</div>
+             <div class="metric-value">{metrics.oov_percentage:.1f}%</div>
+             <div class="metric-label">OOV Rate</div>
+             <div class="metric-hint">Lower is better (0% ideal)</div>
+         </div>
+ 
+         <div class="metric-card">
+             <div class="metric-icon">🌍</div>
+             <div class="metric-value">{metrics.arabic_fertility:.3f}</div>
+             <div class="metric-label">Arabic Fertility</div>
+         </div>
+ 
+         <div class="metric-card">
+             <div class="metric-icon">⚡</div>
+             <div class="metric-value">{metrics.tokenization_time_ms:.2f}ms</div>
+             <div class="metric-label">Processing Time</div>
+         </div>
+     </div>
+     '''
+ 
+ 
+ def generate_tokenizer_info_card(info: TokenizerInfo) -> str:
+     """Generate tokenizer information card"""
+ 
+     dialect_badges = ''.join([f'<span class="badge dialect">{d}</span>' for d in info.dialect_support])
+     feature_badges = ''.join([f'<span class="badge feature">{f}</span>' for f in info.special_features])
+ 
+     support_class = "native" if info.arabic_support == "Native" else "supported" if info.arabic_support == "Supported" else "limited"
+ 
+     return f'''
+     <div class="info-card">
+         <div class="info-header">
+             <h3>{info.name}</h3>
+             <span class="org-badge">{info.organization}</span>
+         </div>
+ 
+         <p class="description">{info.description}</p>
+ 
+         <div class="info-grid">
+             <div class="info-item">
+                 <span class="info-label">Type:</span>
+                 <span class="info-value">{info.type.value}</span>
+             </div>
+             <div class="info-item">
+                 <span class="info-label">Algorithm:</span>
+                 <span class="info-value">{info.algorithm.value}</span>
+             </div>
+             <div class="info-item">
+                 <span class="info-label">Vocab Size:</span>
+                 <span class="info-value">{info.vocab_size:,}</span>
+             </div>
+             <div class="info-item">
+                 <span class="info-label">Arabic Support:</span>
+                 <span class="info-value support-{support_class}">{info.arabic_support}</span>
+             </div>
+         </div>
+ 
+         <div class="badge-container">
+             <div class="badge-group">
+                 <span class="badge-label">Dialects:</span>
+                 {dialect_badges}
+             </div>
+             <div class="badge-group">
+                 <span class="badge-label">Features:</span>
+                 {feature_badges}
+             </div>
+         </div>
+     </div>
+     '''
+ 
+ 
+ def generate_decoded_section(metrics: TokenizationMetrics) -> str:
+     """Generate decoded output section"""
+     return f'''
+     <div class="decoded-section">
+         <h4>Decoded Output</h4>
+         <div class="decoded-text" dir="auto">{metrics.decoded_text}</div>
+         <div class="decoded-meta">
+             Diacritics preserved: {'✅ Yes' if metrics.diacritic_preservation else '❌ No'}
+         </div>
+     </div>
+     '''
+ 
+ 
+ def generate_about_html(tokenizers_by_type: dict, total_count: int) -> str:
+     """Generate About page HTML"""
+ 
+     # Build tokenizer lists
+     sections = []
+     for category, tokenizers in tokenizers_by_type.items():
+         if tokenizers:
+             items = ''.join([f'<li>{t}</li>' for t in tokenizers[:12]])
+             if len(tokenizers) > 12:
+                 items += f'<li><em>...and {len(tokenizers) - 12} more</em></li>'
+             sections.append(f'''
+             <div class="about-category">
+                 <h4>{category}</h4>
+                 <ul>{items}</ul>
+             </div>
+             ''')
+ 
+     return f'''
+     <div class="about-container">
+         <div class="about-header">
+             <h2>🏟️ Arabic Tokenizer Arena Pro</h2>
+             <p class="about-subtitle">A comprehensive platform for evaluating Arabic tokenizers across multiple dimensions</p>
+         </div>
+ 
+         <div class="about-stats">
+             <div class="stat-card">
+                 <div class="stat-value">{total_count}</div>
+                 <div class="stat-label">Available Tokenizers</div>
+             </div>
+             <div class="stat-card">
+                 <div class="stat-value">8</div>
+                 <div class="stat-label">Evaluation Datasets</div>
+             </div>
+             <div class="stat-card">
+                 <div class="stat-value">8+</div>
+                 <div class="stat-label">Metrics</div>
+             </div>
+         </div>
+ 
+         <div class="about-tokenizers">
+             <h3>📚 Available Tokenizers</h3>
+             <div class="tokenizer-grid">
+                 {''.join(sections)}
+             </div>
+         </div>
+ 
+         <div class="about-features">
+             <h3>✨ Features</h3>
+             <div class="feature-grid">
+                 <div class="feature-item">
+                     <span class="feature-icon">📊</span>
+                     <span>Comprehensive efficiency metrics (fertility, compression, STRR)</span>
+                 </div>
+                 <div class="feature-item">
+                     <span class="feature-icon">🌍</span>
+                     <span>Arabic-specific analysis (dialect support, diacritic preservation)</span>
+                 </div>
+                 <div class="feature-item">
+                     <span class="feature-icon">⚖️</span>
+                     <span>Side-by-side tokenizer comparison</span>
+                 </div>
+                 <div class="feature-item">
+                     <span class="feature-icon">🎨</span>
+                     <span>Beautiful token visualization</span>
+                 </div>
+                 <div class="feature-item">
+                     <span class="feature-icon">🏆</span>
+                     <span>Leaderboard with real HuggingFace datasets</span>
+                 </div>
+                 <div class="feature-item">
+                     <span class="feature-icon">📖</span>
+                     <span>Support for MSA, dialectal, and Classical Arabic</span>
+                 </div>
+             </div>
+         </div>
+ 
+         <div class="about-usecases">
+             <h3>🎯 Use Cases</h3>
+             <div class="usecase-grid">
+                 <div class="usecase-card">
+                     <h4>🔬 Research</h4>
+                     <p>Compare tokenizers for Arabic NLP experiments</p>
+                 </div>
+                 <div class="usecase-card">
+                     <h4>🚀 Production</h4>
+                     <p>Select optimal tokenizer for deployment</p>
+                 </div>
+                 <div class="usecase-card">
+                     <h4>📚 Education</h4>
+                     <p>Understand how different algorithms handle Arabic</p>
+                 </div>
+                 <div class="usecase-card">
+                     <h4>💰 Optimization</h4>
+                     <p>Identify cost-efficient tokenizers for API usage</p>
+                 </div>
+             </div>
+         </div>
+ 
+         <div class="about-footer">
+             <p>Built with ❤️ for the Arabic NLP community</p>
+         </div>
+     </div>
+     '''
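
The HTML builders are pure functions over tokens, IDs, and metric objects, so they can be exercised outside the app; a minimal sketch run from the project root (the model ID is just an example, any HF tokenizer with `convert_ids_to_tokens` works):

```python
from transformers import AutoTokenizer
from ui_components import generate_token_visualization

tok = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv2")  # example model
ids = tok.encode("اللغة العربية جميلة", add_special_tokens=False)
tokens = tok.convert_ids_to_tokens(ids)

html = generate_token_visualization(tokens, ids)  # render via gr.HTML(html)
```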
utils.py ADDED
@@ -0,0 +1,56 @@
+ """
+ Arabic Text Utilities
+ =====================
+ Helper functions for Arabic text analysis
+ """
+ 
+ import re
+ from typing import List
+ 
+ 
+ def is_arabic_char(char: str) -> bool:
+     """Check if character is Arabic"""
+     if len(char) != 1:
+         return False
+     code = ord(char)
+     return (
+         (0x0600 <= code <= 0x06FF) or  # Arabic
+         (0x0750 <= code <= 0x077F) or  # Arabic Supplement
+         (0x08A0 <= code <= 0x08FF) or  # Arabic Extended-A
+         (0xFB50 <= code <= 0xFDFF) or  # Arabic Presentation Forms-A
+         (0xFE70 <= code <= 0xFEFF)     # Arabic Presentation Forms-B
+     )
+ 
+ 
+ def count_arabic_chars(text: str) -> int:
+     """Count Arabic characters in text"""
+     return sum(1 for c in text if is_arabic_char(c))
+ 
+ 
+ def has_diacritics(text: str) -> bool:
+     """Check if text contains Arabic diacritics (tashkeel)"""
+     diacritics = set('ًٌٍَُِّْٰ')
+     return any(c in diacritics for c in text)
+ 
+ 
+ def normalize_arabic(text: str) -> str:
+     """Basic Arabic normalization"""
+     # Normalize alef variants
+     text = re.sub('[إأآا]', 'ا', text)
+     # Normalize yeh
+     text = re.sub('ى', 'ي', text)
+     # Normalize teh marbuta
+     text = re.sub('ة', 'ه', text)
+     return text
+ 
+ 
+ def get_arabic_words(text: str) -> List[str]:
+     """Extract Arabic words from text"""
+     words = text.split()
+     return [w for w in words if any(is_arabic_char(c) for c in w)]
+ 
+ 
+ def remove_diacritics(text: str) -> str:
+     """Remove Arabic diacritics from text"""
+     diacritics = 'ًٌٍَُِّْٰ'
+     return ''.join(c for c in text if c not in diacritics)
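
Finally, a short sanity check of the text utilities, with expected output in comments:

```python
from utils import has_diacritics, remove_diacritics, normalize_arabic, count_arabic_chars

s = "قَالَ"  # "he said", written with diacritics
print(has_diacritics(s))               # True
print(remove_diacritics(s))            # قال
print(normalize_arabic("مدرسة"))       # مدرسه (teh marbuta -> heh)
print(count_arabic_chars("hi مرحبا"))  # 5
```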