alexgoldberg committed
Commit 2430c6f · verified · 1 parent: 6079d7a

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+vocab.txt filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,263 @@
---
language:
- he
license: mit
tags:
- named-entity-recognition
- token-classification
- hebrew
- historical-manuscripts
- joint-learning
- multi-task-learning
datasets:
- custom
metrics:
- f1
- accuracy
model-index:
- name: hebrew-manuscript-joint-ner
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    metrics:
    - type: f1
      value: 89.40
      name: Validation F1
    - type: f1
      value: 59.41
      name: Test F1
  - task:
      type: text-classification
      name: Role Classification
    metrics:
    - type: accuracy
      value: 100.0
      name: Validation Accuracy
    - type: accuracy
      value: 100.0
      name: Test Accuracy
---

# Hebrew Manuscript Joint Entity-Role Extraction

A fine-tuned [DictaBERT](https://huggingface.co/dicta-il/dictabert) model for joint Named Entity Recognition (NER) and Role Classification on Hebrew manuscript MARC records.

## Model Description

This model performs two tasks simultaneously:
1. **Named Entity Recognition**: identifies PERSON entities in Hebrew text
2. **Role Classification**: classifies identified persons into roles (AUTHOR, SCRIBE, PATRON)

The joint training approach outperforms pipeline architectures by sharing representations between the two related tasks.

### Key Features

- **Multi-task learning**: joint optimization of the NER and classification objectives
- **Domain-adapted**: fine-tuned on historical Hebrew manuscripts
- **Weak supervision**: trained using distant supervision from MARC catalog records
- **Resource-efficient**: trained on consumer hardware (an M1 Mac) in ~1 hour

## Intended Use

Extract person names and their roles from Hebrew manuscript catalog records, particularly MARC-format bibliographic descriptions.

**Primary applications**:
- Digital humanities: manuscript cataloging
- Library science: automated metadata extraction
- Historical research: person-role relationship extraction
- Linked Open Data (LOD): converting MARC records to RDF triples

## Training Details

### Training Data

- **Source**: Hebrew manuscript MARC records
- **Training samples**: 8,794 (after data augmentation with entity substitution)
- **Validation samples**: 760
- **Test samples**: 799
- **Annotation method**: distant supervision from structured MARC fields (see the sketch below)
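
The annotation and augmentation code is not part of this upload, so the following is only a minimal sketch of the idea under stated assumptions (the helper names are hypothetical, and a real pipeline would read actual MARC fields): a name taken from a structured field is projected onto the record's free text as BIO labels, and entity substitution then swaps that span for another catalog name to produce extra training samples.

```python
def bio_labels_from_marc(note_tokens, person_name):
    """Distant supervision (sketch): project a name from a structured
    MARC field onto free-text note tokens as B-/I-PERSON labels."""
    name_tokens = person_name.split()
    labels = ["O"] * len(note_tokens)
    for i in range(len(note_tokens) - len(name_tokens) + 1):
        if note_tokens[i:i + len(name_tokens)] == name_tokens:
            labels[i] = "B-PERSON"
            for j in range(i + 1, i + len(name_tokens)):
                labels[j] = "I-PERSON"
    return labels


def substitute_entity(note_tokens, labels, new_name):
    """Entity-substitution augmentation (sketch): replace a labeled
    person span with a different name drawn from the catalog."""
    out_tokens, out_labels = [], []
    i = 0
    while i < len(note_tokens):
        if labels[i] == "B-PERSON":
            # Skip the original entity span...
            i += 1
            while i < len(note_tokens) and labels[i] == "I-PERSON":
                i += 1
            # ...and splice in the substitute name with fresh labels.
            for k, tok in enumerate(new_name.split()):
                out_tokens.append(tok)
                out_labels.append("B-PERSON" if k == 0 else "I-PERSON")
        else:
            out_tokens.append(note_tokens[i])
            out_labels.append(labels[i])
            i += 1
    return out_tokens, out_labels


# "Written by R. Yaakov ben Moshe"; the author field holds "Yaakov ben Moshe"
tokens = "נכתב על ידי ר' יעקב בן משה".split()
labels = bio_labels_from_marc(tokens, "יעקב בן משה")
aug_tokens, aug_labels = substitute_entity(tokens, labels, "אברהם בן דוד")
```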

### Training Procedure

- **Base model**: dicta-il/dictabert
- **Architecture**: joint model with a shared BERT encoder and task-specific heads
- **Epochs**: 5
- **Batch size**: 4 (with gradient accumulation)
- **Learning rate**: 2e-5
- **Lambda (task balance)**: 0.5
- **Optimizer**: AdamW
- **Training time**: ~1 hour on an Apple M1 Mac
- **Framework**: PyTorch + Transformers

### Multi-Task Loss

```
L_total = λ * L_NER + (1 - λ) * L_classification
```

Here λ = 0.5 balances the two tasks equally.
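
The joint training code itself is not included in this upload (the released checkpoint loads as a plain `BertForTokenClassification`). Purely as an illustrative sketch of the architecture and loss described above; the class name, argument names, and the [CLS] pooling for the role head are assumptions:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class JointEntityRoleModel(nn.Module):
    """Illustrative joint model: one shared encoder, two task heads."""

    def __init__(self, base="dicta-il/dictabert", num_ner_labels=3, num_roles=3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base)
        hidden = self.encoder.config.hidden_size
        self.ner_head = nn.Linear(hidden, num_ner_labels)  # O / B-PERSON / I-PERSON
        self.role_head = nn.Linear(hidden, num_roles)      # AUTHOR / SCRIBE / PATRON

    def forward(self, input_ids, attention_mask,
                ner_labels=None, role_label=None, lam=0.5):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        ner_logits = self.ner_head(out.last_hidden_state)          # per-token logits
        role_logits = self.role_head(out.last_hidden_state[:, 0])  # [CLS] pooling (assumption)
        loss = None
        if ner_labels is not None and role_label is not None:
            ce = nn.CrossEntropyLoss()
            l_ner = ce(ner_logits.view(-1, ner_logits.size(-1)), ner_labels.view(-1))
            l_cls = ce(role_logits, role_label)
            loss = lam * l_ner + (1 - lam) * l_cls  # L_total = λ·L_NER + (1-λ)·L_cls
        return loss, ner_logits, role_logits
```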

## Evaluation

### Validation Set Performance

| Task | Metric | Score |
|------|--------|-------|
| NER | Precision | 88.00% |
| NER | Recall | 91.00% |
| NER | F1 Score | **89.40%** |
| Classification | Accuracy | **100.00%** |

### Test Set Performance

| Task | Metric | Score |
|------|--------|-------|
| NER | Precision | 47.00% |
| NER | Recall | 81.00% |
| NER | F1 Score | 59.41% |
| Classification | Accuracy | 100.00% |

**Note**: the gap between validation and test F1 suggests potential overfitting to the validation distribution. Future work will address this with more diverse test data.

### Comparison to Baseline

| Model | Validation F1 | Improvement |
|-------|---------------|-------------|
| Baseline NER | 55.64% | - |
| + CRF layer | 84.39% | +28.75 pp |
| **Joint model (this work)** | **89.40%** | **+33.76 pp** |

## Usage

### Installation

```bash
pip install transformers torch
```

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model_name = "alexgoldberg/hebrew-manuscript-joint-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Example Hebrew text from a manuscript catalog
# ("Written by R. Yaakov ben Moshe")
text = "נכתב על ידי ר' יעקב בן משה"

# Tokenize
inputs = tokenizer(text, return_tensors="pt")

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Map token IDs and label IDs back to strings
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]

# Print per-token labels, skipping special tokens
for token, label in zip(tokens, labels):
    if token not in ('[CLS]', '[SEP]', '[PAD]'):
        print(f"{token}: {label}")
```

### Advanced Usage: Extract Entities

```python
def extract_entities(text, model, tokenizer):
    """Extract PERSON entity strings from Hebrew text."""
    inputs = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.argmax(outputs.logits, dim=-1)

    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    labels = [model.config.id2label[p.item()] for p in predictions[0]]

    def add_token(entity, token):
        # Merge WordPiece continuations ("##...") into the previous
        # word; otherwise start a new word.
        if token.startswith('##') and entity:
            entity[-1] += token[2:]
        else:
            entity.append(token)

    entities = []
    current_entity = []

    for token, label in zip(tokens, labels):
        if label == 'B-PERSON':
            if current_entity:
                entities.append(' '.join(current_entity))
                current_entity = []
            add_token(current_entity, token)
        elif label == 'I-PERSON' and current_entity:
            add_token(current_entity, token)
        else:
            if current_entity:
                entities.append(' '.join(current_entity))
                current_entity = []

    if current_entity:
        entities.append(' '.join(current_entity))

    return entities

# Example ("R. Avraham ben David wrote it and Moshe the scribe copied it")
text = "כתב ר' אברהם בן דוד והעתיק משה הסופר"
entities = extract_entities(text, model, tokenizer)
print("Found entities:", entities)
# Expected output: ['אברהם בן דוד', 'משה הסופר']
```
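
For the Linked Open Data application listed under Intended Use, a small, hypothetical sketch of turning extracted (person, role) pairs into RDF triples might look as follows. rdflib is not a dependency of this model, and the namespace and predicate names are invented for illustration:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDFS

# Invented namespace and role predicates, for illustration only
EX = Namespace("http://example.org/manuscripts/")
ROLE_PREDICATES = {
    "AUTHOR": EX.hasAuthor,
    "SCRIBE": EX.hasScribe,
    "PATRON": EX.hasPatron,
}

def to_triples(manuscript_id, person_roles):
    """Map (person name, role) pairs for one record to RDF triples."""
    g = Graph()
    ms = EX[manuscript_id]
    for name, role in person_roles:
        person = EX["person/" + name.replace(" ", "_")]
        g.add((person, RDFS.label, Literal(name, lang="he")))
        g.add((ms, ROLE_PREDICATES[role], person))
    return g

g = to_triples("ms001", [("אברהם בן דוד", "AUTHOR"), ("משה הסופר", "SCRIBE")])
print(g.serialize(format="turtle"))
```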

## Limitations

1. **Domain-specific**: optimized for Hebrew manuscript catalog records; performance may degrade on other text types
2. **Single entity type**: identifies only PERSON entities (not PLACE, DATE, WORK, etc.)
3. **Role coverage**: limited to the AUTHOR, SCRIBE, and PATRON roles
4. **Historical Hebrew**: performs best on historical/rabbinical Hebrew; may underperform on modern Hebrew
5. **Test-set gap**: validation F1 (89.40%) is significantly higher than test F1 (59.41%), indicating potential overfitting

## Ethical Considerations

- **Bias**: training data derived from library catalogs may reflect historical biases in manuscript preservation
- **Cultural sensitivity**: the model handles religious and cultural content; users should apply appropriate domain expertise
- **Accuracy**: not suitable for critical applications without human review

## Citation

If you use this model, please cite:

```bibtex
@misc{goldberg2025hebrewjoint,
  author = {Goldberg, Alexander},
  title = {Hebrew Manuscript Joint Entity-Role Extraction Model},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/alexgoldberg/hebrew-manuscript-joint-ner}
}
```

## Contact

- **Author**: Alexander Goldberg
- **Institution**: Technion - Israel Institute of Technology
- **Email**: [email protected]
- **Paper**: [Link to paper when published]

## Acknowledgments

- **Base model**: DictaBERT by the Dicta team
- **Dataset**: Hebrew manuscript MARC records from multiple libraries
- **Framework**: Hugging Face Transformers

## License

MIT License - see the LICENSE file for details.

## Model Card Authors

Alexander Goldberg

## Model Card Contact
config.json ADDED
@@ -0,0 +1,18 @@
{
  "architectures": [
    "BertForTokenClassification"
  ],
  "model_type": "bert",
  "num_labels": 3,
  "id2label": {
    "0": "O",
    "1": "B-PERSON",
    "2": "I-PERSON"
  },
  "label2id": {
    "O": 0,
    "B-PERSON": 1,
    "I-PERSON": 2
  },
  "base_model": "dicta-il/dictabert"
}
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ac9cd305a2757b25f972b3202591bc9ee89611ba4181b568fff5e4044be36a9b
size 2219554257
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
{
  "cls_token": {
    "content": "[CLS]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "[MASK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "[PAD]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "[SEP]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "[UNK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,64 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "4": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "5": {
      "content": "[BLANK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_lower_case": true,
  "extra_special_tokens": {},
  "mask_token": "[MASK]",
  "model_max_length": 512,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
training_info.json ADDED
@@ -0,0 +1,12 @@
{
  "framework": "pytorch",
  "training_time_hours": 1,
  "hardware": "Apple M1 Mac",
  "training_samples": 8794,
  "validation_samples": 760,
  "test_samples": 799,
  "best_validation_f1": 89.4,
  "best_validation_accuracy": 100.0,
  "test_f1": 59.41,
  "test_accuracy": 100.0
}
vocab.txt ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0fb90bfa35244d26f0065d1fcd0b5becc3da3d44d616a7e2aacaf6320b9fa2d0
size 1500244