Hebrew Manuscript Joint Entity-Role Extraction
Fine-tuned DictaBERT model for joint Named Entity Recognition (NER) and Role Classification on Hebrew manuscript MARC records.
Model Description
This model performs two tasks simultaneously:
- Named Entity Recognition: Identifies PERSON entities in Hebrew text
- Role Classification: Classifies identified persons into roles (AUTHOR, SCRIBE, PATRON)
Both tasks are trained jointly on a shared encoder, so each can benefit from representations learned for the other. On the validation set, the joint model reaches a higher NER F1 than the NER-only baselines (see Comparison to Baseline).
Key Features
- Multi-task learning: Joint optimization of NER and classification objectives
- Domain-adapted: Fine-tuned on historical Hebrew manuscripts
- Weak supervision: Trained using distant supervision from MARC catalog records
- Resource-efficient: Trained on consumer hardware (M1 Mac) in ~1 hour
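To illustrate the distant-supervision idea, the sketch below tags whitespace tokens with BIO labels by matching names from a structured catalog field against a free-text note. The function name `bio_tags_from_names`, the whitespace tokenization, and the exact-match strategy are assumptions made for illustration; this card does not describe the actual labeling pipeline.

```python
def bio_tags_from_names(text, known_names):
    """Label whitespace tokens with B-PERSON / I-PERSON / O by exact name match."""
    tokens = text.split()
    tags = ["O"] * len(tokens)
    # Try longer names first so a full name wins over a shorter name it contains.
    for name in sorted(known_names, key=lambda n: len(n.split()), reverse=True):
        name_tokens = name.split()
        n = len(name_tokens)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            span_is_free = all(t == "O" for t in tags[start:start + n])
            # Only tag spans that are still unlabeled, to avoid overlaps.
            if tokens[start:start + n] == name_tokens and span_is_free:
                tags[start] = "B-PERSON"
                for i in range(start + 1, start + n):
                    tags[i] = "I-PERSON"
    return list(zip(tokens, tags))


# Hypothetical record: a free-text note plus names from a structured field.
note = "כתב ר' אברהם בן דוד והעתיק משה הסופר"
names = ["אברהם בן דוד", "משה"]
for token, tag in bio_tags_from_names(note, names):
    print(token, tag)
```

Exact matching of this kind yields noisy labels (spelling variants and abbreviations are missed), which is why such supervision is usually described as weak.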
Intended Use
Extract person names and their roles from Hebrew manuscript catalog records, particularly MARC format bibliographic descriptions.
Primary applications:
- Digital humanities: Manuscript cataloging
- Library science: Automated metadata extraction
- Historical research: Person-role relationship extraction
- Linked Open Data (LOD): Converting MARC to RDF triples
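As a rough sketch of the Linked Open Data use case, the function below turns extracted (name, role) pairs into N-Triples lines. The `http://example.org/` namespace, the URI layout, and the predicate names are placeholders chosen for illustration, not a vocabulary used by this project.

```python
def to_ntriples(record_id, persons, base="http://example.org/"):
    """Serialize (name, role) pairs as N-Triples lines; all URIs are placeholders."""

    def literal(value):
        # Escape backslashes, double quotes, and newlines for an N-Triples literal.
        escaped = (
            value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        )
        return f'"{escaped}"@he'

    lines = []
    manuscript = f"<{base}manuscript/{record_id}>"
    for index, (name, role) in enumerate(persons, start=1):
        person = f"<{base}manuscript/{record_id}/person/{index}>"
        # One triple for the person's name, one linking the manuscript to the person.
        lines.append(f"{person} <{base}vocab/name> {literal(name)} .")
        lines.append(f"{manuscript} <{base}vocab/{role.lower()}> {person} .")
    return lines


# Hypothetical extraction result for a single record.
for line in to_ntriples("ms-001", [("משה הסופר", "SCRIBE")]):
    print(line)
```

A production pipeline would more likely map roles to an established vocabulary and reconcile persons against authority files; that choice is outside the scope of this sketch.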
Training Details
Training Data
- Source: Hebrew manuscript MARC records
- Training samples: 8,794 (after data augmentation with entity substitution)
- Validation samples: 760
- Test samples: 799
- Annotation method: Distant supervision from structured MARC fields
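The entity-substitution augmentation mentioned above can be sketched as follows: each PERSON span in a BIO-tagged sequence is replaced with a name sampled from a pool, and the tags are regenerated to fit the new name. The function name `substitute_entity`, the (token, tag) representation, and the name pool are assumptions for illustration; the card does not specify how the augmentation was implemented.

```python
import random


def substitute_entity(tagged_tokens, replacement_names, rng=random):
    """Swap every PERSON span for a sampled name (names must be non-empty)."""
    output = []
    i = 0
    while i < len(tagged_tokens):
        token, tag = tagged_tokens[i]
        if tag == "B-PERSON":
            # Skip the rest of the original span.
            i += 1
            while i < len(tagged_tokens) and tagged_tokens[i][1] == "I-PERSON":
                i += 1
            # Insert the replacement name with fresh BIO tags.
            new_tokens = rng.choice(replacement_names).split()
            output.append((new_tokens[0], "B-PERSON"))
            output.extend((t, "I-PERSON") for t in new_tokens[1:])
        else:
            output.append((token, tag))
            i += 1
    return output


# Hypothetical tagged sample and name pool.
sample = [("והעתיק", "O"), ("משה", "B-PERSON"), ("הסופר", "O")]
pool = ["יעקב בן משה", "אברהם בן דוד"]
print(substitute_entity(sample, pool, rng=random.Random(0)))
```

Because the surrounding context is kept unchanged, this kind of augmentation multiplies training samples without adding new sentence patterns.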
Training Procedure
- Base model: dicta-il/dictabert
- Architecture: Joint model with shared BERT encoder + task-specific heads
- Epochs: 5
- Batch size: 4 (with gradient accumulation)
- Learning rate: 2e-5
- Lambda (task balance): 0.5
- Optimizer: AdamW
- Training time: ~1 hour on Apple M1 Mac
- Framework: PyTorch + Transformers
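A minimal sketch of what "shared BERT encoder + task-specific heads" could look like in PyTorch is given below. The class name `JointHeads`, the label counts, and the choice to classify the role from the first ([CLS]) position are assumptions; the released architecture may differ, for example by classifying each entity span separately. A random tensor stands in for the encoder output so the example runs without downloading a model.

```python
import torch
from torch import nn


class JointHeads(nn.Module):
    """Two task heads applied to hidden states from a shared encoder (hypothetical)."""

    def __init__(self, hidden_size, num_ner_labels=3, num_roles=3):
        super().__init__()
        # Per-token head, e.g. O / B-PERSON / I-PERSON.
        self.ner_head = nn.Linear(hidden_size, num_ner_labels)
        # Per-sequence head, e.g. AUTHOR / SCRIBE / PATRON.
        self.role_head = nn.Linear(hidden_size, num_roles)

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden_size)
        ner_logits = self.ner_head(hidden_states)
        # Assumption: the role is predicted from the first token's representation.
        role_logits = self.role_head(hidden_states[:, 0])
        return ner_logits, role_logits


# Random stand-in for encoder output; the hidden size of 16 is arbitrary.
hidden = torch.randn(2, 6, 16)
heads = JointHeads(hidden_size=16)
ner_logits, role_logits = heads(hidden)
print(ner_logits.shape, role_logits.shape)
```

In a real setup, `hidden_states` would come from the encoder's last hidden layer, and both heads would be trained together so that gradients from either task update the shared encoder.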
Multi-Task Loss
L_total = λ * L_NER + (1 - λ) * L_classification
Where λ = 0.5 balances the two tasks equally.
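The weighting can be checked with a few lines of plain Python. The function name `combine_losses` and the loss values are illustrative only and are not taken from the actual training run.

```python
def combine_losses(ner_loss, classification_loss, lam=0.5):
    """Weighted sum of the two task losses; lam must lie in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be between 0 and 1")
    return lam * ner_loss + (1.0 - lam) * classification_loss


# Illustrative per-batch loss values.
ner_loss, classification_loss = 0.8, 0.2

# lam = 0.5 reduces to the plain average of the two losses.
print(combine_losses(ner_loss, classification_loss))

# A larger lam shifts the total toward the NER loss.
print(combine_losses(ner_loss, classification_loss, lam=0.9))
```

Setting lam to 1 or 0 recovers single-task training on NER or classification respectively, which makes the parameter a convenient knob for ablations.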
Evaluation
Validation Set Performance
| Task | Metric | Score |
|---|---|---|
| NER | Precision | 88.00% |
| NER | Recall | 91.00% |
| NER | F1 Score | 89.40% |
| Classification | Accuracy | 100.00% |
Test Set Performance
| Task | Metric | Score |
|---|---|---|
| NER | Precision | 47.00% |
| NER | Recall | 81.00% |
| NER | F1 Score | 59.41% |
| Classification | Accuracy | 100.00% |
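Note: Test F1 (59.41%) is well below validation F1 (89.40%). Recall drops moderately (91% to 81%) while precision falls sharply (88% to 47%), so most of the gap comes from false-positive PERSON spans on the test set. This suggests potential overfitting to the validation distribution. Future work will address this with more diverse test data.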
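For reference, F1 is the harmonic mean of precision and recall. The short check below recomputes it from the precision and recall values in the tables above; the helper name `f1_score` is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the validation and test tables.
for split, p, r in [("validation", 0.88, 0.91), ("test", 0.47, 0.81)]:
    print(f"{split}: F1 = {f1_score(p, r):.2%}")
```

The recomputed values land within about 0.1 percentage points of the reported F1 scores, which is consistent with the precision and recall figures having been rounded to whole percentages.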
Comparison to Baseline
| Model | Validation F1 | Improvement |
|---|---|---|
| Baseline NER | 55.64% | - |
| + CRF Layer | 84.39% | +28.75 pp |
| Joint Model (This) | 89.40% | +33.76 pp |
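The improvements are absolute differences in percentage points (pp), not relative percentages. The snippet below recomputes both from the table values; the helper name `gains` is illustrative.

```python
def gains(baseline_f1, model_f1):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_pp = round(model_f1 - baseline_f1, 2)
    relative_pct = round(100 * (model_f1 - baseline_f1) / baseline_f1, 1)
    return absolute_pp, relative_pct


baseline = 55.64  # Baseline NER validation F1 from the table
for name, f1 in [("+ CRF Layer", 84.39), ("Joint Model", 89.40)]:
    absolute_pp, relative_pct = gains(baseline, f1)
    print(f"{name}: +{absolute_pp} pp ({relative_pct}% relative)")
```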
Usage
Installation
pip install transformers torch
Basic Usage
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load the tokenizer and model from the Hub
model_name = "alexgoldberg/hebrew-manuscript-joint-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# "Written by R. Yaakov ben Moshe"
text = "נכתב על ידי ר' יעקב בן משה"
inputs = tokenizer(text, return_tensors="pt")

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=-1)

# Map token IDs and predicted label IDs back to strings
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]
for token, label in zip(tokens, labels):
    if token not in ['[CLS]', '[SEP]', '[PAD]']:
        print(f"{token}: {label}")
Advanced Usage: Extract Entities
def extract_entities(text, model, tokenizer):
    """Extract PERSON entities from Hebrew text."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    labels = [model.config.id2label[p.item()] for p in predictions[0]]

    entities = []
    current_entity = []  # words of the entity currently being built
    for token, label in zip(tokens, labels):
        if token in tokenizer.all_special_tokens:
            continue  # skip [CLS], [SEP], [PAD]
        is_subword = token.startswith('##')
        piece = token.replace('##', '')
        if label == 'B-PERSON':
            if current_entity:
                entities.append(' '.join(current_entity))
            current_entity = [piece]
        elif label == 'I-PERSON' and current_entity:
            if is_subword:
                current_entity[-1] += piece  # continue the same word
            else:
                current_entity.append(piece)  # next word of the same name
        else:
            if current_entity:
                entities.append(' '.join(current_entity))
            current_entity = []
    if current_entity:
        entities.append(' '.join(current_entity))
    return entities

# "R. Avraham ben David wrote, and Moshe the scribe copied"
text = "כתב ר' אברהם בן דוד והעתיק משה הסופר"
entities = extract_entities(text, model, tokenizer)
print("Found entities:", entities)
Limitations
- Domain-specific: Optimized for Hebrew manuscript catalog records; performance may degrade on other text types
- Single entity type: Only identifies PERSON entities (not PLACE, DATE, WORK, etc.)
- Role coverage: Limited to AUTHOR, SCRIBE, PATRON roles
- Historical Hebrew: Best performance on historical/rabbinical Hebrew; may underperform on modern Hebrew
- Test set gap: Validation F1 (89.40%) significantly higher than test F1 (59.41%), indicating potential overfitting
Ethical Considerations
- Bias: Training data derived from library catalogs may reflect historical biases in manuscript preservation
- Cultural sensitivity: Model handles religious and cultural content; users should apply appropriate domain expertise
- Accuracy: Not suitable for critical applications without human review
Citation
If you use this model, please cite:
@misc{goldberg2025hebrewjoint,
author = {Goldberg, Alexander},
title = {Hebrew Manuscript Joint Entity-Role Extraction Model},
year = {2025},
publisher = {HuggingFace},
url = {https://huggingface.co/alexgoldberg/hebrew-manuscript-joint-ner}
}
Contact
- Author: Alexander Goldberg
- Institution: Technion - Israel Institute of Technology
- Email: [email protected]
- Paper: [Link to paper when published]
Acknowledgments
- Base model: DictaBERT by Dicta team
- Dataset: Hebrew manuscript MARC records from multiple libraries
- Framework: HuggingFace Transformers
License
MIT License - See LICENSE file for details.
Model Card Authors
Alexander Goldberg
Model Card Contact
[email protected]