Hebrew Manuscript Joint Entity-Role Extraction

Fine-tuned DictaBERT model for joint Named Entity Recognition (NER) and Role Classification on Hebrew manuscript MARC records.

Model Description

This model performs two tasks simultaneously:

  1. Named Entity Recognition: Identifies PERSON entities in Hebrew text
  2. Role Classification: Classifies identified persons into roles (AUTHOR, SCRIBE, PATRON)

By sharing encoder representations between the two related tasks, the joint training approach outperforms pipeline architectures that run NER and role classification as separate stages.
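
A minimal sketch of this architecture, assuming a shared DictaBERT encoder with a token-level NER head and a [CLS]-based role head. Layer names and label counts here are illustrative assumptions, not the released implementation:

import torch
import torch.nn as nn
from transformers import AutoModel

class JointNerRoleModel(nn.Module):
    """Shared BERT encoder with two task-specific heads (illustrative sketch)."""

    def __init__(self, base="dicta-il/dictabert", num_ner_labels=3, num_roles=3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base)  # shared encoder
        hidden = self.encoder.config.hidden_size
        self.ner_head = nn.Linear(hidden, num_ner_labels)  # per token: O / B-PERSON / I-PERSON
        self.role_head = nn.Linear(hidden, num_roles)      # per sequence: AUTHOR / SCRIBE / PATRON

    def forward(self, input_ids, attention_mask=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        token_states = out.last_hidden_state               # (batch, seq, hidden)
        ner_logits = self.ner_head(token_states)           # token-level NER logits
        role_logits = self.role_head(token_states[:, 0])   # [CLS] representation for roles
        return ner_logits, role_logits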

Key Features

  • Multi-task learning: Joint optimization of NER and classification objectives
  • Domain-adapted: Fine-tuned on historical Hebrew manuscripts
  • Weak supervision: Trained using distant supervision from MARC catalog records
  • Resource-efficient: Trained on consumer hardware (M1 Mac) in ~1 hour

Intended Use

Extract person names and their roles from Hebrew manuscript catalog records, particularly MARC format bibliographic descriptions.

Primary applications:

  • Digital humanities: Manuscript cataloging
  • Library science: Automated metadata extraction
  • Historical research: Person-role relationship extraction
  • Linked Open Data (LOD): Converting MARC to RDF triples (see the sketch below)
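
To illustrate the LOD use case, extracted person-role pairs can be serialized as RDF. A minimal sketch using rdflib with a hypothetical example namespace; the predicate vocabulary is an assumption, not part of this model:

# pip install rdflib
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

# Hypothetical vocabulary for the example; substitute your own ontology.
EX = Namespace("http://example.org/manuscripts/")

g = Graph()
ms = URIRef(EX["ms-heb-001"])
person = URIRef(EX["yaakov-ben-moshe"])

g.add((ms, RDF.type, EX.Manuscript))
g.add((person, RDF.type, EX.Person))
g.add((person, EX.name, Literal("专' 讬注拽讘 讘谉 诪砖讛", lang="he")))
g.add((ms, EX.scribe, person))  # role predicted by the model

print(g.serialize(format="turtle"))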

Training Details

Training Data

  • Source: Hebrew manuscript MARC records
  • Training samples: 8,794 (after data augmentation with entity substitution; see the sketch after this list)
  • Validation samples: 760
  • Test samples: 799
  • Annotation method: Distant supervision from structured MARC fields
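
A minimal sketch of the entity-substitution augmentation used to expand the training set, assuming BIO-tagged examples and a pool of person names. The helper and data layout are illustrative:

import random

def substitute_entity(tokens, labels, name_pool):
    """Replace one PERSON span with a random name from the pool.

    tokens/labels are parallel lists in BIO format (illustrative data layout).
    """
    # Collect all (start, end) PERSON spans.
    spans, start = [], None
    for i, lab in enumerate(labels):
        if lab == "B-PERSON":
            if start is not None:
                spans.append((start, i))
            start = i
        elif lab != "I-PERSON" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(labels)))
    if not spans:
        return tokens, labels

    # Swap one span for a randomly chosen name, re-deriving its BIO tags.
    s, e = random.choice(spans)
    new_name = random.choice(name_pool).split()
    new_tokens = tokens[:s] + new_name + tokens[e:]
    new_labels = labels[:s] + ["B-PERSON"] + ["I-PERSON"] * (len(new_name) - 1) + labels[e:]
    return new_tokens, new_labels

# Example
tokens = ["谞讻转讘", "注诇", "讬讚讬", "讬注拽讘", "讘谉", "诪砖讛"]
labels = ["O", "O", "O", "B-PERSON", "I-PERSON", "I-PERSON"]
aug_tokens, aug_labels = substitute_entity(tokens, labels, ["讗讘专讛诐 讘谉 讚讜讚"])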

Training Procedure

  • Base model: dicta-il/dictabert
  • Architecture: Joint model with shared BERT encoder + task-specific heads
  • Epochs: 5
  • Batch size: 4 (with gradient accumulation)
  • Learning rate: 2e-5
  • Lambda (task balance): 0.5
  • Optimizer: AdamW
  • Training time: ~1 hour on Apple M1 Mac
  • Framework: PyTorch + Transformers

Multi-Task Loss

L_total = 位 * L_NER + (1 - 位) * L_classification

Where 位=0.5 balances the two tasks equally.
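
A minimal training-step sketch combining this loss with the AdamW and gradient-accumulation settings listed above. It assumes a model returning (ner_logits, role_logits) as in the earlier architecture sketch and a DataLoader yielding the named fields; the accumulation factor of 4 is an assumption, as the card does not state it:

import torch.nn.functional as F
from torch.optim import AdamW

def joint_loss(ner_logits, ner_labels, role_logits, role_labels, lam=0.5):
    # Token-level cross-entropy over BIO tags; -100 masks special/padding tokens.
    l_ner = F.cross_entropy(
        ner_logits.view(-1, ner_logits.size(-1)), ner_labels.view(-1), ignore_index=-100
    )
    # Sequence-level cross-entropy over the role labels.
    l_cls = F.cross_entropy(role_logits, role_labels)
    return lam * l_ner + (1 - lam) * l_cls

optimizer = AdamW(model.parameters(), lr=2e-5)
accum_steps = 4  # assumed accumulation factor; effective batch = 4 * 4 = 16

model.train()
for step, batch in enumerate(train_loader):
    ner_logits, role_logits = model(batch["input_ids"], batch["attention_mask"])
    loss = joint_loss(ner_logits, batch["ner_labels"], role_logits, batch["role_labels"])
    (loss / accum_steps).backward()  # scale so the update averages over the window
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()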

Evaluation

Validation Set Performance

Task            Metric      Score
NER             Precision   88.00%
NER             Recall      91.00%
NER             F1 Score    89.40%
Classification  Accuracy    100.00%

Test Set Performance

Task            Metric      Score
NER             Precision   47.00%
NER             Recall      81.00%
NER             F1 Score    59.41%
Classification  Accuracy    100.00%

Note: The gap between validation and test F1 suggests potential overfitting to the validation distribution. Future work will address this with more diverse test data.

Comparison to Baseline

Model                Validation F1   Improvement
Baseline NER         55.64%          -
+ CRF Layer          84.39%          +28.75 pp
Joint Model (this)   89.40%          +33.76 pp

Usage

Installation

pip install transformers torch

Basic Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model_name = "alexgoldberg/hebrew-manuscript-joint-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Example Hebrew text from a manuscript catalog
# ("Written by R. Jacob ben Moshe")
text = "谞讻转讘 注诇 讬讚讬 专' 讬注拽讘 讘谉 诪砖讛"

# Tokenize
inputs = tokenizer(text, return_tensors="pt")

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Decode predictions
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]

# Print results
for token, label in zip(tokens, labels):
    if token not in ['[CLS]', '[SEP]', '[PAD]']:
        print(f"{token}: {label}")

Advanced Usage: Extract Entities

def extract_entities(text, model, tokenizer):
    """Extract PERSON entities from Hebrew text."""
    inputs = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.argmax(outputs.logits, dim=-1)

    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    labels = [model.config.id2label[p.item()] for p in predictions[0]]

    entities = []
    current = []  # words of the entity currently being built

    for token, label in zip(tokens, labels):
        if token in ('[CLS]', '[SEP]', '[PAD]'):
            continue
        if label == 'B-PERSON':
            if current:
                entities.append(' '.join(current))
            current = [token]
        elif label == 'I-PERSON' and current:
            if token.startswith('##'):
                current[-1] += token[2:]  # glue WordPiece continuation to the previous word
            else:
                current.append(token)     # new word within the same entity
        else:
            if current:
                entities.append(' '.join(current))
            current = []

    if current:
        entities.append(' '.join(current))

    return entities

# Example ("R. Abraham ben David wrote [it] and Moshe the scribe copied [it]")
text = "讻转讘 专' 讗讘专讛诐 讘谉 讚讜讚 讜讛注转讬拽 诪砖讛 讛住讜驻专"
entities = extract_entities(text, model, tokenizer)
print("Found entities:", entities)
# Output: ['讗讘专讛诐 讘谉 讚讜讚', '诪砖讛 讛住讜驻专']
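
Alternatively, the Transformers pipeline API can handle subword aggregation for you. A minimal sketch; the exact output depends on the model's label set and config:

from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="alexgoldberg/hebrew-manuscript-joint-ner",
    aggregation_strategy="simple",  # merge WordPiece tokens into full spans
)

for ent in ner("讻转讘 专' 讗讘专讛诐 讘谉 讚讜讚 讜讛注转讬拽 诪砖讛 讛住讜驻专"):
    print(ent["word"], ent["entity_group"], round(ent["score"], 3))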

Limitations

  1. Domain-specific: Optimized for Hebrew manuscript catalog records; performance may degrade on other text types
  2. Single entity type: Only identifies PERSON entities (not PLACE, DATE, WORK, etc.)
  3. Role coverage: Limited to AUTHOR, SCRIBE, PATRON roles
  4. Historical Hebrew: Best performance on historical/rabbinical Hebrew; may underperform on modern Hebrew
  5. Test set gap: Validation F1 (89.40%) significantly higher than test F1 (59.41%), indicating potential overfitting

Ethical Considerations

  • Bias: Training data derived from library catalogs may reflect historical biases in manuscript preservation
  • Cultural sensitivity: Model handles religious and cultural content; users should apply appropriate domain expertise
  • Accuracy: Not suitable for critical applications without human review

Citation

If you use this model, please cite:

@misc{goldberg2025hebrewjoint,
  author = {Goldberg, Alexander},
  title = {Hebrew Manuscript Joint Entity-Role Extraction Model},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/alexgoldberg/hebrew-manuscript-joint-ner}
}

Contact

  • Author: Alexander Goldberg
  • Institution: Technion - Israel Institute of Technology
  • Email: [email protected]
  • Paper: [Link to paper when published]

Acknowledgments

  • Base model: DictaBERT by Dicta team
  • Dataset: Hebrew manuscript MARC records from multiple libraries
  • Framework: HuggingFace Transformers

License

MIT License - See LICENSE file for details.

Model Card Authors

Alexander Goldberg

Model Card Contact

[email protected]
