alexgoldberg committed
Commit 2430c6f · verified · 1 parent: 6079d7a

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+vocab.txt filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,263 @@
---
language:
- he
license: mit
tags:
- named-entity-recognition
- token-classification
- hebrew
- historical-manuscripts
- joint-learning
- multi-task-learning
datasets:
- custom
metrics:
- f1
- accuracy
model-index:
- name: hebrew-manuscript-joint-ner
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    metrics:
    - type: f1
      value: 89.40
      name: Validation F1
    - type: f1
      value: 59.41
      name: Test F1
  - task:
      type: text-classification
      name: Role Classification
    metrics:
    - type: accuracy
      value: 100.0
      name: Validation Accuracy
    - type: accuracy
      value: 100.0
      name: Test Accuracy
---

# Hebrew Manuscript Joint Entity-Role Extraction

A fine-tuned [DictaBERT](https://huggingface.co/dicta-il/dictabert) model for joint Named Entity Recognition (NER) and Role Classification on Hebrew manuscript MARC records.

## Model Description

This model performs two tasks simultaneously:
1. **Named Entity Recognition**: identifies PERSON entities in Hebrew text
2. **Role Classification**: classifies identified persons into roles (AUTHOR, SCRIBE, PATRON)

The joint training approach outperforms pipeline architectures by sharing representations between the two related tasks.

### Key Features

- **Multi-task learning**: joint optimization of the NER and classification objectives
- **Domain-adapted**: fine-tuned on historical Hebrew manuscripts
- **Weak supervision**: trained using distant supervision from MARC catalog records
- **Resource-efficient**: trained on consumer hardware (an M1 Mac) in ~1 hour

## Intended Use

Extract person names and their roles from Hebrew manuscript catalog records, particularly MARC-format bibliographic descriptions.

**Primary applications**:
- Digital humanities: manuscript cataloging
- Library science: automated metadata extraction
- Historical research: person-role relationship extraction
- Linked Open Data (LOD): converting MARC records to RDF triples

## Training Details

### Training Data

- **Source**: Hebrew manuscript MARC records
- **Training samples**: 8,794 (after data augmentation with entity substitution)
- **Validation samples**: 760
- **Test samples**: 799
- **Annotation method**: distant supervision from structured MARC fields (see the sketch below)
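
The annotation and augmentation code is not part of this upload, so the following is only a minimal sketch of the idea under stated assumptions (the helper names are hypothetical, and a real pipeline would read actual MARC fields): a name taken from a structured field is projected onto the record's free text as BIO labels, and entity substitution then swaps that span for another catalog name to produce extra training samples.

```python
def bio_labels_from_marc(note_tokens, person_name):
    """Distant supervision (sketch): project a name from a structured
    MARC field onto free-text note tokens as B-/I-PERSON labels."""
    name_tokens = person_name.split()
    labels = ["O"] * len(note_tokens)
    for i in range(len(note_tokens) - len(name_tokens) + 1):
        if note_tokens[i:i + len(name_tokens)] == name_tokens:
            labels[i] = "B-PERSON"
            for j in range(i + 1, i + len(name_tokens)):
                labels[j] = "I-PERSON"
    return labels


def substitute_entity(note_tokens, labels, new_name):
    """Entity-substitution augmentation (sketch): replace a labeled
    person span with a different name drawn from the catalog."""
    out_tokens, out_labels = [], []
    i = 0
    while i < len(note_tokens):
        if labels[i] == "B-PERSON":
            # Skip the original entity span...
            i += 1
            while i < len(note_tokens) and labels[i] == "I-PERSON":
                i += 1
            # ...and splice in the substitute name with fresh labels.
            for k, tok in enumerate(new_name.split()):
                out_tokens.append(tok)
                out_labels.append("B-PERSON" if k == 0 else "I-PERSON")
        else:
            out_tokens.append(note_tokens[i])
            out_labels.append(labels[i])
            i += 1
    return out_tokens, out_labels


# "Written by R. Yaakov ben Moshe"; the author field holds "Yaakov ben Moshe"
tokens = "נכתב על ידי ר' יעקב בן משה".split()
labels = bio_labels_from_marc(tokens, "יעקב בן משה")
aug_tokens, aug_labels = substitute_entity(tokens, labels, "אברהם בן דוד")
```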

### Training Procedure

- **Base model**: dicta-il/dictabert
- **Architecture**: joint model with a shared BERT encoder and task-specific heads
- **Epochs**: 5
- **Batch size**: 4 (with gradient accumulation)
- **Learning rate**: 2e-5
- **Lambda (task balance)**: 0.5
- **Optimizer**: AdamW
- **Training time**: ~1 hour on an Apple M1 Mac
- **Framework**: PyTorch + Transformers

### Multi-Task Loss

```
L_total = λ * L_NER + (1 - λ) * L_classification
```

Here λ = 0.5 balances the two tasks equally.
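
The joint training code itself is not included in this upload (the released checkpoint loads as a plain `BertForTokenClassification`). Purely as an illustrative sketch of the architecture and loss described above; the class name, argument names, and the [CLS] pooling for the role head are assumptions:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class JointEntityRoleModel(nn.Module):
    """Illustrative joint model: one shared encoder, two task heads."""

    def __init__(self, base="dicta-il/dictabert", num_ner_labels=3, num_roles=3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base)
        hidden = self.encoder.config.hidden_size
        self.ner_head = nn.Linear(hidden, num_ner_labels)  # O / B-PERSON / I-PERSON
        self.role_head = nn.Linear(hidden, num_roles)      # AUTHOR / SCRIBE / PATRON

    def forward(self, input_ids, attention_mask,
                ner_labels=None, role_label=None, lam=0.5):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        ner_logits = self.ner_head(out.last_hidden_state)          # per-token logits
        role_logits = self.role_head(out.last_hidden_state[:, 0])  # [CLS] pooling (assumption)
        loss = None
        if ner_labels is not None and role_label is not None:
            ce = nn.CrossEntropyLoss()
            l_ner = ce(ner_logits.view(-1, ner_logits.size(-1)), ner_labels.view(-1))
            l_cls = ce(role_logits, role_label)
            loss = lam * l_ner + (1 - lam) * l_cls  # L_total = λ·L_NER + (1-λ)·L_cls
        return loss, ner_logits, role_logits
```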

## Evaluation

### Validation Set Performance

| Task | Metric | Score |
|------|--------|-------|
| NER | Precision | 88.00% |
| NER | Recall | 91.00% |
| NER | F1 Score | **89.40%** |
| Classification | Accuracy | **100.00%** |

### Test Set Performance

| Task | Metric | Score |
|------|--------|-------|
| NER | Precision | 47.00% |
| NER | Recall | 81.00% |
| NER | F1 Score | 59.41% |
| Classification | Accuracy | 100.00% |

**Note**: the gap between validation and test F1 suggests potential overfitting to the validation distribution. Future work will address this with more diverse test data.

### Comparison to Baseline

| Model | Validation F1 | Improvement |
|-------|---------------|-------------|
| Baseline NER | 55.64% | - |
| + CRF layer | 84.39% | +28.75 pp |
| **Joint model (this work)** | **89.40%** | **+33.76 pp** |

## Usage

### Installation

```bash
pip install transformers torch
```

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model_name = "alexgoldberg/hebrew-manuscript-joint-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Example Hebrew text from a manuscript catalog
# ("Written by R. Yaakov ben Moshe")
text = "נכתב על ידי ר' יעקב בן משה"

# Tokenize
inputs = tokenizer(text, return_tensors="pt")

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Map token IDs and label IDs back to strings
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]

# Print per-token labels, skipping special tokens
for token, label in zip(tokens, labels):
    if token not in ('[CLS]', '[SEP]', '[PAD]'):
        print(f"{token}: {label}")
```

### Advanced Usage: Extract Entities

```python
def extract_entities(text, model, tokenizer):
    """Extract PERSON entity strings from Hebrew text."""
    inputs = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.argmax(outputs.logits, dim=-1)

    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    labels = [model.config.id2label[p.item()] for p in predictions[0]]

    def add_token(entity, token):
        # Merge WordPiece continuations ("##...") into the previous
        # word; otherwise start a new word.
        if token.startswith('##') and entity:
            entity[-1] += token[2:]
        else:
            entity.append(token)

    entities = []
    current_entity = []

    for token, label in zip(tokens, labels):
        if label == 'B-PERSON':
            if current_entity:
                entities.append(' '.join(current_entity))
                current_entity = []
            add_token(current_entity, token)
        elif label == 'I-PERSON' and current_entity:
            add_token(current_entity, token)
        else:
            if current_entity:
                entities.append(' '.join(current_entity))
                current_entity = []

    if current_entity:
        entities.append(' '.join(current_entity))

    return entities

# Example ("R. Avraham ben David wrote it and Moshe the scribe copied it")
text = "כתב ר' אברהם בן דוד והעתיק משה הסופר"
entities = extract_entities(text, model, tokenizer)
print("Found entities:", entities)
# Expected output: ['אברהם בן דוד', 'משה הסופר']
```
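
For the Linked Open Data application listed under Intended Use, a small, hypothetical sketch of turning extracted (person, role) pairs into RDF triples might look as follows. rdflib is not a dependency of this model, and the namespace and predicate names are invented for illustration:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDFS

# Invented namespace and role predicates, for illustration only
EX = Namespace("http://example.org/manuscripts/")
ROLE_PREDICATES = {
    "AUTHOR": EX.hasAuthor,
    "SCRIBE": EX.hasScribe,
    "PATRON": EX.hasPatron,
}

def to_triples(manuscript_id, person_roles):
    """Map (person name, role) pairs for one record to RDF triples."""
    g = Graph()
    ms = EX[manuscript_id]
    for name, role in person_roles:
        person = EX["person/" + name.replace(" ", "_")]
        g.add((person, RDFS.label, Literal(name, lang="he")))
        g.add((ms, ROLE_PREDICATES[role], person))
    return g

g = to_triples("ms001", [("אברהם בן דוד", "AUTHOR"), ("משה הסופר", "SCRIBE")])
print(g.serialize(format="turtle"))
```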

## Limitations

1. **Domain-specific**: optimized for Hebrew manuscript catalog records; performance may degrade on other text types
2. **Single entity type**: identifies only PERSON entities (not PLACE, DATE, WORK, etc.)
3. **Role coverage**: limited to the AUTHOR, SCRIBE, and PATRON roles
4. **Historical Hebrew**: performs best on historical/rabbinical Hebrew; may underperform on modern Hebrew
5. **Test-set gap**: validation F1 (89.40%) is significantly higher than test F1 (59.41%), indicating potential overfitting

## Ethical Considerations

- **Bias**: training data derived from library catalogs may reflect historical biases in manuscript preservation
- **Cultural sensitivity**: the model handles religious and cultural content; users should apply appropriate domain expertise
- **Accuracy**: not suitable for critical applications without human review

## Citation

If you use this model, please cite:

```bibtex
@misc{goldberg2025hebrewjoint,
  author = {Goldberg, Alexander},
  title = {Hebrew Manuscript Joint Entity-Role Extraction Model},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/alexgoldberg/hebrew-manuscript-joint-ner}
}
```

## Contact

- **Author**: Alexander Goldberg
- **Institution**: Technion - Israel Institute of Technology
- **Email**: [email protected]
- **Paper**: [Link to paper when published]

## Acknowledgments

- **Base model**: DictaBERT by the Dicta team
- **Dataset**: Hebrew manuscript MARC records from multiple libraries
- **Framework**: Hugging Face Transformers

## License

MIT License - see the LICENSE file for details.

## Model Card Authors

Alexander Goldberg

## Model Card Contact
config.json ADDED
@@ -0,0 +1,18 @@
{
  "architectures": [
    "BertForTokenClassification"
  ],
  "model_type": "bert",
  "num_labels": 3,
  "id2label": {
    "0": "O",
    "1": "B-PERSON",
    "2": "I-PERSON"
  },
  "label2id": {
    "O": 0,
    "B-PERSON": 1,
    "I-PERSON": 2
  },
  "base_model": "dicta-il/dictabert"
}
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ac9cd305a2757b25f972b3202591bc9ee89611ba4181b568fff5e4044be36a9b
size 2219554257
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
{
  "cls_token": {
    "content": "[CLS]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "[MASK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "[PAD]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "[SEP]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "[UNK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,64 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "4": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "5": {
      "content": "[BLANK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_lower_case": true,
  "extra_special_tokens": {},
  "mask_token": "[MASK]",
  "model_max_length": 512,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
training_info.json ADDED
@@ -0,0 +1,12 @@
{
  "framework": "pytorch",
  "training_time_hours": 1,
  "hardware": "Apple M1 Mac",
  "training_samples": 8794,
  "validation_samples": 760,
  "test_samples": 799,
  "best_validation_f1": 89.4,
  "best_validation_accuracy": 100.0,
  "test_f1": 59.41,
  "test_accuracy": 100.0
}
vocab.txt ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0fb90bfa35244d26f0065d1fcd0b5becc3da3d44d616a7e2aacaf6320b9fa2d0
size 1500244