license: apache-2.0
metrics:
  - bleu
language:
  - id
  - ges
  - en
tags:
  - translation
  - marianmt
  - low-resource
  - endangered-language
library_name: transformers
pipeline_tag: translation
model-index:
  - name: Indo-to-Geser
    results:
      - task:
          name: Translation
          type: translation
        metrics:
          - name: BLEU
            type: bleu
            value: 26.36

Manuk MarianMT Translator (Indonesian → Geser)

Model Description

This MarianMT-based model was trained for the Manuk project, focusing on the revitalisation of the Geser language (Eastern Seram).
It is fine-tuned on a custom parallel corpus containing 7,327 aligned sentences across Geser, Indonesian, and English.

Evaluation

The model was evaluated on the Indonesian → Geser direction, achieving the following results:

| Direction | Val BLEU | Test BLEU | Val Loss | Train Loss |
|---|---|---|---|---|
| Indonesian → Geser | 26.08 | 26.36 | 0.17 | 0.21 |

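The test BLEU of 26.36 is close to the validation BLEU of 26.08, which suggests the model behaves consistently on held-out data and produces usable, though imperfect, translations for a low-resource language pair.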

Sample Corpus (Parallel Translations)

| Domain | Geser | Indonesian |
|---|---|---|
| Traditional Medicine | akar dirang mera ira, kalu me mancia dageit le ikea ima di nabagadik, dafaik dani akara baru datutu dabobar nai ikea ima di nabagadik ira. | akar serai merah digunakan ketika seseorang jatuh dan mengalami patah pada kaki atau tangannya. akar tersebut diambil, lalu ditumbuk dan dibungkuskan pada bagian tubuh yang patah. |
| Family & Livelihood | dodani, nugu abang tura nugu baba dasubelat daroka ikan wekan loka. moale, dodi datanak lau pasar ababis loka, jadi bot naresi oaca mo. | tadi malam, abang dan ayah saya pergi memancing dan berhasil mendapatkan banyak ikan. akan tetapi, mereka menjual semuanya ke pasar hingga tersisa sedikit saja. |

Requirements

This model was tested with Python 3.11. To use AiRukua/Indo-to-Geser, you need to install the following dependencies:

Minimal Installation

pip install torch sentencepiece sacremoses transformers

Recommended Installation (for Python 3.11 with CUDA support)

  1. Check your CUDA version:

    nvidia-smi
    
  2. Install PyTorch with the matching CUDA toolkit. Example (CUDA 12.1):

    pip install torch --index-url https://download.pytorch.org/whl/cu121
    

    For CPU only:

    pip install torch --index-url https://download.pytorch.org/whl/cpu
    
  3. Install the remaining dependencies:

    pip install sentencepiece sacremoses transformers
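After installing, it can help to confirm that every package is importable from the interpreter you plan to use. The following is a minimal sketch using only the standard library; `check_dependencies` and `REQUIRED` are hypothetical names introduced here for illustration and are not part of the model or of any of the listed packages.

```python
import importlib.util

# Packages named in the installation steps above.
REQUIRED = ("torch", "sentencepiece", "sacremoses", "transformers")


def check_dependencies(names=REQUIRED):
    """Return {package_name: bool}, True when the package can be found."""
    # find_spec returns None for a top-level module that is not installed.
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

A package reported as MISSING usually means `pip` installed into a different environment than the one running the script.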
    

Usage

Download and Load Model

from transformers import MarianMTModel, MarianTokenizer

model_name = "AiRukua/Indo-to-Geser"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

Translation Functions

from functools import lru_cache

import torch
from transformers import MarianMTModel, MarianTokenizer

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

@lru_cache(maxsize=None)
def load_model(model_name="AiRukua/Indo-to-Geser"):
    # Load the tokenizer and model once per model name; later calls reuse them
    # instead of reloading the weights for every sentence.
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name).to(DEVICE)
    return tokenizer, model

def translate(text, model_name="AiRukua/Indo-to-Geser", max_len=128, num_beams=4):
    tokenizer, model = load_model(model_name)

    inputs = tokenizer(
        text,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=max_len,
    ).to(DEVICE)

    # Inference only, so gradient tracking is not needed.
    with torch.no_grad():
        outputs = model.generate(
            **inputs, max_length=max_len, num_beams=num_beams
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

def translate_dialogue(dialogue: str, model_name="AiRukua/Indo-to-Geser"):
    translated_lines = []

    for line in dialogue.strip().split("\n"):
        if not line.strip():
            continue  # skip blank lines
        if ":" in line:
            # Keep the speaker label and translate only the utterance.
            speaker, text = line.split(":", 1)
            translated_text = translate(text.strip(), model_name)
            translated_lines.append(f"{speaker.strip()}: {translated_text}")
        else:
            translated_lines.append(translate(line.strip(), model_name))

    return "\n".join(translated_lines)

# Example usage
dialogue = """
Alice: Apa kabar hari ini?
Bob: Saya baik, terima kasih.
"""

print(translate_dialogue(dialogue))
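
When many sentences need translating, calling the model once per sentence is usually slower than passing several sentences together. The sketch below batches the input, assuming `tokenizer` and `model` are the objects created under Download and Load Model. `translate_batch` and its `batch_size` parameter are hypothetical names introduced for illustration, and the default of 16 is an arbitrary starting point, not a tuned value.

```python
def translate_batch(sentences, tokenizer, model,
                    max_len=128, num_beams=4, batch_size=16):
    """Translate a list of Indonesian sentences with one generate() call per batch."""
    translations = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        # padding=True pads each sentence to the longest one in this batch.
        inputs = tokenizer(
            batch,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=max_len,
        ).to(model.device)
        outputs = model.generate(
            **inputs, max_length=max_len, num_beams=num_beams
        )
        # batch_decode returns one string per input sentence, in order.
        translations.extend(
            tokenizer.batch_decode(outputs, skip_special_tokens=True)
        )
    return translations
```

For example, `translate_batch(["Apa kabar hari ini?", "Saya baik, terima kasih."], tokenizer, model)` returns a list with one Geser translation per input sentence. Lower `batch_size` if memory runs out on long inputs.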

Intended Use

  • Translation from Indonesian to Geser
  • Research and education on endangered language technology
  • Community-driven language revitalisation projects