Manuk MarianMT Translator (Indonesian → Geser)

Model Description

This MarianMT-based model was trained for the Manuk project, focusing on the revitalisation of the Geser language (Eastern Seram).
It is fine-tuned on a custom parallel corpus containing 7,327 aligned sentences across Geser, Indonesian, and English.

Evaluation

The model was evaluated on the Indonesian → Geser direction, achieving the following results:

Direction            Val BLEU   Test BLEU   Val Loss   Train Loss
Indonesian → Geser   26.08      26.36       0.17       0.21

The test BLEU (26.36) is close to the validation BLEU (26.08), which suggests that the model generalises to held-out data and produces reasonable translations for a low-resource language pair.
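To make the BLEU figures easier to interpret, here is a minimal sketch of how a corpus-level BLEU score can be computed. The card does not state which BLEU implementation or tokenisation was used, so the function `corpus_bleu` below is a hypothetical, simplified version (whitespace tokenisation, one reference per sentence, no smoothing) and will not necessarily reproduce the reported scores.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU on a 0-100 scale.

    Assumptions (not taken from the model card): whitespace tokenisation,
    exactly one reference per hypothesis, and no smoothing.
    """
    matches = [0] * max_n  # clipped n-gram matches per order
    totals = [0] * max_n   # hypothesis n-gram counts per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ngrams, r_ngrams = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_ngrams[g]) for g, c in h_ngrams.items())
            totals[n - 1] += sum(h_ngrams.values())

    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty for hypotheses shorter than their references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)
```

With this definition, a hypothesis identical to its reference scores 100, and a hypothesis sharing no 4-gram with its reference scores 0.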

Sample Corpus (Parallel Translations)

Domain: Traditional Medicine
Geser: akar dirang mera ira, kalu me mancia dageit le ikea ima di nabagadik, dafaik dani akara baru datutu dabobar nai ikea ima di nabagadik ira.
Indonesian: akar serai merah digunakan ketika seseorang jatuh dan mengalami patah pada kaki atau tangannya. akar tersebut diambil, lalu ditumbuk dan dibungkuskan pada bagian tubuh yang patah.

Domain: Family & Livelihood
Geser: dodani, nugu abang tura nugu baba dasubelat daroka ikan wekan loka. moale, dodi datanak lau pasar ababis loka, jadi bot naresi oaca mo.
Indonesian: tadi malam, abang dan ayah saya pergi memancing dan berhasil mendapatkan banyak ikan. akan tetapi, mereka menjual semuanya ke pasar hingga tersisa sedikit saja.

Requirements

This model was tested with Python 3.11. To use AiRukua/Indo-to-Geser, you need to install the following dependencies:

Minimal Installation

pip install torch sentencepiece sacremoses transformers

Recommended Installation (for Python 3.11 with CUDA support)

  1. Check your CUDA version:

    nvidia-smi
    
  2. Install PyTorch with the matching CUDA toolkit. Example (CUDA 12.1):

    pip install torch --index-url https://download.pytorch.org/whl/cu121
    

    For CPU only:

    pip install torch --index-url https://download.pytorch.org/whl/cpu
    
  3. Install the remaining dependencies:

    pip install sentencepiece sacremoses transformers
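After installing, it can be useful to confirm that all four packages are visible to the interpreter you plan to use. The following is a small sketch using only the standard library; the helper name `missing_packages` is made up for this example and is not part of the model or of any of the listed packages.

```python
import importlib.util

# Packages named in the installation steps of this card.
REQUIRED = ("torch", "sentencepiece", "sacremoses", "transformers")


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

The check only looks up each package without importing it, so it runs quickly and does not initialise CUDA.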
    

Usage

Download and Load Model

from transformers import MarianMTModel, MarianTokenizer

model_name = "AiRukua/Indo-to-Geser"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

Translation Functions

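from functools import lru_cache

import torch
from transformers import MarianMTModel, MarianTokenizer

@lru_cache(maxsize=None)
def load_translator(model_name="AiRukua/Indo-to-Geser"):
    # Load the tokenizer and model once per model name and reuse them,
    # instead of reloading both on every call to translate().
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name).to(device)
    return tokenizer, model, device

def translate(text, model_name="AiRukua/Indo-to-Geser", max_len=128, num_beams=4):
    tokenizer, model, device = load_translator(model_name)

    # Tokenise the Indonesian input, truncating anything longer than max_len tokens.
    inputs = tokenizer(
        text,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=max_len,
    ).to(device)

    # Generate with beam search; gradients are not needed for inference.
    with torch.no_grad():
        outputs = model.generate(
            **inputs, max_length=max_len, num_beams=num_beams
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)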

def translate_dialogue(dialogue: str, model_name="AiRukua/Indo-to-Geser"):
    lines = dialogue.strip().split("\n")
    translated_lines = []

    for line in lines:
        if not line.strip():
            continue
        if ":" in line:
            speaker, text = line.split(":", 1)
            translated_text = translate(text.strip(), model_name)
            translated_lines.append(f"{speaker}: {translated_text}")
        else:
            translated_text = translate(line.strip(), model_name)
            translated_lines.append(translated_text)

    return "\n".join(translated_lines)

# Example usage
dialogue = """
Alice: Apa kabar hari ini?
Bob: Saya baik, terima kasih.
"""

print(translate_dialogue(dialogue))
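When there are many sentences to translate, passing them to the model in batches is usually faster than translating them one at a time. The sketch below is one possible way to do this and is not part of the original card: `translate_batch` and its `batch_size` parameter are hypothetical names, and the function expects a tokenizer and model that have already been loaded, as in the Download and Load Model section.

```python
def translate_batch(texts, tokenizer, model, batch_size=16, max_len=128, num_beams=4):
    """Translate a list of sentences in fixed-size batches.

    Hypothetical helper: `tokenizer` and `model` are assumed to be an
    already-loaded MarianTokenizer and MarianMTModel.
    """
    translations = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # Pad to the longest sentence in the batch and move tensors to the model's device.
        inputs = tokenizer(
            batch,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=max_len,
        ).to(model.device)
        outputs = model.generate(**inputs, max_length=max_len, num_beams=num_beams)
        # Decode every sequence in the batch, keeping the input order.
        translations.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return translations
```

For example, `translate_batch(["Apa kabar hari ini?", "Saya baik, terima kasih."], tokenizer, model)` returns one Geser string per input sentence, in the same order. A smaller `batch_size` reduces memory use at the cost of speed.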

Intended Use

  • Translation from Indonesian to Geser
  • Research and education on endangered language technology
  • Community-driven language revitalisation projects

Model size: 72.2M params (Safetensors, tensor type F32)