Manuk MarianMT Translator (Indonesian → Geser)

Model Description

This MarianMT-based model was trained for the Manuk project, which focuses on revitalising the Geser language (Eastern Seram).
It is fine-tuned on a custom parallel corpus containing 7,327 aligned sentences across Geser, Indonesian, and English.

Evaluation

The model was evaluated on the Indonesian → Geser direction, achieving the following results:

| Direction | Val BLEU | Test BLEU | Val Loss | Train Loss |
|---|---|---|---|---|
| Indonesian → Geser | 26.08 | 26.36 | 0.17 | 0.21 |

A test BLEU of 26.36, close to the validation BLEU of 26.08, indicates that the model produces reasonable translations for a low-resource language pair. The scores are self-reported.
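
For readers unfamiliar with the metric, BLEU combines clipped n-gram precision (n = 1 to 4) with a brevity penalty. The sketch below is a simplified, whitespace-tokenised corpus BLEU with one reference per sentence. It is an illustration only: it is not the evaluation script behind the scores reported here, and its output will generally differ from standard tools because of tokenisation and smoothing differences. The names `ngram_counts` and `simple_corpus_bleu` are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU (0-100), one reference per hypothesis, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp_tokens, n)
            ref_counts = ngram_counts(ref_tokens, n)
            # Clipped matches: each n-gram counts at most as often as in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, any n-gram order with zero matches gives a score of 0.
    if min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalise output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)
```

A hypothesis identical to its reference scores 100.0, and a hypothesis that shares no words with its reference scores 0.0.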

Sample Corpus (Parallel Translations)

| Domain | Geser | Indonesian |
|---|---|---|
| Traditional Medicine | akar dirang mera ira, kalu me mancia dageit le ikea ima di nabagadik, dafaik dani akara baru datutu dabobar nai ikea ima di nabagadik ira. | akar serai merah digunakan ketika seseorang jatuh dan mengalami patah pada kaki atau tangannya. akar tersebut diambil, lalu ditumbuk dan dibungkuskan pada bagian tubuh yang patah. |
| Family & Livelihood | dodani, nugu abang tura nugu baba dasubelat daroka ikan wekan loka. moale, dodi datanak lau pasar ababis loka, jadi bot naresi oaca mo. | tadi malam, abang dan ayah saya pergi memancing dan berhasil mendapatkan banyak ikan. akan tetapi, mereka menjual semuanya ke pasar hingga tersisa sedikit saja. |

Requirements

This model was tested with Python 3.11. To use AiRukua/Indo-to-Geser, you need to install the following dependencies:

Minimal Installation

pip install torch sentencepiece sacremoses transformers

Recommended Installation (for Python 3.11 with CUDA support)

Check your CUDA version:
```
nvidia-smi
```

Install PyTorch with the matching CUDA toolkit. Example (CUDA 12.1):

pip install torch --index-url https://download.pytorch.org/whl/cu121

For CPU only:

pip install torch --index-url https://download.pytorch.org/whl/cpu

Install the remaining dependencies:

pip install sentencepiece sacremoses transformers
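
After installation, it can help to confirm that all four packages are visible to the interpreter before loading the model. The following is a minimal sketch that uses only the standard library; `missing_packages` and `REQUIRED` are hypothetical names introduced for illustration.

```python
import importlib.util

# Packages listed in the installation commands of this README.
REQUIRED = ["torch", "sentencepiece", "sacremoses", "transformers"]


def missing_packages(names):
    """Return the package names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

The check only looks the packages up and does not import them, so it runs quickly even when PyTorch is installed.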

Usage

Download and Load Model

from transformers import MarianMTModel, MarianTokenizer

model_name = "AiRukua/Indo-to-Geser"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

Translation Functions

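import torch
from functools import lru_cache
from transformers import MarianMTModel, MarianTokenizer

@lru_cache(maxsize=None)
def load_model(model_name):
    # Load the tokenizer and model once per model name and reuse them,
    # so repeated translate() calls do not reload the weights.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name).to(device)
    return tokenizer, model, device

def translate(text, model_name="AiRukua/Indo-to-Geser", max_len=128, num_beams=4):
    tokenizer, model, device = load_model(model_name)

    # Tokenize the input and move the tensors to the model's device.
    inputs = tokenizer(
        text,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=max_len,
    ).to(device)

    # Beam search decoding; gradients are not needed for inference.
    with torch.no_grad():
        outputs = model.generate(
            **inputs, max_length=max_len, num_beams=num_beams
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)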

def translate_dialogue(dialogue: str, model_name="AiRukua/Indo-to-Geser"):
    lines = dialogue.strip().split("\n")
    translated_lines = []

    for line in lines:
        if not line.strip():
            continue
        if ":" in line:
            speaker, text = line.split(":", 1)
            translated_text = translate(text.strip(), model_name)
            translated_lines.append(f"{speaker}: {translated_text}")
        else:
            translated_text = translate(line.strip(), model_name)
            translated_lines.append(translated_text)

    return "\n".join(translated_lines)

# Example usage
dialogue = """
Alice: Apa kabar hari ini?
Bob: Saya baik, terima kasih.
"""

print(translate_dialogue(dialogue))
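
The `translate` function above handles one string per call. For longer inputs, translating several sentences per forward pass may be faster. The sketch below is an assumption and not part of the released code: `batches` and `translate_batch` are hypothetical helper names, and the default `batch_size=16` is arbitrary. It expects a `tokenizer` and `model` loaded as in "Download and Load Model".

```python
def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_batch(texts, tokenizer, model, batch_size=16, max_len=128, num_beams=4):
    """Translate a list of Indonesian sentences, several per forward pass."""
    results = []
    for batch in batches(texts, batch_size):
        # Pad the sentences in a batch to a common length and move them
        # to whichever device the model lives on.
        inputs = tokenizer(
            batch,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=max_len,
        ).to(model.device)
        # generate() already runs without gradient tracking.
        outputs = model.generate(**inputs, max_length=max_len, num_beams=num_beams)
        results.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return results
```

Calling `translate_batch(["Apa kabar hari ini?", "Saya baik, terima kasih."], tokenizer, model)` returns a list of Geser strings in the same order as the inputs.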

Intended Use

Translation from Indonesian to Geser
Research and education on endangered language technology
Community-driven language revitalisation projects


Model size: 72.2M parameters (Safetensors, tensor type F32)
