---
title: Test of Time Accuracy
datasets:
  - baharef/ToT
  - aauss/ToT_separate_instructions
tags:
  - evaluate
  - metric
  - temporal reasoning
description: Accuracy metric for the Test of Time benchmark by Fatemi et al. (2025).
sdk: gradio
sdk_version: 6.0.0
app_file: app.py
pinned: false
emoji: 📊
colorFrom: gray
colorTo: indigo
---

# Metric Card for Test of Time Accuracy

## Metric Description

This metric is designed for the Test of Time (ToT) benchmark (Fatemi et al., 2025). It measures the accuracy of model predictions against reference answers. The metric expects model outputs to be formatted as a JSON object (e.g., `{"answer": "...", "explanation": "..."}`).

It performs the following steps:

1. Extracts the first valid JSON object from the model's prediction string.
2. Processes the JSON based on the specified subset:
   - `semantic`: extracts the value of the `"answer"` field.
   - `arithmetic`: removes the `"explanation"` field and compares the remaining dictionary (containing the answer) to the reference.
3. Compares the processed prediction with the reference to compute accuracy. The processed prediction is a dictionary for the `arithmetic` subset and a string for the `semantic` subset (see the sketch below).
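
The following minimal sketch illustrates these steps. The helper names (`extract_first_json`, `process_prediction`) are illustrative and not necessarily those used in the actual implementation:

```python
import json
from typing import Optional


def extract_first_json(text: str) -> Optional[dict]:
    """Return the first decodable JSON object found in `text`, or None."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char != "{":
            continue
        try:
            obj, _ = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            continue
    return None


def process_prediction(prediction: str, subset: str):
    """Reduce a raw model output to the value that is compared against the reference."""
    parsed = extract_first_json(prediction)
    if parsed is None:
        return None  # malformed output: will never match the reference
    if subset == "semantic":
        return parsed.get("answer")  # only the answer value is compared
    if subset == "arithmetic":
        parsed.pop("explanation", None)  # drop the explanation, keep the answer structure
        return parsed
    raise ValueError(f"Unknown subset: {subset!r}")
```

For the arithmetic subset, the reference string would likewise be parsed into a dictionary before the two are compared.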

## How to Use

You can load the metric using the `evaluate` library:

```python
import evaluate

metric = evaluate.load("aauss/test_of_time_accuracy")

predictions = [
    '{"explanation": "Some explanation...", "unordered_list": ["London"]}',
    ' "Response without opening curly brackets...", "answer": "2005-04-07"}',
]
references = [
    '{"unordered_list": ["London"]}',
    "{'answer': '2005-04-07'}",
]

print(
    metric.compute(
        predictions=predictions,
        references=references,
        subset="arithmetic",
    )
)
# {'accuracy': 0.5}

print(
    metric.compute(
        predictions=predictions,
        references=references,
        subset="arithmetic",
        return_average=False,
    )
)
# {'accuracy': [True, False]}

predictions = [
    '{"explanation": "Some explanation leading to a wrong answer...", "answer": 1}',
    '{"explanation": "Some explanation ...", "answer": "1985"}',
]
references = ["0", "1985"]

print(
    metric.compute(
        predictions=predictions,
        references=references,
        subset="semantic",
    )
)
# {'accuracy': 0.5}
```

## Inputs

- `predictions` (list of str): List of predictions to score. Each prediction should be a string that contains a JSON object (e.g., generated by an LLM).
- `references` (list of str): List of reference answers.
- `subset` (str): The subset of the benchmark being evaluated. Must be one of:
  - `"arithmetic"`: Used for arithmetic tasks where the answer's structure must be preserved (the "explanation" field is ignored).
  - `"semantic"`: Used for semantic tasks where only the "answer" value is compared.
- `return_average` (bool, optional): If True, returns the average accuracy. If False, returns a list of boolean scores (correct/incorrect) for each sample; see the example below. Defaults to True.
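
For example, the per-sample scores obtained with `return_average=False` can be used to locate failing predictions. A small sketch, assuming the output dictionary described under Output Values:

```python
import evaluate

metric = evaluate.load("aauss/test_of_time_accuracy")

predictions = [
    '{"explanation": "Some explanation...", "answer": "1985"}',
    "No JSON object in this response.",
]
references = ["1985", "0"]

result = metric.compute(
    predictions=predictions,
    references=references,
    subset="semantic",
    return_average=False,
)

# Collect the indices of predictions that did not match their reference.
failed = [i for i, correct in enumerate(result["accuracy"]) if not correct]
print(failed)  # expected: [1]
```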

## Output Values

The metric returns a dictionary with the following key:

- `accuracy` (float or list of bool): The accuracy score (0.0 to 1.0) if `return_average=True`, or a list of booleans indicating correctness per sample if `return_average=False`.

This metric can take on any value between 0.0 and 1.0, inclusive. Higher scores are better. With `return_average=False`, each entry in the returned list is True for a correct prediction and False otherwise.

## Values from Popular Papers

Check out the original paper for reference performances.

## Limitations and Bias

- The metric relies on `json.JSONDecoder` to find the first JSON object in the prediction string. If the model output is malformed or does not contain valid JSON, extraction fails (returning None) and the prediction is scored as incorrect; see the sketch below.
- The metric strictly expects the extracted JSON to follow the answer format described in the task prompt (optionally accompanied by an "explanation" field) for the comparison logic to work as intended.
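
For example, a prediction from which no JSON object can be extracted is simply scored as incorrect rather than raising an error (a sketch of the expected behaviour, not a guaranteed output):

```python
import evaluate

metric = evaluate.load("aauss/test_of_time_accuracy")

result = metric.compute(
    predictions=["The answer is London, but there is no JSON here."],
    references=['{"unordered_list": ["London"]}'],
    subset="arithmetic",
    return_average=False,
)
print(result["accuracy"])  # expected: [False] (extraction failed, so the sample counts as wrong)
```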

## Citation

Evaluation is not described in detail in the paper, but we can assume that model answers were parsed to allow for a more robust evaluation.

```bibtex
@InProceedings{huggingface:module,
  title  = {Test of Time Accuracy},
  author = {Auss Abbood},
  year   = {2025}
}
```