---
title: Test of Time Accuracy
datasets:
- baharef/ToT
- aauss/ToT_separate_instructions
tags:
- evaluate
- metric
- temporal reasoning
description: Accuracy metric for the Test of Time benchmark by Fatemi et al. (2025).
sdk: gradio
sdk_version: 6.0.0
app_file: app.py
pinned: false
emoji: 📊
colorFrom: gray
colorTo: indigo
---

# Metric Card for Test of Time Accuracy

## Metric Description

This metric is designed for the **Test of Time (ToT)** benchmark (Fatemi et al., 2025). It measures the accuracy of model predictions against reference answers.

The metric expects model outputs to be formatted as a JSON object (e.g., `{"answer": "...", "explanation": "..."}`). It performs the following steps (sketched under *Examples* below):

1. Extracts the first valid JSON object from the model's prediction string.
2. Processes the JSON based on the specified `subset`:
   - **semantic**: Extracts the value of the `"answer"` field.
   - **arithmetic**: Removes the `"explanation"` field and compares the remaining dictionary (containing the answer) to the reference.
3. Compares the processed prediction with the reference to calculate accuracy. The compared values are dictionaries for the arithmetic subset and strings for the semantic subset.

## How to Use

You can load the metric using the `evaluate` library:

```python
import evaluate

metric = evaluate.load("aauss/test_of_time_accuracy")

predictions = [
    '{"explanation": "Some explanation...", "unordered_list": ["London"]}',
    ' "Response without opening curly brackets...", "answer": "2005-04-07"}',
]
references = [
    '{"unordered_list": ["London"]}',
    "{'answer': '2005-04-07'}",
]
print(
    metric.compute(
        predictions=predictions,
        references=references,
        subset="arithmetic",
    )
)
>>> {'accuracy': 0.5}

print(
    metric.compute(
        predictions=predictions,
        references=references,
        subset="arithmetic",
        return_average=False,
    )
)
>>> {'accuracy': [True, False]}

predictions = [
    '{"explanation": "Some explanation leading to a wrong answer...", "answer": 1}',
    '{"explanation": "Some explanation ...", "answer": "1985"}',
]
references = ["0", "1985"]
print(
    metric.compute(
        predictions=predictions,
        references=references,
        subset="semantic",
    )
)
>>> {'accuracy': 0.5}
```

### Inputs

- **predictions** (`list` of `str`): List of predictions to score. Each prediction should be a string that contains a JSON object (e.g., generated by an LLM).
- **references** (`list` of `str`): List of reference answers.
- **subset** (`str`): The subset of the benchmark being evaluated. Must be one of:
  - `"arithmetic"`: Used for arithmetic tasks, where the structure of the answer must be preserved (the `"explanation"` field is ignored).
  - `"semantic"`: Used for semantic tasks, where only the `"answer"` value is compared.
- **return_average** (`bool`, optional): If `True`, returns the average accuracy. If `False`, returns a list of boolean scores (correct/incorrect) for each sample. Defaults to `True`.

### Output Values

The metric returns a dictionary with the following key:

- **accuracy** (`float` or `list` of `bool`): The accuracy score if `return_average=True`, or a list of booleans indicating correctness per sample if `return_average=False`.

Accuracy takes values between 0.0 and 1.0, inclusive. Higher scores are better.

#### Values from Popular Papers

Check out the original [paper](https://openreview.net/pdf?id=44CoQe6VCq) for reference performances.
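### Examples

The sketch below illustrates steps 1 and 2 from the metric description. It is a minimal, illustrative reimplementation, not the metric's actual internals: the helper names `extract_first_json` and `process_prediction` are hypothetical, and it assumes extraction scans the prediction for the first substring that `json.JSONDecoder.raw_decode` can parse.

```python
import json


def extract_first_json(text):
    """Return the first decodable JSON object found in `text`, or None."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char != "{":
            continue  # a JSON object must start with an opening brace
        try:
            obj, _ = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            continue
    return None  # no valid JSON object; the sample will count as incorrect


def process_prediction(text, subset):
    """Apply the subset-specific processing described above (illustrative)."""
    obj = extract_first_json(text)
    if obj is None:
        return None
    if subset == "semantic":
        return obj.get("answer")  # compare only the answer value
    obj.pop("explanation", None)  # arithmetic: drop the explanation, keep the rest
    return obj


print(process_prediction('{"explanation": "...", "answer": "1985"}', "semantic"))
# 1985 (the answer string)
print(process_prediction(' "no opening curly bracket...", "answer": "2005-04-07"}', "arithmetic"))
# None
```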
## Limitations and Bias

- The metric relies on `json.JSONDecoder` to find the first JSON object in the prediction string. If the model output is malformed or does not contain a valid JSON object, extraction fails (returning `None`) and the sample is scored as incorrect.
- It strictly expects the extracted JSON to follow the format described in the task prompt (the answer fields, plus an optional `"explanation"` field) for the logic to work as intended.
- The paper does not describe the evaluation procedure in detail, but model answers were presumably parsed in a similar way to allow for a more robust evaluation.

## Citation

```bibtex
@InProceedings{abbood2025totaccuracy,
  title  = {Test of Time Accuracy},
  author = {Auss Abbood},
  year   = {2025}
}
```