---
title: Test of Time Accuracy
datasets:
- baharef/ToT
- aauss/ToT_separate_instructions
tags:
- evaluate
- metric
- temporal reasoning
description: Accuracy metric for the Test of Time benchmark by Bahar et al. (2025).
sdk: gradio
sdk_version: 6.0.0
app_file: app.py
pinned: false
emoji: πŸ“Š
colorFrom: gray
colorTo: indigo
---

# Metric Card for Test of Time Accuracy

## Metric Description

This metric is designed for the **Test of Time (ToT)** benchmark (Bahar et al., 2025). It measures the accuracy of model predictions against reference answers. The metric expects model outputs to be formatted as a JSON object (e.g., `{"answer": "...", "explanation": "..."}`).

It performs the following steps:

1. Extracts the first valid JSON object from the model's prediction string.
2. Processes the JSON based on the specified `subset`:
   - **semantic**: Extracts the value of the "answer" field.
   - **arithmetic**: Removes the "explanation" field and compares the remaining dictionary (containing the answer) to the reference.
3. Compares the processed prediction with the reference to calculate accuracy; the compared value is a dictionary for the arithmetic subset and a string for the semantic subset (a sketch of this logic follows the list).

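A minimal sketch of the extraction and subset handling described above, using hypothetical helper names (`extract_first_json`, `process_prediction`) rather than the metric's actual source:

```python
import json
from typing import Optional


def extract_first_json(text: str) -> Optional[dict]:
    """Return the first decodable JSON object found in `text`, or None."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char != "{":
            continue
        try:
            obj, _ = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            continue
    return None


def process_prediction(prediction: str, subset: str):
    """Reduce a raw model response to the value that is compared to the reference."""
    obj = extract_first_json(prediction)
    if obj is None:
        return None  # no JSON object found: the sample is scored as incorrect
    if subset == "semantic":
        return obj.get("answer")  # only the "answer" value is compared
    obj.pop("explanation", None)  # arithmetic: drop the explanation ...
    return obj  # ... and compare the remaining dictionary to the reference
```

In this sketch, scanning only from `{` characters means a response that never opens a brace yields `None`, which is consistent with the second prediction in the example below scoring as incorrect.
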
## How to Use

You can load the metric using the `evaluate` library:

```python
import evaluate
metric = evaluate.load("aauss/test_of_time_accuracy")

predictions = [
    '{"explanation": "Some explanation...", "unordered_list": ["London"]}',
    ' "Response without opening curly brackets...", "answer": "2005-04-07"}',
]

references = [
    '{"unordered_list": ["London"]}',
    "{'answer': '2005-04-07'}",
]

print(
    metric.compute(
        predictions=predictions,
        references=references,
        subset="arithmetic",
    )
)
# 0.5

print(
    metric.compute(
        predictions=predictions,
        references=references,
        subset="arithmetic",
        return_average=False
    )
)
# [True, False]

predictions = [
    '{"explanation": "Some explanation leading to a wrong answer...", "answer": 1}',
    '{"explanation": "Some explanation ...", "answer": "1985"}'
]

references = ["0", "1985"]

print(
    metric.compute(
        predictions=predictions,
        references=references,
        subset="semantic",
    )
)
# 0.5
```

### Inputs

- **predictions** (`list` of `str`): List of predictions to score. Each prediction should be a string that contains a JSON object (e.g., generated by an LLM).
- **references** (`list` of `str`): List of reference answers.
- **subset** (`str`): The subset of the benchmark being evaluated. Must be one of:
  - `"arithmetic"`: Used for arithmetic tasks, where the prediction keeps its dictionary structure (minus the `"explanation"` field) and is compared as a whole.
  - `"semantic"`: Used for semantic tasks, where only the value of the `"answer"` field is compared.
- **return_average** (`bool`, optional): If `True`, returns the average accuracy. If `False`, returns a list of boolean scores (correct/incorrect) for each sample. Defaults to `True`.

### Output Values

The metric returns a dictionary with the following keys:

- **accuracy** (`float` or `list` of `bool`): The accuracy score (0.0 to 1.0) if `return_average=True`, or a list of booleans indicating correctness per sample if `return_average=False`.

Accuracy can take any value between 0.0 and 1.0, inclusive. Higher scores are better.

#### Values from Popular Papers

Check out the original [paper](https://openreview.net/pdf?id=44CoQe6VCq) for reference performance numbers.


## Limitations and Bias

- The metric relies on `json.JSONDecoder` to find the first JSON object in the prediction string. If the model output is malformed or does not contain a valid JSON object, extraction may fail (returning `None`), and the sample is then scored as incorrect (see the example below).
- It strictly expects the extracted JSON to contain the answer field(s) described in the task prompt, plus an optional `"explanation"` field, for the comparison to work as intended.
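
The behaviour described in the first bullet can be exercised directly; a short sketch (the prediction text is made up for illustration):

```python
import evaluate

metric = evaluate.load("aauss/test_of_time_accuracy")

# This prediction contains no JSON object at all, so extraction returns None
# and the sample is scored as incorrect rather than raising an error.
scores = metric.compute(
    predictions=["The final answer is London, but no JSON was produced."],
    references=['{"unordered_list": ["London"]}'],
    subset="arithmetic",
    return_average=False,
)
print(scores)  # expected: [False]
```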

## Citation

The paper does not describe the evaluation procedure in detail, but it is reasonable to assume that model answers were parsed to allow for a more robust evaluation.

```bibtex
@InProceedings{huggingface:module,
  title  = {Test of Time Accuracy},
  author = {Auss Abbood},
  year   = {2025}
}
```