---
title: Test of Time Accuracy
datasets:
- baharef/ToT
- aauss/ToT_separate_instructions
tags:
- evaluate
- metric
- temporal reasoning
description: Accuracy metric for the Test of Time benchmark by Fatemi et al. (2025).
sdk: gradio
sdk_version: 6.0.0
app_file: app.py
pinned: false
emoji: 📊
colorFrom: gray
colorTo: indigo
---

# Metric Card for Test of Time Accuracy

## Metric Description

This metric is designed for the **Test of Time (ToT)** benchmark (Fatemi et al., 2025). It measures the accuracy of model predictions against reference answers.

The metric expects model outputs to be formatted as a JSON object (e.g., `{"answer": "...", "explanation": "..."}`). It performs the following steps (sketched under *Examples* below):

1. Extracts the first valid JSON object from the model's prediction string.
2. Processes the JSON based on the specified `subset`:
   - **semantic**: Extracts the value of the `"answer"` field.
   - **arithmetic**: Removes the `"explanation"` field and compares the remaining dictionary (containing the answer) to the reference.
3. Compares the processed prediction with the reference to calculate accuracy. The compared values are dictionaries for the arithmetic subset and strings for the semantic subset.

## How to Use

You can load the metric using the `evaluate` library:

```python
import evaluate

metric = evaluate.load("aauss/test_of_time_accuracy")

predictions = [
    '{"explanation": "Some explanation...", "unordered_list": ["London"]}',
    ' "Response without opening curly brackets...", "answer": "2005-04-07"}',
]
references = [
    '{"unordered_list": ["London"]}',
    "{'answer': '2005-04-07'}",
]
print(
    metric.compute(
        predictions=predictions,
        references=references,
        subset="arithmetic",
    )
)
>>> {'accuracy': 0.5}

print(
    metric.compute(
        predictions=predictions,
        references=references,
        subset="arithmetic",
        return_average=False,
    )
)
>>> {'accuracy': [True, False]}

predictions = [
    '{"explanation": "Some explanation leading to a wrong answer...", "answer": 1}',
    '{"explanation": "Some explanation ...", "answer": "1985"}',
]
references = ["0", "1985"]
print(
    metric.compute(
        predictions=predictions,
        references=references,
        subset="semantic",
    )
)
>>> {'accuracy': 0.5}
```

### Inputs

- **predictions** (`list` of `str`): List of predictions to score. Each prediction should be a string that contains a JSON object (e.g., generated by an LLM).
- **references** (`list` of `str`): List of reference answers.
- **subset** (`str`): The subset of the benchmark being evaluated. Must be one of:
  - `"arithmetic"`: Used for arithmetic tasks, where the structure of the answer must be preserved (the `"explanation"` field is ignored).
  - `"semantic"`: Used for semantic tasks, where only the `"answer"` value is compared.
- **return_average** (`bool`, optional): If `True`, returns the average accuracy. If `False`, returns a list of boolean scores (correct/incorrect) for each sample. Defaults to `True`.

### Output Values

The metric returns a dictionary with the following key:

- **accuracy** (`float` or `list` of `bool`): The accuracy score if `return_average=True`, or a list of booleans indicating correctness per sample if `return_average=False`.

Accuracy takes values between 0.0 and 1.0, inclusive. Higher scores are better.

#### Values from Popular Papers

Check out the original [paper](https://openreview.net/pdf?id=44CoQe6VCq) for reference performances.
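### Examples

The sketch below illustrates steps 1 and 2 from the metric description. It is a minimal, illustrative reimplementation, not the metric's actual internals: the helper names `extract_first_json` and `process_prediction` are hypothetical, and it assumes extraction scans the prediction for the first substring that `json.JSONDecoder.raw_decode` can parse.

```python
import json


def extract_first_json(text):
    """Return the first decodable JSON object found in `text`, or None."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char != "{":
            continue  # a JSON object must start with an opening brace
        try:
            obj, _ = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            continue
    return None  # no valid JSON object; the sample will count as incorrect


def process_prediction(text, subset):
    """Apply the subset-specific processing described above (illustrative)."""
    obj = extract_first_json(text)
    if obj is None:
        return None
    if subset == "semantic":
        return obj.get("answer")  # compare only the answer value
    obj.pop("explanation", None)  # arithmetic: drop the explanation, keep the rest
    return obj


print(process_prediction('{"explanation": "...", "answer": "1985"}', "semantic"))
# 1985 (the answer string)
print(process_prediction(' "no opening curly bracket...", "answer": "2005-04-07"}', "arithmetic"))
# None
```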
## Limitations and Bias

- The metric relies on `json.JSONDecoder` to find the first JSON object in the prediction string. If the model output is malformed or does not contain a valid JSON object, extraction fails (returning `None`) and the sample is scored as incorrect.
- It strictly expects the extracted JSON to follow the format described in the task prompt (the answer fields, plus an optional `"explanation"` field) for the logic to work as intended.
- The paper does not describe the evaluation procedure in detail, but model answers were presumably parsed in a similar way to allow for a more robust evaluation.

## Citation

```bibtex
@InProceedings{abbood2025totaccuracy,
  title  = {Test of Time Accuracy},
  author = {Auss Abbood},
  year   = {2025}
}
```