arxiv:2601.10700

LIBERTy: A Causal Framework for Benchmarking Concept-Based Explanations of LLMs with Structural Counterfactuals

Published on Jan 15 · Submitted by Nitay Calderon on Jan 21
Abstract

A framework for generating structural counterfactual pairs using LLMs and SCMs enables improved evaluation and analysis of concept-based explanations in high-stakes domains.

AI-generated summary

Concept-based explanations quantify how high-level concepts (e.g., gender or experience) influence model behavior, which is crucial for decision-makers in high-stakes domains. Recent work evaluates the faithfulness of such explanations by comparing them to reference causal effects estimated from counterfactuals. In practice, existing benchmarks rely on costly human-written counterfactuals that serve as an imperfect proxy. To address this, we introduce LIBERTy (LLM-based Interventional Benchmark for Explainability with Reference Targets), a framework for constructing datasets of structural counterfactual pairs. LIBERTy is grounded in explicitly defined Structural Causal Models (SCMs) of the text-generation process: interventions on a concept propagate through the SCM until an LLM generates the counterfactual text. We introduce three datasets (disease detection, CV screening, and workplace violence prediction) together with a new evaluation metric, order-faithfulness. Using them, we evaluate a wide range of methods across five models and identify substantial headroom for improving concept-based explanations. LIBERTy also enables systematic analysis of model sensitivity to interventions: we find that proprietary LLMs show markedly reduced sensitivity to demographic concepts, likely due to post-training mitigation. Overall, LIBERTy provides a much-needed benchmark for developing faithful explainability methods.
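The abstract describes the pipeline only at a high level. As a hedged illustration of the idea, the sketch below builds a toy SCM, applies a do-intervention on one concept, and reuses the same exogenous noise so the factual and counterfactual texts form a matched pair; the SCM structure, variable names, and the placeholder `llm` callable are assumptions for illustration, not the paper's implementation.

```python
# Purely illustrative sketch: a toy SCM for CV screening in which an
# intervention on the "gender" node propagates to its descendants before an
# LLM writes the text. Variable names and the `llm` callable are hypothetical.
import random

def llm(prompt: str) -> str:
    """Placeholder for any text-generation call (e.g., an API client)."""
    return f"<generated CV conditioned on: {prompt}>"

def sample_scm(seed: int, gender: str = None) -> dict:
    rng = random.Random(seed)
    u_gender, u_years = rng.random(), rng.random()  # exogenous noise, shared across the pair
    g = gender or ("female" if u_gender < 0.5 else "male")  # do(gender=...) overrides the mechanism
    years = 1 + int(u_years * 14)                   # experience: not a descendant of gender here
    name = {"female": "Maria", "male": "Mark"}[g]   # name IS a descendant of gender
    text = llm(f"Write a CV for {name}, {years} years of experience.")
    return {"gender": g, "years": years, "text": text}

factual = sample_scm(seed=42)
flipped = "male" if factual["gender"] == "female" else "female"
counterfactual = sample_scm(seed=42, gender=flipped)  # same noise, intervened concept
print(factual["text"])
print(counterfactual["text"])
```

Because both calls consume the same exogenous noise, only the intervened concept and its descendants change between the two texts, which is what makes the pair a reference target for causal-effect estimation.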

Community

Paper submitter

The paper addresses the lack of reliable ground-truth benchmarks for evaluating concept-based explainability in Large Language Models. The authors introduce LIBERTy, a framework that generates "structural counterfactuals" by explicitly defining Structural Causal Models (SCMs) in which the LLM acts as the component that generates text. By intervening on high-level concepts (e.g., gender, disease symptoms) within the SCM and propagating these changes to the LLM's output, the framework creates synthetic yet causally grounded datasets without relying on costly human annotation. The study introduces three such datasets (covering disease detection, CV screening, and workplace violence prediction) and a new metric called "order-faithfulness" (see the sketch below). Experiments using LIBERTy reveal that while fine-tuned matching methods currently offer the best explanations, there is significant room for improvement, and some proprietary models like GPT-4o exhibit notably low sensitivity to demographic interventions, likely due to safety alignment.
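The page does not define order-faithfulness. One plausible reading, suggested by the name and the reference-causal-effect setup, is rank agreement between the concept ordering produced by an explanation method and the ordering implied by the reference effects; the Kendall-tau sketch below implements that assumed reading with toy numbers, not the paper's formula.

```python
# Assumed reading of "order-faithfulness" (not the paper's definition): rank
# agreement between concept effects from an explanation method and the
# reference causal effects estimated from structural counterfactuals.
from scipy.stats import kendalltau

reference_effects = {"experience": 0.35, "education": 0.21, "gender": 0.02}  # toy values
explained_effects = {"experience": 0.30, "education": 0.25, "gender": 0.05}  # toy values

concepts = list(reference_effects)
tau, _ = kendalltau([reference_effects[c] for c in concepts],
                    [explained_effects[c] for c in concepts])
print(f"order-faithfulness (Kendall tau): {tau:.2f}")  # 1.0 = identical ordering
```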

