As text-to-image (T2I) models have improved, there is growing interest in evaluating their prompt faithfulness: the semantic coherence between generated images and the prompts they were conditioned on. While various T2I faithfulness metrics have been proposed using cross-modal embeddings and vision-language models (VLMs), these metrics have not been thoroughly compared or benchmarked. Instead, they are typically evaluated against a few weak baselines by their correlation with human Likert scores over a set of easily distinguishable images.

To address this issue, we present T2IScoreScore (TS2), a curated set of semantic error graphs, each containing a prompt and a set of increasingly erroneous images. TS2 lets us rigorously assess whether a given prompt faithfulness metric correctly orders images by their objective error count and statistically separates images at different error nodes. We evaluate metric performance using meta-metric scores derived from established statistical tests.
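As a concrete illustration, below is a minimal sketch of meta-metrics in this spirit, assuming (as in the setup above) that each error graph yields per-image scalar metric scores grouped by objective error count. The function names and data layout are hypothetical, and the specific choice of Spearman rank correlation for ordering and the Kolmogorov-Smirnov statistic for separation is one plausible instantiation, not a definitive statement of TS2's implementation.

```python
from itertools import combinations
from scipy.stats import spearmanr, ks_2samp

def ordering_meta_metric(error_counts, scores):
    # Spearman's rho between error count and metric score; a faithful
    # metric should score images with more errors lower, so we negate
    # rho: +1.0 indicates a perfectly monotone (decreasing) ordering.
    rho, _ = spearmanr(error_counts, scores)
    return -rho

def separation_meta_metric(scores_by_node):
    # Mean pairwise Kolmogorov-Smirnov statistic between the score
    # distributions of different error nodes; higher means the metric
    # separates nodes more cleanly (1.0 = fully disjoint supports).
    stats = [ks_2samp(a, b).statistic
             for a, b in combinations(scores_by_node, 2)]
    return sum(stats) / len(stats)

# Hypothetical scores for one error graph: node 0 holds faithful
# images, node 1 images with one error, node 2 images with two.
scores_by_node = [[0.91, 0.88, 0.90], [0.74, 0.79], [0.52, 0.61]]
error_counts = [n for n, node in enumerate(scores_by_node) for _ in node]
flat_scores = [s for node in scores_by_node for s in node]

print(ordering_meta_metric(error_counts, flat_scores))  # close to 1.0
print(separation_meta_metric(scores_by_node))           # 1.0 here
```

Under this framing, a higher value of either meta-metric indicates a metric whose scores better respect the objective error structure of the graph.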

Surprisingly, we find that state-of-the-art VLM-based metrics, such as TIFA, DSG, LLMScore, and VIEScore, do not significantly outperform simple feature-based metrics like CLIPScore, particularly on a challenging subset of naturally occurring T2I model errors. TS2 will facilitate the development of better T2I prompt faithfulness metrics by enabling rigorous comparison of how well they conform to expected score orderings and separations under objective criteria.