Great Models Think Alike and this Undermines AI Oversight

1ELLIS Institute Tübingen, 2Max Planck Institute for Intelligent Systems, 3Tübingen AI Center,
4University of Tübingen, 5IIIT Hyderabad, 6Contextual AI, 7Stanford University
°Core contributors
[Main figure]

We propose Chance Adjusted Probabilistic Agreement (CAPA, or κp), a novel metric for model similarity that adjusts for chance agreement due to accuracy. Using CAPA, we find: (1) LLM-as-a-judge scores are biased towards more similar models, even after controlling for the evaluated model's capability. (2) Gains from training strong models on the annotations of weak supervisors (weak-to-strong generalization) are higher when the two models are more different. (3) Concerningly, model errors are becoming more correlated as capabilities increase.

Abstract

As Language Model (LM) capabilities advance, evaluating and supervising them at scale is getting harder for humans. There is hope that other language models can automate both these tasks, which we refer to as AI Oversight. We study how model similarity affects both aspects of AI oversight by proposing Chance Adjusted Probabilistic Agreement (CAPA): a metric for LM similarity based on overlap in model mistakes. Using CAPA, we first show that LLM-as-a-judge scores favor models similar to the judge, generalizing recent self-preference results. Then, we study training on LM annotations, and find complementary knowledge between the weak supervisor and strong student model plays a crucial role in gains from weak-to-strong generalization. As model capabilities increase, it becomes harder to find their mistakes, and we might defer more to AI oversight. However, we observe a concerning trend -- model mistakes are becoming more similar with increasing capabilities, pointing to risks from correlated failures. Our work underscores the importance of reporting and correcting for model similarity, especially in the emerging paradigm of AI oversight.

Similarity Metric

We propose a new metric, Chance Adjusted Probabilistic Agreement, or CAPA, which has three key properties: (1) Two models with 90% accuracy have much less scope to disagree than two models with 50% accuracy; CAPA adjusts for the chance agreement of two independent models with the given accuracies. (2) When both models are wrong, they can still disagree; CAPA compares sample-wise predictions instead of sample-wise correctness. (3) Models provide probabilistic predictions; CAPA incorporates this information.

We compare CAPA to existing similarity metrics along three criteria: whether a metric adjusts for accuracy, distinguishes different mistakes, and incorporates probabilities. The metrics compared are:

%Flips = 1 − c_obs
Cohen's κ, Scott's π, Fleiss' κ
%Agreement
Error Consistency
Pearson / Matthews correlation of errors
Divergence metrics such as KL and JSD
CAPA (κp), ours

Each prior metric lacks at least one of these properties; CAPA satisfies all three.

For CAPA's mathematical definition and theoretical properties, check out our paper. You can compute similarities between different models using our interactive tool!
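To make the idea of chance adjustment concrete, here is a minimal Python sketch of a probabilistic, Cohen's-κ-style agreement score. It is not the exact CAPA formula (CAPA's chance term is derived from the models' accuracies, as defined in the paper); the function name, toy data, and the marginal-based chance term are our own illustrative choices.

import numpy as np

def probabilistic_kappa(p1: np.ndarray, p2: np.ndarray) -> float:
    """Chance-adjusted probabilistic agreement between two models.

    p1, p2: arrays of shape (n_samples, n_options) holding each model's
    predicted probability distribution over answer options per sample.

    Illustrative only: this generalizes Cohen's kappa to probabilistic
    predictions; CAPA's chance term differs (see the paper).
    """
    # Observed agreement: probability that samples drawn from the two
    # per-example distributions pick the same option, averaged over examples.
    obs = np.mean(np.sum(p1 * p2, axis=1))
    # Chance agreement: the same quantity for two independent models that
    # only share their average (marginal) answer distributions.
    exp = np.sum(p1.mean(axis=0) * p2.mean(axis=0))
    return (obs - exp) / (1.0 - exp)

# Toy example: two models on three four-option questions.
m1 = np.array([[0.70, 0.10, 0.10, 0.10],
               [0.20, 0.60, 0.10, 0.10],
               [0.25, 0.25, 0.25, 0.25]])
m2 = np.array([[0.60, 0.20, 0.10, 0.10],
               [0.10, 0.70, 0.10, 0.10],
               [0.10, 0.10, 0.40, 0.40]])
print(probabilistic_kappa(m1, m2))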

Findings

Evaluation: Affinity Bias in LLM-as-a-judge

[Figure: Similarity vs. judgment scores]

LLM-as-a-judge scores show an affinity bias towards more similar models, even after controlling for the evaluated model's capability. For the partial correlation and multiple regression analyses, check out our paper.
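As a rough sketch of this kind of analysis, the snippet below computes a partial correlation between judge scores and judge-model similarity while controlling for model capability, by residualizing both against capability and correlating the residuals. All variable names and numbers are illustrative placeholders; the paper's exact regression setup may differ.

import numpy as np

def partial_corr(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    x, y, z = map(np.asarray, (x, y, z))
    Z = np.column_stack([np.ones_like(z), z])  # intercept + control variable
    # Residualize x and y against z with ordinary least squares.
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

# Illustrative inputs: one entry per evaluated model, for a fixed judge.
judge_scores = np.array([0.62, 0.71, 0.58, 0.80, 0.75])  # LLM-as-a-judge score
similarity   = np.array([0.30, 0.45, 0.25, 0.60, 0.55])  # CAPA to the judge
capability   = np.array([0.55, 0.65, 0.50, 0.78, 0.70])  # ground-truth accuracy

print(partial_corr(judge_scores, similarity, capability))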

Training on LM Annotations Benefits from Complementary Knowledge

[Figure: Similarity vs. gain from weak-to-strong training]

Student models trained on annotations from smaller supervisors show higher performance improvements, or weak-to-strong generalization, when the student and supervisor are less similar. In our paper, we also show that current weak-to-strong training methods have a higher performance ceiling than previously assumed, if they leverage the complementary knowledge between supervisor and student more effectively.

With Increasing Capabilities, Model Errors are Becoming More Correlated

[Figure: Similarity increases as capabilities increase]

CAPA captures whether models make similar mistakes. Analyzing over 100 open-weight models, we find that as model capabilities have increased, so has the average CAPA to models from other developers in the same capability class.
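A rough sketch of the kind of aggregation behind this trend: bucket models into capability classes by accuracy, then average pairwise CAPA over cross-developer pairs within each bucket. The toy data, bucket edges, and column names below are illustrative assumptions, not the paper's setup.

import itertools
import pandas as pd

# Illustrative inputs: per-model metadata and pairwise CAPA scores.
models = pd.DataFrame({
    "model":     ["a1", "a2", "b1", "b2"],
    "developer": ["DevA", "DevA", "DevB", "DevB"],
    "accuracy":  [0.52, 0.81, 0.55, 0.83],
})
capa = {("a1", "b1"): 0.20, ("a1", "b2"): 0.15,
        ("a2", "b1"): 0.18, ("a2", "b2"): 0.45,
        ("a1", "a2"): 0.30, ("b1", "b2"): 0.35}

# Bucket models into capability classes by accuracy.
models["bucket"] = pd.cut(models["accuracy"], bins=[0.0, 0.7, 1.0],
                          labels=["low", "high"])

rows = []
for m1, m2 in itertools.combinations(models.itertuples(index=False), 2):
    if m1.bucket != m2.bucket or m1.developer == m2.developer:
        continue  # keep only cross-developer pairs in the same capability class
    score = capa.get((m1.model, m2.model), capa.get((m2.model, m1.model)))
    rows.append({"bucket": m1.bucket, "capa": score})

pairs = pd.DataFrame(rows)
print(pairs.groupby("bucket", observed=True)["capa"].mean())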

Implications: If the trend of similarity increasing with capabilities continues, it could mean greater risks of affinity bias in evaluations, and lower gains from inter-LLM training.

Overall, as model blind spots get harder to detect and we defer more to AI oversight, models making increasingly similar mistakes pose a risk of correlated failures.

BibTeX

@misc{goel2025greatmodelsthinkalike,
      title={Great Models Think Alike and this Undermines AI Oversight},
      author={Shashwat Goel and Joschka Struber and Ilze Amanda Auzina and Karuna K Chandra and Ponnurangam Kumaraguru and Douwe Kiela and Ameya Prabhu and Matthias Bethge and Jonas Geiping},
      year={2025},
      eprint={2502.04313},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.04313},
}