Great Models Think Alike and this Undermines AI Oversight

1ELLIS Institute Tübingen, 2Max Planck Institute for Intelligent Systems, 3Tübingen AI Center,
4University of Tübingen, 5IIIT Hyderabad, 6Contextual AI, 7Stanford University
°Core contributors
[Main figure]

We propose Chance Adjusted Probabilistic Agreement (CAPA, or κp), a novel metric for model similarity that adjusts for chance agreement due to accuracy. Using CAPA, we find: (1) LLM-as-a-judge scores are biased towards more similar models, even after controlling for the evaluated model's capability. (2) Gains from training strong models on the annotations of weak supervisors (weak-to-strong generalization) are higher when the two models are more different. (3) Concerningly, model errors are becoming more correlated as capabilities increase.

Abstract

As Language Model (LM) capabilities advance, evaluating and supervising them at scale is getting harder for humans. There is hope that other language models can automate both these tasks, which we refer to as AI Oversight. We study how model similarity affects both aspects of AI oversight by proposing Chance Adjusted Probabilistic Agreement (CAPA): a metric for LM similarity based on overlap in model mistakes. Using CAPA, we first show that LLM-as-a-judge scores favor models similar to the judge, generalizing recent self-preference results. Then, we study training on LM annotations, and find complementary knowledge between the weak supervisor and strong student model plays a crucial role in gains from weak-to-strong generalization. As model capabilities increase, it becomes harder to find their mistakes, and we might defer more to AI oversight. However, we observe a concerning trend -- model mistakes are becoming more similar with increasing capabilities, pointing to risks from correlated failures. Our work underscores the importance of reporting and correcting for model similarity, especially in the emerging paradigm of AI oversight.

Similarity Metric

We propose a new metric, Chance Adjusted Probabilistic Agreement, or CAPA, which has three key properties: (1) Two models with 90% accuracy have much less scope to disagree than two models with 50% accuracy; CAPA adjusts for the chance agreement of two independent models with the given accuracies. (2) When both models are wrong, they can still disagree; CAPA compares sample-wise predictions instead of sample-wise correctness. (3) Models provide probabilistic predictions; CAPA incorporates this information.

We compare CAPA to existing similarity metrics along three criteria: whether a metric adjusts for accuracy, distinguishes different mistakes, and incorporates probabilities. The metrics compared are:

%Flips = 1 − c_obs
Cohen's κ, Scott's π, Fleiss' κ
%Agreement
Error Consistency
Pearson / Matthews correlation of errors
Divergence metrics such as KL and JSD
CAPA (κp), ours

Each prior metric lacks at least one of these properties; CAPA satisfies all three.

For CAPA's mathematical definition and theoretical properties, check out our paper. You can compute similarities between different models using our interactive tool!
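To make the idea of chance adjustment concrete, here is a minimal Python sketch of a probabilistic, Cohen's-κ-style agreement score. It is not the exact CAPA formula (CAPA's chance term is derived from the models' accuracies, as defined in the paper); the function name, toy data, and the marginal-based chance term are our own illustrative choices.

import numpy as np

def probabilistic_kappa(p1: np.ndarray, p2: np.ndarray) -> float:
    """Chance-adjusted probabilistic agreement between two models.

    p1, p2: arrays of shape (n_samples, n_options) holding each model's
    predicted probability distribution over answer options per sample.

    Illustrative only: this generalizes Cohen's kappa to probabilistic
    predictions; CAPA's chance term differs (see the paper).
    """
    # Observed agreement: probability that samples drawn from the two
    # per-example distributions pick the same option, averaged over examples.
    obs = np.mean(np.sum(p1 * p2, axis=1))
    # Chance agreement: the same quantity for two independent models that
    # only share their average (marginal) answer distributions.
    exp = np.sum(p1.mean(axis=0) * p2.mean(axis=0))
    return (obs - exp) / (1.0 - exp)

# Toy example: two models on three four-option questions.
m1 = np.array([[0.70, 0.10, 0.10, 0.10],
               [0.20, 0.60, 0.10, 0.10],
               [0.25, 0.25, 0.25, 0.25]])
m2 = np.array([[0.60, 0.20, 0.10, 0.10],
               [0.10, 0.70, 0.10, 0.10],
               [0.10, 0.10, 0.40, 0.40]])
print(probabilistic_kappa(m1, m2))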

Findings

Evaluation: Affinity Bias in LLM-as-a-judge

[Figure: Similarity vs. judgment scores]

LLM-as-a-judge scores show an affinity bias towards more similar models, even after controlling for the evaluated model's capability. For the partial correlation and multiple regression analyses, check out our paper.
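As a rough sketch of this kind of analysis, the snippet below computes a partial correlation between judge scores and judge-model similarity while controlling for model capability, by residualizing both against capability and correlating the residuals. All variable names and numbers are illustrative placeholders; the paper's exact regression setup may differ.

import numpy as np

def partial_corr(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    x, y, z = map(np.asarray, (x, y, z))
    Z = np.column_stack([np.ones_like(z), z])  # intercept + control variable
    # Residualize x and y against z with ordinary least squares.
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

# Illustrative inputs: one entry per evaluated model, for a fixed judge.
judge_scores = np.array([0.62, 0.71, 0.58, 0.80, 0.75])  # LLM-as-a-judge score
similarity   = np.array([0.30, 0.45, 0.25, 0.60, 0.55])  # CAPA to the judge
capability   = np.array([0.55, 0.65, 0.50, 0.78, 0.70])  # ground-truth accuracy

print(partial_corr(judge_scores, similarity, capability))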

Training on LM Annotations Benefits from Complementary Knowledge

[Figure: Similarity vs. gain from weak-to-strong training]

Student models trained on annotations from smaller supervisors show higher performance improvements, or weak-to-strong generalization, when the student and supervisor are less similar. In our paper, we also show that current weak-to-strong training methods have a higher performance ceiling than previously assumed, if they leverage the complementary knowledge between supervisor and student more effectively.

With Increasing Capabilities, Model Errors are Becoming More Correlated

[Figure: Similarity increases as capabilities increase]

CAPA captures whether models make similar mistakes. Analyzing over 100 open-weight models, we find that as model capabilities have increased, so has the average CAPA to models from other developers in the same capability class.
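A rough sketch of the kind of aggregation behind this trend: bucket models into capability classes by accuracy, then average pairwise CAPA over cross-developer pairs within each bucket. The toy data, bucket edges, and column names below are illustrative assumptions, not the paper's setup.

import itertools
import pandas as pd

# Illustrative inputs: per-model metadata and pairwise CAPA scores.
models = pd.DataFrame({
    "model":     ["a1", "a2", "b1", "b2"],
    "developer": ["DevA", "DevA", "DevB", "DevB"],
    "accuracy":  [0.52, 0.81, 0.55, 0.83],
})
capa = {("a1", "b1"): 0.20, ("a1", "b2"): 0.15,
        ("a2", "b1"): 0.18, ("a2", "b2"): 0.45,
        ("a1", "a2"): 0.30, ("b1", "b2"): 0.35}

# Bucket models into capability classes by accuracy.
models["bucket"] = pd.cut(models["accuracy"], bins=[0.0, 0.7, 1.0],
                          labels=["low", "high"])

rows = []
for m1, m2 in itertools.combinations(models.itertuples(index=False), 2):
    if m1.bucket != m2.bucket or m1.developer == m2.developer:
        continue  # keep only cross-developer pairs in the same capability class
    score = capa.get((m1.model, m2.model), capa.get((m2.model, m1.model)))
    rows.append({"bucket": m1.bucket, "capa": score})

pairs = pd.DataFrame(rows)
print(pairs.groupby("bucket", observed=True)["capa"].mean())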

Implications: If the trend of similarity increasing with capabilities continues, it could mean greater risks of affinity bias in evaluations, and lower gains from inter-LLM training.

Overall, as model blind spots get harder to detect and we defer more to AI oversight, models making increasingly similar mistakes pose a risk of correlated failures.

BibTeX

@misc{goel2025greatmodelsthinkalike,
      title={Great Models Think Alike and this Undermines AI Oversight},
      author={Shashwat Goel and Joschka Struber and Ilze Amanda Auzina and Karuna K Chandra and Ponnurangam Kumaraguru and Douwe Kiela and Ameya Prabhu and Matthias Bethge and Jonas Geiping},
      year={2025},
      eprint={2502.04313},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.04313},
}