Scientists grade AI models using the same math as the SAT

Psychology
To determine if a chatbot is lying, researchers are ditching simple checklists and treating the software like a student sitting for a high-stakes university entrance exam.

When an AI model hallucinates, it is not just making a mistake; it is giving a wrong answer on a test that can be scored statistically. To measure this, researchers have turned to psychometrics, the branch of psychology used to design the SAT and GRE. Instead of checking a model's answers against a simple list of truths, they use item response theory. This method estimates a test-taker's hidden ability by weighing each answer against the difficulty and trickiness of the question it belongs to. As a result, a model is judged not just on how many questions it gets right, but on which specific ones it misses compared to a human expert.
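
To make the idea concrete, here is a minimal sketch in Python. It assumes the two-parameter logistic (2PL) form of item response theory, which the article does not specify, and it treats "trickiness" as the discrimination parameter, which is also an assumption. The item values and the two answer sheets are invented for illustration, and the ability estimate is a plain grid search rather than anything the researchers are known to use.

```python
import math

def p_correct(theta, a, b):
    """2PL model: chance that a test-taker of ability theta answers correctly.
    a = discrimination (assumed stand-in for "trickiness"), b = difficulty."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, items, responses):
    """Log-probability of the whole answer sheet at a given ability."""
    total = 0.0
    for (a, b), x in zip(items, responses):
        p = p_correct(theta, a, b)
        total += math.log(p) if x else math.log(1.0 - p)
    return total

def estimate_ability(items, responses, lo=-4.0, hi=4.0, steps=801):
    """Maximum-likelihood ability, found by scanning a grid from lo to hi."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))

# Hypothetical items as (discrimination, difficulty) pairs, easy to hard.
ITEMS = [(0.5, -1.0), (1.0, 0.0), (2.0, 1.0)]

# Two hypothetical models, each with two of three answers correct.
model_a = [1, 1, 0]  # misses the hard, highly discriminating question
model_b = [0, 1, 1]  # misses the easy, weakly discriminating question

ability_a = estimate_ability(ITEMS, model_a)
ability_b = estimate_ability(ITEMS, model_b)
print(f"model A ability: {ability_a:.2f}")
print(f"model B ability: {ability_b:.2f}")
```

Both hypothetical models score two out of three, yet model B receives the higher ability estimate because the question it missed carries less weight. That is the sense in which the specific questions missed matter, not just the raw count.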
