What does it mean to measure intelligence?
by Saffron Huang
As someone who studies the societal impacts of AI at an AI company, I was invited to a workshop this past spring called “Measuring AI in the World” to help “shape a more scientifically-grounded approach to measuring AI systems.” The timing felt urgent: as AI systems based on large language models are used by millions of people for just as many purposes, researchers need to create and share information about how they work, what they are (and aren’t) useful for, and what their impact on society might look like, so that people can make informed decisions about how to interact with them.
Although the stakes of deploying AI appropriately feel increasingly high, there are few reliable or established approaches for measuring AI on any given dimension of interest. LLMs are as complex as economies, if not more so: they emerge from billions of interconnected parameters, and are trained on vast datasets through processes we don’t fully understand. Like other complex systems, their behaviors are hard to predict, and yet we need robust ways to evaluate them as they become more widely used.
Depending on what the AI is being used for, one might care about whether it’s as conversant in Japanese as it is in English, how consistent its behavior is across slight prompt variations, or whether it can accurately extract relevant information from legal documents. All of these require more complex approaches to measurement than quizzing the model on easily verifiable facts; there just aren’t yet standards for what that should look like. Take AI in health care, for example: existing medical AI benchmarks, like MedQA, test textbook knowledge through multiple-choice questions, but they don’t necessarily capture what’s needed for real-world applications—realistic clinical-reasoning skills, accurately synthesizing findings across disparate studies, or the judgment to recognize when to recommend seeking professional medical care rather than relying on a chatbot.

This measurement gap is something that I try to close at work. My job involves figuring out which dimensions of AI are important to understand—both for building more beneficial AI systems and for ensuring society has the information it needs about these technologies—and casting around for ways to measure them: probing models to find (and fix) problematic behavior, analyzing real-world patterns of use and misuse, interviewing people about how AI affects their lives.
AI measurement is a new field, and everything is still under contention—not just how we test but what we should be testing for. Throughout the workshop, among desert ridges shaped like breadcrusts, participants puzzled over fundamental questions like: How do we assess whether AI is “reasoning” like humans do? Is it “truly intelligent”—but what does that mean? Even if we don’t understand its inner workings, could we still accurately predict its impact before unleashing it on the world?