Last time, I wrote about how it’s becoming harder and harder to find useful benchmarks to measure AI:
AI can already play chess and Go better than the top humans; it can already win graduate-level math competitions; it can already reach the top decile in the SAT, LSAT, and creative thinking tests; it can already write papers better than most high schoolers; it can already win art and photography competitions. I’m not the first person to notice that we’re running out of ways to test artificial intelligence, because the smartest AIs are becoming smarter than the smartest human test-writers. Across nearly every intellectual domain, AI has either already surpassed human intelligence or appears to be on the verge of doing so.
Well, there is one test I didn’t mention that AI hasn’t passed (yet) — the Graduate-Level Google-Proof Q&A Benchmark, or GPQA:
We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are “Google-proof”). The questions are also difficult for state-of-the-art AI systems, with our strongest GPT-4–based baseline achieving 39% accuracy. If we are to use future AI systems to help us answer very hard questions—for example, when developing new scientific knowledge—we need to develop scalable oversight methods that enable humans to supervise their outputs, which may be difficult even if the supervisors are themselves skilled and knowledgeable. The difficulty of GPQA both for skilled non-experts and frontier AI systems should enable realistic scalable oversight experiments, which we hope can help devise ways for human experts to reliably get truthful information from AI systems that surpass human capabilities.
In layman's terms, the authors recruited experts (people who hold or are pursuing PhDs in biology, physics, and chemistry) and asked them to write the hardest questions they could: questions so hard that even other PhDs in the same field often get them wrong. Each question was written to be novel, demanding multi-step reasoning rather than recall. In other words, it’s the type of test where you can’t just memorize facts or formulas or look up the answer; you have to think analytically within a very specific domain.
For instance, here are two of their questions from the chemistry section:
Maybe it’s because I’m not a lab science guy, but even if I had 10 hours to work on these problems, I doubt I’d do much better than circling a random answer. Recall that the strongest GPT-4-based baseline answered questions like these with 39% accuracy: a bit better than skilled non-experts (34%), but still far short of the experts (65%). In other words, a chemistry bot could probably out-cook Jesse Pinkman, but it would still lag behind Walter White.
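For what it’s worth, those accuracy numbers measure something very simple: every GPQA question is multiple choice with four options, and a score is just the fraction of questions the test-taker, human or model, picks correctly. Here’s a minimal sketch of that scoring loop in Python; the question record and the `ask_model` function are placeholders I made up for illustration, not the paper’s actual data format or evaluation harness.

```python
import random

# Toy stand-in for the benchmark: real GPQA records pair an expert-written
# question with one correct answer and three expert-written distractors.
questions = [
    {
        "question": "Placeholder chemistry question?",
        "choices": ["option A", "option B", "option C", "option D"],
        "answer": "option C",
    },
    # ...the full benchmark has 448 of these
]

def ask_model(question: str, choices: list[str]) -> str:
    """Placeholder for whichever model is being graded.
    Guessing at random should land near 25% on four-option questions."""
    return random.choice(choices)

def accuracy(dataset) -> float:
    """Fraction of questions where the model's pick matches the answer key."""
    correct = sum(
        ask_model(q["question"], q["choices"]) == q["answer"]
        for q in dataset
    )
    return correct / len(dataset)

print(f"accuracy: {accuracy(questions):.1%}")
```

Swap in a real model call for `ask_model` and the 34%, 39%, and 65% figures above are all just this same ratio computed over the 448 questions.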
This is the Diamond Standard. If there is a better measure of intelligence than this, I don’t know what it is. If a program can consistently out-think scientists who have spent years training in their fields, and can do it across multiple domains at once, then that program clearly possesses superhuman intelligence.
So, how long will that take? How long will it take for AI to reach at least 65% accuracy on these problems? Or ideally, 100%? It’s hard to say. Metaculus forecasts that we’ll likely reach 66% accuracy by the end of this year and 95% accuracy by the end of 2027. There may already exist a model that can do this, and it just hasn’t been released publicly yet — GPT-5, perhaps?
All I know is that at the pace we’re mining, it may not be long before we hit diamonds.