Humanity’s Last Exam Explained – The ultimate AI benchmark that sets the tone of our AI future
Artificial intelligence (AI) has been evolving at breakneck speed, with models like OpenAI’s GPT-4 and DeepSeek’s R1 pushing the boundaries of what machines can do. We are in an era where AI systems can write poetry, diagnose diseases, and even drive cars. Against this backdrop, a new benchmark has emerged with the promise of redefining humanity’s relationship with technology. Dubbed “Humanity’s Last Exam” (HLE), this ambitious evaluation is being hailed as the definitive test of whether AI can match – or surpass – human-level reasoning, creativity, and ethical judgment. But what exactly is Humanity’s Last Exam? Why are experts calling it humanity’s “final test”? And should we be excited or concerned about its implications? Let’s break it down.
Why do we even need a new AI benchmark?
The problem with existing AI benchmarks is pretty simple: models are acing them too easily. Take the Massive Multitask Language Understanding (MMLU) benchmark, for example. It was once the gold standard for evaluating AI general knowledge, but today’s top AI models are hitting 90%+ accuracy on it.
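For the curious, that 90%+ figure is usually nothing more exotic than plain accuracy over a pile of multiple-choice questions. Here’s a minimal sketch of how such a score gets computed – the `ask_model` stub and the sample question are made-up placeholders for illustration, not MMLU’s or HLE’s actual evaluation harness.

```python
# Minimal sketch of multiple-choice benchmark scoring.
# `ask_model` is a stand-in for a real model/API call.

def ask_model(question: str, choices: list[str]) -> str:
    """Placeholder 'model': always picks option A. Swap in a real API call here."""
    return "A"

def score(benchmark: list[dict]) -> float:
    """Fraction of questions the model answers correctly."""
    correct = 0
    for item in benchmark:
        prediction = ask_model(item["question"], item["choices"])
        if prediction == item["answer"]:
            correct += 1
    return correct / len(benchmark)

# One illustrative (made-up) question in the usual four-option format.
sample = [{
    "question": "Which planet is known as the Red Planet?",
    "choices": ["A. Venus", "B. Mars", "C. Jupiter", "D. Mercury"],
    "answer": "B",
}]
print(f"Accuracy: {score(sample):.1%}")
```

Real evaluation harnesses add prompt templates, few-shot examples, and answer-extraction logic, but the headline number is still this kind of simple accuracy.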
Image via Humanity’s Last Exam/Official Webpage
That sounds impressive – until you realize that many of these tests weren’t designed to handle the reasoning, creativity, or multi-modal capabilities (text + image processing) that cutting-edge AI systems are starting to develop. HLE was created specifically to push AI models to their limits. Developed by the Center for AI Safety (CAIS) and Scale AI, it introduces a tougher, more comprehensive challenge that better reflects real-world intelligence.
What makes Humanity’s Last Exam different?
Humanity’s Last Exam is built to simulate real expert-level problem-solving rather than just regurgitating memorized facts. One of its defining features is its massive scale. The exam consists of 3,000 highly challenging questions that span more than 100 different subjects, ranging from mathematics and physics to law, medicine, and philosophy. Unlike many previous AI benchmarks that were primarily designed by researchers, Humanity’s Last Exam’s questions were crowdsourced from a global network of nearly 1,000 experts across 500+ institutions in 50 countries. This diversity ensures that the test reflects a broad spectrum of knowledge domains and problem-solving approaches.
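Because the exam spans so many fields, a single overall score hides a lot; results are more revealing when broken down per subject. Building on the scoring sketch above, here’s one way that breakdown could be computed – the field names (`subject`, `correct`) are assumptions for illustration, not HLE’s published data schema.

```python
from collections import defaultdict

def accuracy_by_subject(results: list[dict]) -> dict[str, float]:
    """Per-subject accuracy from graded answers shaped like
    {"subject": "mathematics", "correct": True} (an assumed layout)."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for r in results:
        totals[r["subject"]] += 1
        hits[r["subject"]] += int(r["correct"])
    return {subj: hits[subj] / totals[subj] for subj in totals}

# Illustrative (made-up) graded answers.
graded = [
    {"subject": "mathematics", "correct": True},
    {"subject": "mathematics", "correct": False},
    {"subject": "law", "correct": False},
]
for subject, acc in accuracy_by_subject(graded).items():
    print(f"{subject}: {acc:.0%}")
```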
Image via Humanity’s Last Exam/Official Webpage
Another major distinction is its multi-modal challenge. While most AI benchmarks focus purely on text-based reasoning, HLE incorporates a mix of both text and image-based questions, with 10% of the exam requiring AI systems to process visual information alongside written context. This added layer of complexity makes it much harder for AI models to succeed using simple pattern recognition alone. Instead, they must demonstrate the ability to integrate different types of information – something that remains a major challenge for even the most advanced AI systems today.
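To get a feel for what a text-plus-image question demands in practice, here’s a rough sketch of how a single multi-modal question could be posed to a vision-capable chat model using the widely used OpenAI-style message format – the model name, question, and image URL are placeholders, and this is not HLE’s official evaluation setup.

```python
# Sketch: posing one multi-modal (text + image) question to a vision-capable model.
# Assumes the `openai` Python package and an API key in the environment.
from openai import OpenAI

client = OpenAI()

question_text = "Based on the diagram, which reaction step is rate-limiting? Answer A, B, C, or D."
image_url = "https://example.com/diagram.png"  # placeholder image

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable chat model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": question_text},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }],
)

print(response.choices[0].message.content)
```

The point of the image-based slice is exactly this: the model can’t answer from the wording alone, so it has to actually read the picture.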
Image via Humanity’s Last Exam/Official Webpage
To further prevent AI from “gaming” the test, some of the toughest questions in HLE are kept hidden from public datasets. This is a critical improvement over older benchmarks, where AI companies could simply train their models on the test questions to artificially boost their scores. By introducing a level of secrecy, HLE ensures that models must exhibit genuine problem-solving ability rather than just memorization.
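One common way researchers guard against this kind of “training on the test” is to check for overlap between a model’s training data and the benchmark questions. Here’s a simplified n-gram overlap check in that spirit – a generic illustration, not the method HLE’s organisers actually use.

```python
# Simplified contamination check: flag benchmark questions whose word n-grams
# also appear in a training corpus. A generic illustration, not HLE's process.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_contaminated(questions: list[str], corpus: str, n: int = 8) -> list[str]:
    """Return questions that share at least one n-gram with the corpus."""
    corpus_grams = ngrams(corpus, n)
    return [q for q in questions if ngrams(q, n) & corpus_grams]

# Illustrative (made-up) data.
training_corpus = "the quick brown fox jumps over the lazy dog near the quiet riverbank at dawn"
benchmark_questions = [
    "Explain why the quick brown fox jumps over the lazy dog near the quiet riverbank at dawn.",
    "Derive the closed-form solution for the given recurrence relation.",
]
print(flag_contaminated(benchmark_questions, training_corpus))
```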
How well do AI models perform on Humanity’s Last Exam?
So far? Not great.
Even the best AI models today are struggling with HLE, with most scoring in the single digits or low double digits. Compare this to older benchmarks like MMLU, where top AI models regularly exceed 90% accuracy, and you can see just how much harder HLE is.
This tells us that while AI models may look impressive based on older tests, they’re still far from mastering complex reasoning and real-world problem-solving. The fact that no model has come close to human-level performance on HLE suggests that we still have a long way to go before AI reaches true expert-level proficiency.
What does this mean for the future of AI?
HLE isn’t just a tougher exam – it’s a reality check for AI development. As AI systems keep improving, benchmarks like this will be essential in separating hype from actual progress.
One of the biggest takeaways from HLE is that it exposes AI weaknesses that still need to be addressed. Today’s models struggle with deep reasoning, multi-modal understanding, and tackling entirely new types of problems. Rather than just showing what AI can do well, HLE provides clear evidence of where it still falls short. This kind of insight is invaluable for researchers and developers looking to build more capable AI systems.
Beyond identifying weaknesses, HLE also helps set a new standard for AI development. AI companies will no longer be able to claim groundbreaking progress based solely on outdated benchmarks. Instead, they’ll have to prove that their models can handle the kinds of real-world challenges that actually matter. This could lead to more meaningful advancements in AI, with models that are better equipped to assist in high-stakes fields like science, medicine, and law.
Perhaps most importantly, HLE introduces a new level of accountability into AI development. There has been growing concern that AI companies are prioritizing flashy but superficial improvements rather than real progress. By creating a much tougher, more transparent benchmark, HLE forces companies to build AI models that actually perform well under pressure, rather than just looking good in controlled settings.
A watershed moment for humanity
AI is advancing faster than ever, but progress isn’t just about getting higher scores on outdated tests. Humanity’s Last Exam is the next evolution in AI benchmarking, forcing models to prove their intelligence in ways that actually matter.
If AI can start excelling at HLE, we’ll know we’re truly moving towards systems that don’t just memorize information, but actually understand and apply it – a major step toward more useful, reliable, and even trustworthy AI.
Until then, expect AI companies to be laser-focused on improving reasoning, problem-solving, and multi-modal capabilities – because if they want to claim their models are the best, they’ll have to pass Humanity’s Last Exam first.
Satvik Pandey
Satvik Pandey is a self-professed Steve Jobs (not Apple) fanboy, a science & tech writer, and a sports addict. At Digit, he works as a Deputy Features Editor, and manages the daily functioning of the magazine. He also reviews audio products (speakers, headphones, soundbars, etc.), smartwatches, projectors, and everything else that he can get his hands on. A media and communications graduate, Satvik is also an avid shutterbug, and when he's not working or gaming, he can be found fiddling with any camera he can get his hands on and helping produce videos – which means he spends an awful amount of time in our studio. His game of choice is Counter-Strike, and he's still attempting to turn pro. He can talk your ear off about the game, and we'd strongly advise you to steer clear of the topic unless you too are a CS junkie.