Some people are naively amazed at AI scoring 99% on bar and medical exams, when all it is doing is reproducing correct answers from internet discussions of the exam questions. A new AI benchmark called “Humanity’s Last Exam” has stumped top models. It will take independent reasoning to get 100% on this test; when that day comes, does it mean AGI will be here?
Finer point, but it’s not measuring independent reasoning; afaik they’re still fully incapable of that. This test is measuring esoteric knowledge, like hummingbird anatomy and the ability to translate ancient Palmyrene writing.
Current LLMs should eventually be able to ace this sort of test as their training data grows. They could still be incapable of independent reasoning, though.
A test for independent reasoning could be something like giving it all the evidence for a never-before-discussed criminal case and asking whether the accused is innocent or guilty based on the evidence. This would require a large amount of context, an understanding of human societies, and the ability to infer from that what the evidence represents. Would it understand that a sound alibi means the accused is likely innocent? Would it actually understand the simple concept that a physical person cannot be in two different places simultaneously, unlike how a quantum particle can seem to be? A person understands this very intuitively, but an LLM does not yet comprehend what “location” even is, even if it can provide a perfect definition of the term from a dictionary and talk about it by repeating others’ conversations.
Anyways, still an interesting project.
“It will take independent reasoning to get 100% on this test”
And an entire university staff. They went around and asked a bunch of PhDs, “What’s the hardest question you can think of?” I like to think I have independent reasoning, and I doubt I could answer one question correctly on this exam, much less 10% of them.
This doesn’t prove AI doesn’t have independent reasoning; it just proves it doesn’t have the obscure knowledge needed to reason about the questions.
Do you think the bar does not require independent reasoning? Granted, I’ve never taken it, but most high-level standardized tests require a lot of reasoning. If you had a completely open book / internet access and took the SAT / ACT without any ability to reason, you’d still fail horribly on the science and math sections.
No, because this test will now be discussed and invalidated for that purpose.
They say the answer to this issue is that they’ve released public sample questions, but the real questions are kept private.
What does this article want to tell me? They’ve devised a test that’s so complicated that current AI only achieves 10%. That’s about all there is. What’s the average human’s score? What’s a PhD-level score? Can AI do anything with that theoretical knowledge and reasoning ability?
But I guess it’s common knowledge at this point that the current benchmarks aren’t cutting it. And it seems from the numbers on the linked website that the reasoning/thinking approach is quite a step up. But that’s not very surprising. I guess you’ll get closer to the truth if you think about something, rather than saying the first thing that comes to mind. Guess it’s the same for AI.
The relevance of this test is that the answers don’t already exist on the internet.
With previous tests, where AI scored 90%, how do we know whether it figured out the right answer or just copied someone else’s from its training data?
This test better measures true independent reasoning.
But that’s kind of always been the issue with AI… the datasets being contaminated with the data from validation, or the benchmarks… I don’t see a fundamental change here. It’s going to be a good benchmark at first, and once the dataset is contaminated, we need a new one… as has been the case with the previous ones… Or am I missing something here? I mean, I don’t want to be overly negative… but up until autumn, you could just ask it to count the number of ‘r’s in ‘strawberry’ and it’d achieve a success rate of 10%. If this is supposed to be something substantial, this isn’t it.
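(For the record, the counting task itself is trivial for ordinary code; a throwaway Python check, purely to underline how basic the task is:)

```python
# Count occurrences of the letter "r" in "strawberry", the task LLMs kept flubbing
print("strawberry".count("r"))  # prints 3
```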
I still don’t get it. And under “Future Model Performance” they say benchmarks quickly get saturated. And maybe it’s going to be the same for this one, and models could achieve 50% by the end of this year… which doesn’t really sound like the “last exam” to me. But maybe it’s more about the approach of coming up with good science questions, and not the exact dataset?
I think the easiest way to explain this is to say they are testing the ability to reason your way to an answer to a question so unique that it doesn’t exist anywhere on the internet.
AI scores really low on subjects it’s never read about. Real shocker there. I’d put money on humans scoring even less on subjects they’ve never heard of.
“I’d put money on humans scoring even less on subjects they’ve never heard of.”
They are testing the ability to reason. The AI, or human, can still use the internet to try to find the answer. Here’s a sample question that illustrates the distinction.
Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.
Failing that question doesn’t mean it can’t independently reason; it just means it doesn’t have the knowledge to reason about it. That question is basically: do you know how many paired tendons are attached to each of those bones, and can you add them up? If the AI, like 99.999% of people, doesn’t know how many tendons are attached to those bones, it can’t reason its way to the answer.
If you give the AI a similar question about something it knows, it can reason through it fine. For example, the question:
How many legs do 13 humans, 4 cats and 63 dogs have in total?
ChatGPT-4o gives the answer:
To calculate the total number of legs:
Humans: Each human has 2 legs. 13 × 2 = 26 legs.
Dogs: Each dog has 4 legs. 63 × 4 = 252 legs.
Cats: Each cat has 4 legs. 4 × 4 = 16 legs.
Now, add them together: 26 + 252 + 16 = 294.
Total legs = 294.
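For what it’s worth, the arithmetic checks out; here’s a quick sanity check in Python, just reproducing the sums above:

```python
# Recompute the leg count from the example: 13 humans, 4 cats, 63 dogs
legs = 13 * 2 + 4 * 4 + 63 * 4
print(legs)  # 294
```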
I guess I can’t guarantee it’s never seen this question before, but I’d say the odds are pretty low, and the odds that it’s doing independent reasoning, as you call it, are high.
That reads like the sort of thing Wolfram Alpha was designed to absolutely obliterate, if only the raw data representing each of those keywords had been loaded in.