What does this article want to tell me? They’ve devised a test so complicated that current AI only achieves 10%. That’s about all there is. What’s the average human’s score? What’s a PhD-level score? Can AI do anything with that theoretical knowledge and reasoning ability?
But I guess it’s common knowledge at this point that the current benchmarks aren’t cutting it. And it seems from the numbers on the linked website that the reasoning/thinking approach is quite a step up. But that’s not very surprising. I guess you’ll get closer to the truth if you think about something rather than saying the first thing that comes to mind. Guess it’s the same for AI.
The relevance of this test is that the answers don’t already exist on the internet.
With previous tests, where AI scored 90%, how do we know whether it figured out the right answer or just copied someone else’s from its training data?
This test better measures true independent reasoning.
But that’s kind of always the issue with AI… The datasets being contaminated with data from validation, or from the benchmarks… I don’t see a fundamental change here? It’s going to be a good benchmark at first, and once the dataset is contaminated, we need a new one… As has been the case with the previous ones… Or am I missing something here? I mean, I don’t want to be overly negative… But up until autumn, you could just ask it to count the number of 'r’s in ‘strawberry’ and it’d achieve a success rate of 10%. If this is supposed to be something substantial, this isn’t it.
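(For reference, the strawberry task is trivial to solve deterministically; it trips up LLMs mainly because tokenizers split words into chunks, hiding individual characters from the model. A minimal Python check:)

```python
# Counting letters is a one-liner in code, which is what made the
# LLM failure so striking: the model never "sees" individual characters,
# only tokens like "straw" + "berry".
word = "strawberry"
r_count = word.count("r")
print(r_count)  # 3
```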
The dataset consists of 3,000 challenging questions across over a hundred subjects. We publicly release these questions, while maintaining a private test set of held out questions to assess model overfitting.
They say they’ve addressed this issue.
I still don’t get it. And under “Future Model Performance” they say benchmarks quickly get saturated. And maybe it’s going to be the same for this one, and models could achieve 50% by the end of this year… Which doesn’t really sound like the “last exam” to me. But maybe it’s more about the approach of coming up with good science questions, and not the exact dataset?
I think the easiest way to explain this is to say they are testing the ability to reason your way to an answer to a question so unique that it doesn’t exist anywhere on the internet.