• hendrik@palaver.p3x.de · 5 days ago

    What does this article want to tell me? They’ve devised a test that’s so complicated that current AI only achieves 10%. That’s about all there is to it. What’s the average human’s score? What’s a PhD-level score? Can AI do anything with that theoretical knowledge and reasoning ability?

    But I guess it’s common knowledge at this point that the current benchmarks aren’t cutting it. And it seems from the numbers on the linked website that the reasoning/thinking approach is quite a step up. That’s not very surprising, though. I guess you get closer to the truth if you think about something rather than saying the first thing that comes to mind. Seems it’s the same for AI.

    • Lugh (OP) · 5 days ago

      The relevance of this test is that the answers don’t already exist on the internet.

      With previous tests, where AI scored 90%, how do we know whether it figured out the right answer or just copied someone else’s from its training data?

      This test better measures true independent reasoning.

      • hendrik@palaver.p3x.de · 5 days ago

        But that’s kind of always the issue with AI: the datasets get contaminated with the validation data, or with the benchmarks themselves. I don’t see a fundamental change here. It’s going to be a good benchmark at first, and once the dataset is contaminated we’ll need a new one, as has been the case with all the previous ones. Or am I missing something here? I don’t want to be overly negative… but up until autumn, you could just ask it to count the number of 'r’s in ‘strawberry’ and it’d achieve a success rate of 10%. If we’re looking for something substantial, this isn’t it.