• Lugh (OP) · 5 days ago

    The relevance of this test is that the answers don't already exist on the internet.

    With previous tests, where AI scored 90%, how do we know it figured out the right answers rather than just copying someone else's from its training data?

    This test better measures true independent reasoning.

    • hendrik@palaver.p3x.de · 5 days ago

      But that's kind of always been the issue with AI: the datasets get contaminated with the validation data, or with the benchmarks themselves. I don't see a fundamental change here. It's going to be a good benchmark at first, and once the dataset is contaminated we'll need a new one, as has been the case with all the previous ones. Or am I missing something? I don't want to be overly negative, but up until autumn you could ask a model to count the number of 'r's in 'strawberry' and it'd achieve a success rate of 10%. If we're looking for something substantial, this isn't it.
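
      For context, here's a minimal sketch of the kind of contamination check this implies: a simple word-level n-gram overlap heuristic. All names and the toy data are illustrative, not taken from any real benchmark or pipeline:

      ```python
      # Sketch of a contamination check: flag benchmark questions whose
      # word-level n-grams already occur verbatim in the training corpus.
      def ngrams(text: str, n: int = 8) -> set:
          """Return the set of word-level n-grams in `text`."""
          words = text.lower().split()
          return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

      def is_contaminated(question: str, corpus_text: str, n: int = 8) -> bool:
          """Flag a question if any of its n-grams appears in the corpus."""
          return not ngrams(question, n).isdisjoint(ngrams(corpus_text, n))

      # Toy example: the "training corpus" already contains the test item.
      corpus = "the answer to this benchmark question is forty two"
      test_q = "what is the answer to this benchmark question"
      print(is_contaminated(test_q, corpus, n=5))  # True -> contaminated
      ```

      Real decontamination pipelines do roughly this at scale with hashed n-grams, but the cat-and-mouse problem is the same either way: once the test items leak into the training data, the score stops measuring reasoning.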