hendrik@palaver.p3x.de · 5 days ago

But that’s kind of always the issue with AI… the datasets get contaminated with the validation data, or with the benchmarks themselves… I don’t see a fundamental change here? It’s going to be a good benchmark at first, and once the dataset is contaminated, we need a new one, as has been the case with the previous ones… Or am I missing something? I don’t want to be overly negative… but up until autumn, you could just ask it to count the number of 'r's in 'strawberry' and it’d achieve a success rate of around 10%. If we’re looking for something substantial, this isn’t it.
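
For what it’s worth, the contamination checks people actually run are often not much more sophisticated than word-level n-gram overlap between benchmark items and the training corpus (the GPT-3 paper used 13-gram matching for its dedup, for instance). A minimal sketch of that idea, just to make the point concrete: all the names here are made up, and a real check would use a far larger corpus and more careful normalization.

```python
# Illustrative sketch of an n-gram contamination check.
# Function names and the toy corpus are hypothetical; real
# evaluations scan terabytes of training data, not a list.

def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams of a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(benchmark_item: str, corpus: list[str], n: int = 13) -> bool:
    """Flag a benchmark item whose n-grams also appear in training data."""
    item_grams = ngrams(benchmark_item, n)
    return any(item_grams & ngrams(doc, n) for doc in corpus)

# A scraped page that happens to quote the benchmark question verbatim:
corpus = ["scraped page: How many r's are in the word strawberry? answer: three"]
print(is_contaminated("How many r's are in the word strawberry?", corpus, n=5))  # True
```

The catch is exactly the one above: the check only helps until the benchmark itself leaks into the next scrape, and then the whole cycle starts over.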