I still don’t get it. Under “Future Model Performance” they say benchmarks quickly get saturated. Maybe it’s going to be the same for this one, and models could achieve 50% by the end of this year, which doesn’t really sound like the “last exam” to me. But maybe it’s more about the approach of coming up with good science questions, and not the exact dataset?
I think the easiest way to explain this is to say they are testing the ability to reason your way to an answer to a question so unique that it doesn’t exist anywhere on the internet.