Researchers tested Large Reasoning Models on various puzzles. As the puzzles got more difficult, the models failed more often, until beyond a certain complexity they all failed completely.

Even without the ability to reason, current AI will still be revolutionary. It can get us to Level 4 self-driving, and it can outperform doctors and many other professionals at their own work. It should make humanoid robots capable of much physical work.

Still, this research suggests the current approach to AI will not lead to AGI, no matter how much training and scaling you apply. That’s a problem for the people throwing hundreds of billions of dollars at this approach, hoping it will pay off with a new AGI Tech Unicorn that rivals Google or Meta in revenue.

Apple study finds “a fundamental scaling limitation” in reasoning models’ thinking abilities

  • mindbleach@sh.itjust.works · 1 day ago

    This headline overstates their prediction.

    “Current reasoning models” means LLMs with goofy prompts and extra training. They’re gonna be weak to any puzzle where the solution is a thousand words long and goes “left right right middle right left.” Like asking it to repeat the word “elephant” forever. The math doesn’t like it. Tiny factors deep in a pile of linear algebra flip out, and the original prompt vanishes into the noise.

    This is kind of silly for puzzles where partial solutions are also valid puzzles. On page two of the paper, Claude burned twenty thousand tokens on Tower of Hanoi with ten disks. A fucking Atari can solve this puzzle. It’s just parity. You’re moving N disks to one of two spaces so you can move disk N+1 to the other. It’s only exponential because you repeat every step for every disk. Each word of the model’s output becomes part of its context. Elephant elephant elephant elephant.
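
    To make the “it’s just parity” point concrete, here’s a minimal Python sketch (mine, not from the paper): the entire move list for ten disks falls out of a few lines of recursion, so the difficulty is length, not reasoning.

    ```python
    # Minimal Tower of Hanoi solver: the whole "thousand-word" solution is
    # generated mechanically by simple recursion.
    def hanoi(n, src="A", dst="C", aux="B"):
        """Yield the 2**n - 1 moves that transfer n disks from src to dst."""
        if n == 0:
            return
        yield from hanoi(n - 1, src, aux, dst)   # park the n-1 smaller disks
        yield (n, src, dst)                      # move disk n to the target peg
        yield from hanoi(n - 1, aux, dst, src)   # restack the smaller disks on top

    moves = list(hanoi(10))
    print(len(moves))   # 1023 moves for ten disks, each one trivially determined
    ```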

    I’d expect different results if they asked for the next move, singular. Maybe if you want the model to swallow the whole elephant by itself, be very Pi (1998) and have it “restate its assumptions” between steps.
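
    For what it’s worth, “the next move, singular” really is cheap: for Tower of Hanoi the k-th move has a closed form computable from k alone. A sketch using the standard bit trick (pegs numbered 0, 1, 2, disks starting on peg 0; textbook stuff, not something from the paper):

    ```python
    # O(1) "next move" for Tower of Hanoi, computed from the move number alone.
    def next_move(k: int):
        disk = (k & -k).bit_length()     # which disk moves: lowest set bit of k
        src = (k & (k - 1)) % 3          # peg it leaves
        dst = ((k | (k - 1)) + 1) % 3    # peg it lands on
        return disk, src, dst

    # With this numbering an n-disk tower ends on peg 2 when n is odd and on
    # peg 1 when n is even, which is the parity point above.
    print([next_move(k) for k in range(1, 8)])   # the full 7-move solution for 3 disks
    ```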

    Model types with very long context, like whatever happened to Mamba, should at worst fail similarly, only at much higher degrees of complexity. Text diffusion is probably limited to smaller outputs, since revising the whole thing at once is kinda the point, but it could still catch bad explanations for the next step. I fully do not understand how “continuous thought machines” work, but incrementally approaching very large puzzles sounds like their whole deal.

    “AGI will never come from LLMs, specifically” is a dead easy claim to believe. Please avoid making it sound like “neural networks are altogether hosed.”

    • Rin@lemm.ee · 1 day ago

      They’re gonna be weak to any puzzle where the solution is a thousand words long

      I did a test: I made my own puzzle in the form of a chessboard. Black pieces meant 0s and white pieces meant 1s. Read right to left, top to bottom, the board encoded an ASCII string. No AI I’ve tried (even o3 and o1-pro at max reasoning) could solve this puzzle without huge, huge hand-holding. A human could figure it out within 30 minutes, I’d say.
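
      Roughly, the decoding step looks like the sketch below (a Python illustration; the board text and bit layout here are placeholders, not my actual puzzle):

      ```python
      # Illustrative encode/decode pair for the chessboard puzzle: white = 1,
      # black = 0, bits read right to left within each row, rows top to bottom.
      def encode_board(text):
          bits = "".join(f"{ord(c):08b}" for c in text).ljust(64, "0")   # 8 chars max on 8x8
          rows = [bits[i:i + 8] for i in range(0, 64, 8)]
          # reverse each row so the first bit sits on the rightmost square
          return ["".join("W" if b == "1" else "B" for b in row)[::-1] for row in rows]

      def decode_board(rows):
          bits = "".join("1" if sq == "W" else "0" for row in rows for sq in reversed(row))
          return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, 64, 8) if int(bits[i:i + 8], 2))

      print(decode_board(encode_board("HI LLM")))   # -> "HI LLM"
      ```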

      “AGI will never come from LLMs, specifically” is a dead easy claim to believe. Please avoid making it sound like “neural networks are altogether hosed.”

      Of course, but a lot of people (ahem, Fuck AI, ahem) don’t seem to understand this. They’ll just circle-jerk themselves until their dicks fall off. They see this as “computers will never think”. Also, I’ve seen statistical models do crazy shit for the benefit of humanity. For example, reconstructing a human heart from MRI images and compiling reports that would otherwise take doctors hours, and doing it more accurately than a doctor would. But again, that’s because that model was not text-based.