• @threelonmusketeers@sh.itjust.works
    24 months ago

    With V-JEPA, we mask out a large portion of a video so the model is only shown a little bit of the context. We then ask the predictor to fill in the blanks of what’s missing—not in terms of the actual pixels, but rather as a more abstract description in this representation space.

    Hmm, it looks like it aims to do for video what chatbot LLMs do for text or what content-aware fill does for images. A useful tool, to be sure, but the link to AGI seems a bit tenuous.
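
    The quoted idea (predict masked content in a learned representation space rather than in pixel space) can be sketched in a few lines. This is a toy illustration only: the random linear maps stand in for the encoders and predictor, which in the real model are large transformers, and all names and dimensions here are made up for the example.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Toy "video": 16 patches, each a 64-dim vector of raw pixel values.
    patches = rng.normal(size=(16, 64))

    # Stand-ins for the target encoder and the context encoder:
    # random linear maps into a 32-dim representation space.
    W_target = rng.normal(size=(64, 32)) / 8.0
    W_context = rng.normal(size=(64, 32)) / 8.0

    # Mask out a large portion of the patches; the model only sees the rest.
    mask = np.zeros(16, dtype=bool)
    mask[4:14] = True  # 10 of 16 patches hidden

    context_repr = patches[~mask] @ W_context  # what the model is shown
    target_repr = patches[mask] @ W_target     # what it must predict

    # Toy predictor: guesses each masked representation from the mean
    # context representation (a real predictor conditions on positions too).
    W_pred = rng.normal(size=(32, 32)) / 6.0
    pred = np.tile(context_repr.mean(axis=0) @ W_pred, (mask.sum(), 1))

    # The key point from the quote: the loss compares predicted and actual
    # *representations*, never reconstructed pixels.
    loss = np.mean((pred - target_repr) ** 2)
    print(loss)
    ```

    The content-aware-fill analogy breaks exactly at the last step: fill tools score themselves on pixels, while this objective never reconstructs pixels at all.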