We’re releasing the Video Joint Embedding Predictive Architecture (V-JEPA) model, a crucial step in advancing machine intelligence with a more grounded understanding of the world.
With V-JEPA, we mask out a large portion of a video so the model is shown only a small amount of context. We then ask the predictor to fill in the blanks of what's missing, not in terms of the actual pixels, but as a more abstract description in the model's representation space.
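To make the idea concrete, here is a minimal sketch of masked prediction in representation space. The module names, tensor shapes, and masking ratio are illustrative assumptions, not the actual V-JEPA implementation; the point is only that the loss is computed on latents of the masked positions rather than on pixels.

```python
# Minimal sketch of latent-space masked prediction (illustrative, not V-JEPA's code).
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for the video encoder: maps patch tokens to latent vectors."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, tokens):            # tokens: (batch, num_patches, dim)
        return self.net(tokens)

class TinyPredictor(nn.Module):
    """Stand-in for the predictor: fills in latents for the masked patches."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, context_latents):
        return self.net(context_latents)

batch, num_patches, dim = 2, 16, 64
video_tokens = torch.randn(batch, num_patches, dim)    # pretend patchified video

# Hide most patches; only the small visible context goes to the context encoder.
mask = torch.rand(batch, num_patches) < 0.75            # True = masked out

context_encoder = TinyEncoder(dim)
target_encoder = TinyEncoder(dim)                        # typically an EMA copy; frozen here
predictor = TinyPredictor(dim)

with torch.no_grad():                                    # targets are latents, not pixels
    target_latents = target_encoder(video_tokens)

visible_tokens = video_tokens.masked_fill(mask.unsqueeze(-1), 0.0)
context_latents = context_encoder(visible_tokens)
predicted_latents = predictor(context_latents)

# The loss compares predicted and target latents only at the masked positions.
loss = nn.functional.l1_loss(predicted_latents[mask], target_latents[mask])
loss.backward()
print(f"latent prediction loss: {loss.item():.4f}")
```

Because the target is an abstract representation rather than raw pixels, the model is not penalized for failing to reproduce unpredictable low-level detail, which is the stated motivation for predicting in representation space.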
Hmm, it looks like it aims to do for videos what chatbot LLMs do for text, or what content-aware fill does for images. A useful tool, to be sure, but the link to AGI seems a bit tenuous to me.