Are Vision-Language Models Ready for Physical AI?

Humans easily understand how objects move, rotate, and shift, yet current vision-language models still make mistakes in seemingly simple situations: deciding “left” versus “right” for a moving object, recognizing how a change of perspective alters a scene, or keeping track of motion over time. To expose these limitations, we created VLM4D, a benchmark of real-world and synthetic videos, each paired with questions about motion, rotation, perspective, and continuity. Modern vision-language models perform far below human level on these challenges, especially when multiple visual cues must be combined or the order of events must be maintained. There is cause for optimism, however: promising approaches such as reconstructing visual features in 4D and fine-tuning targeted at space and time yield noticeable improvements, bringing us closer to AI that truly understands a dynamic physical world.
Speaker
Shijie Zhou
final-year PhD candidate
UCLA
Shijie Zhou is a final-year PhD candidate at UCLA, recipient of the 2026 Dissertation Year Award and the Graduate Dean’s Scholar Award. His research focuses on spatial intelligence, spanning 3D/4D scene reconstruction and generation, vision-language models, generative AI, and interactive agentic systems. His work has been recognized at top conferences including CVPR, ICCV, ECCV, ICLR, and NeurIPS, and has also led to practical impact through research internships at Google and Apple.
Bio from: Nov 24 - Best of ICCV (Day 4)
Talks & appearances
Showing 1 of 7 activities