talk-data.com

Topic: vision-language models (3 tagged)

Activity Trend: 2020-Q1 to 2026-Q1 (peak 1 activity/qtr)

Activities: 3 · Newest first

Are Vision-Language Models Ready for Physical AI?

Humans easily understand how objects move, rotate, and shift, yet current AI models that connect vision and language still make mistakes in what seem like simple situations: deciding “left” versus “right” when something is moving, recognizing how perspective changes, or keeping track of motion over time. To reveal these kinds of limitations, we created VLM4D, a testing suite made up of real-world and synthetic videos, each paired with questions about motion, rotation, perspective, and continuity. When we put modern vision-language models through these challenges, they performed far below human levels, especially when visual cues had to be combined or the sequence of events had to be maintained. But there is hope: new methods such as reconstructing visual features in 4D and fine-tuning focused on space and time show noticeable improvement, bringing us closer to AI that truly understands a dynamic physical world.
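The evaluation protocol described above (videos paired with motion, rotation, perspective, and continuity questions, scored per category) can be pictured as a simple loop. The sketch below is illustrative only: `load_benchmark_items` and `ask_vlm` are hypothetical placeholders for whatever dataset loader and model inference call you actually use; VLM4D itself is not assumed to expose this API.

```python
# Minimal sketch of a VLM4D-style evaluation loop (illustrative only).
# `load_benchmark_items` and `ask_vlm` are hypothetical placeholders for the
# real dataset loader and vision-language-model inference call.
from collections import defaultdict

def evaluate(load_benchmark_items, ask_vlm):
    """Score a VLM on (video, question, answer) items, broken down by
    question category (motion, rotation, perspective, continuity)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in load_benchmark_items():  # each item: video, question, gold answer, category
        prediction = ask_vlm(item["video"], item["question"])  # model call (placeholder)
        total[item["category"]] += 1
        if prediction.strip().lower() == item["answer"].strip().lower():
            correct[item["category"]] += 1
    return {category: correct[category] / total[category] for category in total}
```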

PDFs are packed with text, tables, and images, but extracting insights from them isn’t easy. Traditional methods involve multiple components such as OCR and task-specific models, which makes them complex and hard to scale. Vision-Language Models like ColPali simplify this by representing all modalities in a unified format. In this session, you’ll see how ColPali can be combined with OpenSearch to enable conversational search over rich PDF content. We’ll also showcase a live demo to bring this concept to life.
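As a rough illustration of the pipeline the session describes, the sketch below indexes per-page embeddings into an OpenSearch k-NN index and retrieves the most relevant pages for a question. It uses the real opensearch-py client, but `embed_page` and `embed_query` are hypothetical stand-ins for ColPali inference; ColPali natively produces multi-vector, late-interaction embeddings, so pooling to a single 128-dim vector per page here is a simplification, not the session's actual method.

```python
# Sketch: page-level retrieval over PDF page embeddings with OpenSearch k-NN.
# `embed_page` / `embed_query` are hypothetical stand-ins for ColPali inference,
# pooled to one 128-dimensional vector per page for simplicity.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
INDEX = "pdf-pages"

# k-NN index with one dense vector per PDF page plus page metadata.
client.indices.create(
    index=INDEX,
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "embedding": {"type": "knn_vector", "dimension": 128},
                "doc_id": {"type": "keyword"},
                "page": {"type": "integer"},
            }
        },
    },
)

def index_page(doc_id: str, page: int, page_image) -> None:
    """Embed one rendered PDF page and store it as a retrievable document."""
    vector = embed_page(page_image)  # placeholder for ColPali page embedding
    client.index(index=INDEX, body={"doc_id": doc_id, "page": page, "embedding": vector})

def search_pages(question: str, k: int = 5):
    """Return the k pages whose embeddings are closest to the question embedding."""
    vector = embed_query(question)  # placeholder for ColPali query embedding
    response = client.search(
        index=INDEX,
        body={"size": k, "query": {"knn": {"embedding": {"vector": vector, "k": k}}}},
    )
    return [hit["_source"] for hit in response["hits"]["hits"]]
```

The retrieved pages would then be handed to a generative model to produce the conversational answer, which is the part the live demo covers.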