talk-data.com

Topic: vision-language models (3 tagged)

Activity Trend: 2020-Q1 to 2026-Q1 (peak 1 activity/qtr)

Activities: 3 · Newest first

Are Vision-Language Models Ready for Physical AI?

Humans easily understand how objects move, rotate, and shift, yet current AI models that connect vision and language still make mistakes in what seem like simple situations: deciding “left” versus “right” when something is moving, recognizing how perspective changes, or keeping track of motion over time. To reveal these kinds of limitations, we created VLM4D, a testing suite made up of real-world and synthetic videos, each paired with questions about motion, rotation, perspective, and continuity. When we put modern vision-language models through these challenges, they performed far below human levels, especially when visual cues had to be combined or the sequence of events had to be maintained. But there is hope: new methods such as reconstructing visual features in 4D and fine-tuning focused on space and time show noticeable improvement, bringing us closer to AI that truly understands a dynamic physical world.
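The evaluation protocol described above (videos paired with motion, rotation, perspective, and continuity questions, scored per category) can be pictured as a simple loop. The sketch below is illustrative only: `load_benchmark_items` and `ask_vlm` are hypothetical placeholders for whatever dataset loader and model inference call you actually use; VLM4D itself is not assumed to expose this API.

```python
# Minimal sketch of a VLM4D-style evaluation loop (illustrative only).
# `load_benchmark_items` and `ask_vlm` are hypothetical placeholders for the
# real dataset loader and vision-language-model inference call.
from collections import defaultdict

def evaluate(load_benchmark_items, ask_vlm):
    """Score a VLM on (video, question, answer) items, broken down by
    question category (motion, rotation, perspective, continuity)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in load_benchmark_items():  # each item: video, question, gold answer, category
        prediction = ask_vlm(item["video"], item["question"])  # model call (placeholder)
        total[item["category"]] += 1
        if prediction.strip().lower() == item["answer"].strip().lower():
            correct[item["category"]] += 1
    return {category: correct[category] / total[category] for category in total}
```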

PDFs are packed with text, tables, and images, but extracting insights from them isn’t easy. Traditional methods involve multiple components such as OCR and task-specific models, which makes them complex and hard to scale. Vision-Language Models like ColPali simplify this by representing all modalities in a unified format. In this session, you’ll see how ColPali can be combined with OpenSearch to enable conversational search over rich PDF content. We’ll also showcase a live demo to bring this concept to life.
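As a rough illustration of the pipeline the session describes, the sketch below indexes per-page embeddings into an OpenSearch k-NN index and retrieves the most relevant pages for a question. It uses the real opensearch-py client, but `embed_page` and `embed_query` are hypothetical stand-ins for ColPali inference; ColPali natively produces multi-vector, late-interaction embeddings, so pooling to a single 128-dim vector per page here is a simplification, not the session's actual method.

```python
# Sketch: page-level retrieval over PDF page embeddings with OpenSearch k-NN.
# `embed_page` / `embed_query` are hypothetical stand-ins for ColPali inference,
# pooled to one 128-dimensional vector per page for simplicity.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
INDEX = "pdf-pages"

# k-NN index with one dense vector per PDF page plus page metadata.
client.indices.create(
    index=INDEX,
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "embedding": {"type": "knn_vector", "dimension": 128},
                "doc_id": {"type": "keyword"},
                "page": {"type": "integer"},
            }
        },
    },
)

def index_page(doc_id: str, page: int, page_image) -> None:
    """Embed one rendered PDF page and store it as a retrievable document."""
    vector = embed_page(page_image)  # placeholder for ColPali page embedding
    client.index(index=INDEX, body={"doc_id": doc_id, "page": page, "embedding": vector})

def search_pages(question: str, k: int = 5):
    """Return the k pages whose embeddings are closest to the question embedding."""
    vector = embed_query(question)  # placeholder for ColPali query embedding
    response = client.search(
        index=INDEX,
        body={"size": k, "query": {"knn": {"embedding": {"vector": vector, "k": k}}}},
    )
    return [hit["_source"] for hit in response["hits"]["hits"]]
```

The retrieved pages would then be handed to a generative model to produce the conversational answer, which is the part the live demo covers.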