talk-data.com

Topic: multimodal ai · 5 tagged

Activity Trend: 1 peak/qtr, 2020-Q1 to 2026-Q1

Activities

5 activities · Newest first

Are Vision-Language Models Ready for Physical AI?

Humans easily understand how objects move, rotate, and shift, while current AI models that connect vision and language still make mistakes in seemingly simple situations: deciding “left” versus “right” when something is moving, recognizing how perspective changes, or keeping track of motion over time. To reveal these limitations, we created VLM4D, a testing suite of real-world and synthetic videos, each paired with questions about motion, rotation, perspective, and continuity. When we put modern vision-language models through these challenges, they performed far below human levels, especially when visual cues must be combined or the sequence of events must be maintained. But there is hope: new methods such as reconstructing visual features in 4D and fine-tuning focused on space and time show noticeable improvement, bringing us closer to AI that truly understands a dynamic physical world.
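The abstract does not describe VLM4D's file format or evaluation interface; as a rough illustration of how a video-QA benchmark like this is typically scored, here is a minimal sketch. The annotation filename, the field names, and the answer_question stub are assumptions for illustration, not the benchmark's actual API.

```python
import json

def answer_question(video_path: str, question: str, options: list[str]) -> str:
    """Placeholder for a vision-language model call (hypothetical).

    A real harness would decode the video, build a multimodal prompt
    from the frames and the question, and return the chosen option.
    """
    return options[0]  # stub: always pick the first option

def evaluate(annotation_file: str) -> float:
    """Compute multiple-choice accuracy over video QA pairs."""
    with open(annotation_file) as f:
        # Assumed layout: a list of {video, question, options, answer} records.
        samples = json.load(f)

    correct = 0
    for s in samples:
        prediction = answer_question(s["video"], s["question"], s["options"])
        correct += prediction == s["answer"]
    return correct / len(samples)

if __name__ == "__main__":
    print(f"accuracy: {evaluate('vlm4d_annotations.json'):.3f}")
```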

A 60-minute live session exploring how multimodal AI is reshaping industries, driving business intelligence, and creating new competitive advantages. Learn how multimodal AI differs from traditional models, which real-world business use cases are transforming operations, how to prepare your organization for the multimodal future, and which tools and skills matter for leveraging AI innovation.

We present a multimodal AI pipeline to streamline patient selection and quality assessment for radiology AI development. Our system evaluates patient clinical histories, imaging protocols, and data quality, embedding the results into imaging metadata. Using FiftyOne, researchers can rapidly filter and assemble high-quality cohorts in minutes instead of weeks, freeing radiologists for clinical work and accelerating AI tool development.
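The abstract says cohort filtering happens in FiftyOne; a minimal sketch of what that filtering step might look like follows. The dataset name and the metadata field names ("history_relevant", "protocol", "quality_score") are assumptions for illustration, since the actual pipeline embeds its results under its own schema.

```python
import fiftyone as fo
from fiftyone import ViewField as F

# Load a dataset whose samples carry the pipeline's embedded quality
# and clinical-history fields (dataset name is hypothetical).
dataset = fo.load_dataset("radiology-candidates")

# Filter to a high-quality cohort: relevant clinical history,
# the target imaging protocol, and a quality score above threshold.
cohort = dataset.match(
    (F("history_relevant") == True)
    & (F("protocol") == "CT chest with contrast")
    & (F("quality_score") > 0.8)
)

print(f"cohort size: {len(cohort)}")

# Inspect the assembled cohort interactively in the FiftyOne App.
session = fo.launch_app(cohort)
```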

Gemini is the most capable and general model Google has ever built. It was built from the ground up to be multimodal, which means it can generalize and seamlessly understand, operate across, and combine different types of information, including text, code, images, and video. This talk dives into the exciting world of Gemini, a cutting-edge foundation model developed by Google. Discover how Gemini seamlessly integrates text and image processing, enabling you to:

- Analyze and understand the content of images, videos, and audio files
- Perform cross-modal tasks like image captioning and visual question-answering
- Explore the potential of multimodality for various applications, from creative content generation to advanced information retrieval.
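As a concrete illustration of the cross-modal tasks listed above, here is a minimal visual question-answering sketch using Google's google-generativeai Python SDK. The model name, image file, and prompt are illustrative choices, not details from the talk.

```python
import google.generativeai as genai
from PIL import Image

# Configure the SDK with your API key (placeholder shown here).
genai.configure(api_key="YOUR_API_KEY")

# Any available multimodal Gemini model works; this name is illustrative.
model = genai.GenerativeModel("gemini-1.5-flash")

# Cross-modal task: visual question-answering over a local image.
image = Image.open("example_photo.png")
response = model.generate_content(
    ["What objects are visible in this image, and how are they arranged?", image]
)
print(response.text)
```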