talk-data.com

People (124 results)

Companies (9 results)

Azul Systems: 1 speaker (Senior Developer Advocate)
T Systems: 1 speaker (CTO Cloud Services)
BAE Systems: 1 speaker (BI & AI Capabilities Delivery Manager)

Activities & events

Join us for a virtual event to hear talks from experts on the latest developments in Visual Document AI.

Date and Location

Nov 6, 2025, 9-11 AM Pacific. Online. Register for the Zoom!

Document AI: A Review of the Latest Models, Tasks and Tools

In this talk, we'll go through everything document AI: trends, models, tasks, and tools. By the end of the talk, you'll be ready to start building apps based on document models.
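For a concrete starting point, here is a minimal sketch (not material from the talk) of querying a document image with the Hugging Face transformers document-question-answering pipeline. The model choice and file name are illustrative assumptions, and LayoutLM-based models additionally require an OCR backend such as Tesseract with pytesseract installed.

    # Minimal sketch: question answering over a document image using the
    # transformers "document-question-answering" pipeline.
    # Requires: transformers, torch, pillow, pytesseract (plus a Tesseract install).
    from transformers import pipeline

    doc_qa = pipeline(
        "document-question-answering",
        model="impira/layoutlm-document-qa",  # illustrative model choice
    )

    # "invoice.png" is a placeholder path to a scanned document page
    answers = doc_qa(image="invoice.png", question="What is the invoice number?")
    print(answers[0]["answer"], answers[0]["score"])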

About the Speaker

Merve Noyan works on multimodal AI and computer vision at Hugging Face, and she is the author of the O'Reilly book Vision Language Models.

Run Document VLMs in Voxel51 with the VLM Run Plugin — PDF to JSON in Seconds

The new VLM Run Plugin for Voxel51 enables seamless execution of document vision-language models directly within the Voxel51 environment. This integration transforms complex document workflows — from PDFs and scanned forms to reports — into structured JSON outputs in seconds. By treating documents as images, our approach remains general, scalable, and compatible with any visual model architecture. The plugin connects visual data curation with model inference, empowering teams to run, visualize, and evaluate document understanding models effortlessly. Document AI is now faster, reproducible, and natively integrated into your Voxel51 workflows.
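To make the "documents as images, outputs as structured JSON" idea concrete, here is a rough sketch using FiftyOne's core Python API. The plugin's actual operators and parameters are not shown; extract_document_json is a hypothetical placeholder for the VLM Run call, and the directory path is an assumption.

    import json

    import fiftyone as fo

    # Load rendered document pages (PDFs converted to images) as a FiftyOne dataset
    dataset = fo.Dataset.from_images_dir("/path/to/document_pages", name="documents")

    def extract_document_json(image_path):
        """Hypothetical stand-in for the VLM Run document model call."""
        # In practice this would send the page image to a document VLM and
        # return structured fields (e.g., invoice totals, dates, parties).
        return {"fields": {}, "source": image_path}

    # Attach the structured output to each sample so it can be inspected in the App
    for sample in dataset:
        sample["extracted_json"] = json.dumps(extract_document_json(sample.filepath))
        sample.save()

    # Browse pages alongside their extracted fields
    session = fo.launch_app(dataset)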

About the Speaker

Dinesh Reddy is a founding team member of VLM Run, where he is helping nurture the platform from a sapling into a robust ecosystem for running and evaluating vision-language models across modalities. Previously, he was a scientist at Amazon AWS AI, working on large-scale machine learning systems for intelligent document understanding and visual AI. He completed his Ph.D. at the Robotics Institute, Carnegie Mellon University, focusing on combining learning-based methods with 3D computer vision for in-the-wild data. His research has been recognized with the Best Paper Award at IEEE IVS 2021 and fellowships from Amazon Go and Qualcomm.

CommonForms: Automatically Making PDFs Fillable

Converting static PDFs into fillable forms remains a surprisingly difficult task, even with the best commercial tools available today. We show that with careful dataset curation and model tuning, it is possible to train high-quality form field detectors for under $500. As part of this effort, we introduce CommonForms, a large-scale dataset of nearly half a million curated form images. We also release a family of highly accurate form field detectors, FFDNet-S and FFDNet-L.
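As a rough sketch of the pipeline described here (render a PDF page to an image, then detect form fields on it), the snippet below uses pdf2image for rendering; detect_form_fields is a hypothetical placeholder, since the FFDNet release defines the actual detector API.

    from pdf2image import convert_from_path  # renders PDF pages as PIL images

    def detect_form_fields(page_image):
        """Hypothetical placeholder for an FFDNet-style form field detector.

        Expected to return boxes like (x0, y0, x1, y1, field_type), where
        field_type might be a text field, checkbox, or signature field.
        """
        return []

    # "static_form.pdf" is a placeholder path to a non-fillable PDF
    pages = convert_from_path("static_form.pdf", dpi=200)
    for page_number, page_image in enumerate(pages, start=1):
        fields = detect_form_fields(page_image)
        print(f"page {page_number}: {len(fields)} candidate form fields")
        # A downstream step would write these boxes back into the PDF as
        # interactive form widgets to make the document fillable.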

About the Speaker

Joe Barrow is a researcher at Pattern Data, specializing in document AI and information extraction. He previously worked at the Adobe Document Intelligence Lab after receiving his PhD from the University of Maryland in 2022.

Visual Document Retrieval: How to Cluster, Search and Uncover Biases in Document Image Datasets Using Embeddings

In this talk, you'll learn about the task of visual document retrieval and the models widely used by the community, and see them in action through the open source FiftyOne App. You'll learn how to use these models to identify groups and clusters of documents, find unique documents, uncover biases in your visual document dataset, and search over your document corpus using natural language.
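Here is a minimal sketch of that kind of workflow using the FiftyOne Brain, assuming document page images are already on disk; the dataset path and the choice of a CLIP zoo model are assumptions rather than the talk's prescribed setup.

    import fiftyone as fo
    import fiftyone.brain as fob

    # Assume document pages have already been rendered to images on disk
    dataset = fo.Dataset.from_images_dir("/path/to/document_pages", name="doc-retrieval")

    # Build a similarity index with a CLIP model from the FiftyOne model zoo;
    # CLIP also enables natural-language queries over the corpus
    fob.compute_similarity(dataset, model="clip-vit-base32-torch", brain_key="doc_sim")

    # Score how unique each page is relative to the rest of the corpus
    fob.compute_uniqueness(dataset)

    # Natural-language search over the document corpus
    view = dataset.sort_by_similarity(
        "scanned invoice with a handwritten signature", k=25, brain_key="doc_sim"
    )

    # Explore clusters, near-duplicates, and potential dataset biases interactively
    session = fo.launch_app(view)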

About the Speaker

Harpreet Sahota is a hacker-in-residence and machine learning engineer with a passion for deep learning and generative AI. He has a deep interest in VLMs, Visual Agents, Document AI, and Physical AI.

Nov 6 - Visual Document AI: Because a Pixel is Worth a Thousand Tokens

Emilie Nenquin – Head of Data & Intelligence @ VRT, Stijn Dolphen – Team Lead & Analytics Engineer @ Dataroots

In this episode, we explore how public media can build scalable, transparent, and mission-driven data infrastructure with Emilie Nenquin, Head of Data & Intelligence at VRT, and Stijn Dolphen, Team Lead & Analytics Engineer at Dataroots. Emilie shares how she architected VRT’s data transformation from the ground up: evolving from basic analytics to a full-stack data organization with 45+ specialists across engineering, analytics, AI, and user management. We dive into the strategic shift from Adobe Analytics to Snowplow, and what it means to own your data pipeline in a public service context. Stijn joins to unpack the technical decisions behind VRT’s current architecture, including real-time event tracking, metadata modeling, and integrating 70+ digital platforms into a unified ecosystem.

💡 Topics include:

  • Designing data infrastructure for transparency and scale
  • Building a modular, privacy-conscious analytics stack
  • Metadata governance across fragmented content systems
  • Recommendation systems for discovery, not just engagement
  • The circular relationship between data quality and AI performance
  • Applying machine learning in service of cultural and civic missions

Whether you're leading a data team, rethinking your stack, or exploring ethical AI in media, this episode offers practical insights into how data strategy can align with public value.

Tags: Adobe Analytics, AI/ML, Analytics, Data Quality, Snowplow
DataTopics: All Things Data, AI & Tech

Join us for a virtual event to hear talks from experts on the current state of visual agents.

When

Aug 7, 2025 at 9 AM Pacific

Where

Virtual. Register for the Zoom.

Foundational capabilities and models for generalist agents for computers

As we move toward a future where language agents can operate software, browse the web, and automate tasks across digital environments, a pressing challenge emerges: how do we build foundational models that can act as generalist agents for computers? In this talk, we explore the design of such agents—ones that combine vision, language, and action to understand complex interfaces and carry out user-intent accurately.

We present OmniACT as a case study, a benchmark that grounds this vision by pairing natural language prompts with UI screenshots and executable scripts for both desktop and web environments. Through OmniACT, we examine the performance of today’s top language and multimodal models, highlight the limitations in current agent behavior, and discuss research directions needed to close the gap toward truly capable, general-purpose digital agents.
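For intuition, the snippet below sketches the shape of one benchmark example in this style: a natural-language prompt paired with a UI screenshot and a gold executable script. The field names and the PyAutoGUI-style script are illustrative assumptions, not OmniACT's actual schema.

    from dataclasses import dataclass

    @dataclass
    class UITask:
        """Illustrative shape of one example; field names are assumed."""
        prompt: str           # natural-language instruction for the agent
        screenshot_path: str  # UI screenshot the agent must ground its actions in
        gold_script: str      # executable action script that completes the task

    example = UITask(
        prompt="Open the settings menu and enable dark mode",
        screenshot_path="screens/desktop_001.png",
        gold_script="pyautogui.click(x=412, y=87)\npyautogui.click(x=300, y=512)",
    )
    print(example.prompt)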

About the Speaker

Raghav Kapoor works in machine learning at Adobe, where he is on the Brand Services team, contributing to cutting-edge projects in brand intelligence. His work blends research with machine learning, reflecting his deep expertise in both areas. Prior to joining Adobe, Raghav earned his Master’s degree from Carnegie Mellon University, where his research focused on multimodal machine learning and web-based agents. He also brings industry experience from his time as a strategist at Goldman Sachs India.

BEARCUBS: Evaluating Web Agents' Real-World Information-Seeking Abilities

The talk focuses on the challenges of evaluating AI agents in dynamic web settings, the design and implementation of the BEARCUBS benchmark, and insights gained from human and agent performance comparisons. In the talk, we will discuss the significant performance gap between human users and current state-of-the-art agents, highlighting areas for future improvement in AI web navigation and information retrieval capabilities.

About the Speaker

Yixiao Song is a Ph.D. candidate in Computer Science at the University of Massachusetts Amherst. Her research focuses on enhancing the evaluation of natural language processing systems, particularly in assessing factuality and reliability in AI-generated content. Her work encompasses the development of tools and benchmarks such as VeriScore, an automatic metric for evaluating the factuality of long-form text generation, and BEARCUBS, a benchmark for assessing AI agents' ability to identify factual information from web content.

Visual Agents: What it takes to build an agent that can navigate GUIs like humans

We’ll examine conceptual frameworks, potential applications, and future directions of technologies that can “see” and “act” with increasing independence. The discussion will touch on both current limitations and promising horizons in this evolving field.

About the Speaker

Harpreet Sahota is a hacker-in-residence and machine learning engineer with a passion for deep learning and generative AI. He has a deep interest in RAG, Agents, and Multimodal AI.

Implementing a Practical Vision-Based Android AI Agent

In this talk, I will share practical details of designing and implementing Android AI agents using deki.

We will move from theory to practice, covering the use of these agents in industry and production.

For end users: remote use of Android phones, or automation of standard tasks such as:

  • "Write my friend 'some_name' in WhatsApp that I'll be 15 minutes late"
  • "Open Twitter in the browser and write a post about 'something'"
  • "Read my latest notifications and say if there are any important ones"
  • "Write a linkedin post about 'something'"

And for professionals: agentic testing, a new type of test that only became possible with the popularization of LLMs and of AI agents that use them as a reasoning core.
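To illustrate the general shape of such an agent (not deki's actual interfaces), here is a hedged sketch of an observe-describe-decide-act loop driven through adb; describe_screen and choose_action are hypothetical placeholders for the image description model and the LLM reasoning core.

    import subprocess

    def capture_screen(serial="emulator-5554"):
        """Grab a PNG screenshot from the device via adb."""
        result = subprocess.run(
            ["adb", "-s", serial, "exec-out", "screencap", "-p"],
            check=True, capture_output=True,
        )
        return result.stdout

    def describe_screen(png_bytes):
        """Hypothetical stand-in for an image description model such as deki."""
        return "home screen; WhatsApp icon at (120, 840)"

    def choose_action(task, screen_description):
        """Hypothetical stand-in for the LLM reasoning core; returns an action dict."""
        return {"type": "tap", "x": 120, "y": 840}

    def execute(action, serial="emulator-5554"):
        if action["type"] == "tap":
            subprocess.run(
                ["adb", "-s", serial, "shell", "input", "tap",
                 str(action["x"]), str(action["y"])],
                check=True,
            )

    task = "Write my friend 'some_name' in WhatsApp that I'll be 15 minutes late"
    for _ in range(10):  # bounded observe -> describe -> decide -> act loop
        description = describe_screen(capture_screen())
        action = choose_action(task, description)
        if action["type"] == "done":
            break
        execute(action)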

About the Speaker

Rasul Osmanbayli is a senior Android developer at Kapital Bank in Baku, Azerbaijan, the largest private bank in the country. He created deki, an image description model that served as the foundation for an Android AI agent that achieved strong results on two benchmarks: Android World and Android Control.

He previously worked for various companies in Istanbul, Türkiye, as an Android and backend developer. He is also an MS student at Istanbul Aydin University in Istanbul, Türkiye.

August 7 - Understanding Visual Agents

Join us for a virtual event to hear talks from experts on the current state of visual agents.

When

Aug 7, 2025 at 9 AM Pacific

Where

Virtual. Register for the Zoom.

Foundational capabilities and models for generalist agents for computers

As we move toward a future where language agents can operate software, browse the web, and automate tasks across digital environments, a pressing challenge emerges: how do we build foundational models that can act as generalist agents for computers? In this talk, we explore the design of such agents—ones that combine vision, language, and action to understand complex interfaces and carry out user-intent accurately.

We present OmniACT as a case study, a benchmark that grounds this vision by pairing natural language prompts with UI screenshots and executable scripts for both desktop and web environments. Through OmniACT, we examine the performance of today’s top language and multimodal models, highlight the limitations in current agent behavior, and discuss research directions needed to close the gap toward truly capable, general-purpose digital agents.

About the Speaker

Raghav Kapoor is a machine learning at Adobe, where he works on the Brand Services team, contributing to cutting-edge projects in brand intelligence. His work blends research with machine learning, reflecting his deep expertise in both areas. Prior to joining Adobe, Raghav earned his Master’s degree from Carnegie Mellon University, where his research focused on multimodal machine learning and web-based agents. He also brings industry experience from his experience as a strategist at Goldman Sachs India.

BEARCUBS: Evaluating Web Agents' Real-World Information-Seeking Abilities

The talk focuses on the challenges of evaluating AI agents in dynamic web settings, the design and implementation of the BEARCUBS benchmark, and insights gained from human and agent performance comparisons. In the talk, we will discuss the significant performance gap between human users and current state-of-the-art agents, highlighting areas for future improvement in AI web navigation and information retrieval capabilities.

About the Speaker

Yixiao Song is a Ph.D. candidate in Computer Science at the University of Massachusetts Amherst. Her research focuses on enhancing the evaluation of natural language processing systems, particularly in assessing factuality and reliability in AI-generated content. Her work encompasses the development of tools and benchmarks such as VeriScore, an automatic metric for evaluating the factuality of long-form text generation, and BEARCUBS, a benchmark for assessing AI agents' ability to identify factual information from web content.

Visual Agents: What it takes to build an agent that can navigate GUIs like humans

We’ll examine conceptual frameworks, potential applications, and future directions of technologies that can “see” and “act” with increasing independence. The discussion will touch on both current limitations and promising horizons in this evolving field.

About the Speaker

Harpreet Sahota is a hacker-in-residence and machine learning engineer with a passion for deep learning and generative AI. He’s got a deep interest in RAG, Agents, and Multimodal AI.

Implementing a Practical Vision-Based Android AI Agent

In this talk I will share with you practical details of designing and implementing Android AI agents, using deki.

From theory we will move to practice and the usage of these agents in industry/production.

For end users - remote usage of Android phones or for automation of standard tasks. Such as:

  • "Write my friend 'some_name' in WhatsApp that I'll be 15 minutes late"
  • "Open Twitter in the browser and write a post about 'something'"
  • "Read my latest notifications and say if there are any important ones"
  • "Write a linkedin post about 'something'"

And for professionals - to enable agentic testing, a new type of test that only became possible because of the popularization of LLMs and AI agents that use them as a reasoning core.

About the Speaker

Rasul Osmanbayli is a senior Android developer at Kapital Bank, Baku/Azerbaijan. It is the largest private bank in Azerbaijan. He created deki, an Image Description model that was used as a foundation for an Android AI agent that achieved high results on 2 different benchmarks: Android World and Android Control.

He previously worked in Istanbul/Türkiye for various companies as an Android and Backend developer. He is also a MS at Istanbul Aydin University in Istanbul/Türkiye.

August 7 - Understanding Visual Agents

Join us for a virtual event to hear talks from experts on the current state of visual agents.

When

Aug 7, 2025 at 9 AM Pacific

Where

Virtual. Register for the Zoom.

Foundational capabilities and models for generalist agents for computers

As we move toward a future where language agents can operate software, browse the web, and automate tasks across digital environments, a pressing challenge emerges: how do we build foundational models that can act as generalist agents for computers? In this talk, we explore the design of such agents—ones that combine vision, language, and action to understand complex interfaces and carry out user-intent accurately.

We present OmniACT as a case study, a benchmark that grounds this vision by pairing natural language prompts with UI screenshots and executable scripts for both desktop and web environments. Through OmniACT, we examine the performance of today’s top language and multimodal models, highlight the limitations in current agent behavior, and discuss research directions needed to close the gap toward truly capable, general-purpose digital agents.

About the Speaker

Raghav Kapoor is a machine learning at Adobe, where he works on the Brand Services team, contributing to cutting-edge projects in brand intelligence. His work blends research with machine learning, reflecting his deep expertise in both areas. Prior to joining Adobe, Raghav earned his Master’s degree from Carnegie Mellon University, where his research focused on multimodal machine learning and web-based agents. He also brings industry experience from his experience as a strategist at Goldman Sachs India.

BEARCUBS: Evaluating Web Agents' Real-World Information-Seeking Abilities

The talk focuses on the challenges of evaluating AI agents in dynamic web settings, the design and implementation of the BEARCUBS benchmark, and insights gained from human and agent performance comparisons. In the talk, we will discuss the significant performance gap between human users and current state-of-the-art agents, highlighting areas for future improvement in AI web navigation and information retrieval capabilities.

About the Speaker

Yixiao Song is a Ph.D. candidate in Computer Science at the University of Massachusetts Amherst. Her research focuses on enhancing the evaluation of natural language processing systems, particularly in assessing factuality and reliability in AI-generated content. Her work encompasses the development of tools and benchmarks such as VeriScore, an automatic metric for evaluating the factuality of long-form text generation, and BEARCUBS, a benchmark for assessing AI agents' ability to identify factual information from web content.

Visual Agents: What it takes to build an agent that can navigate GUIs like humans

We’ll examine conceptual frameworks, potential applications, and future directions of technologies that can “see” and “act” with increasing independence. The discussion will touch on both current limitations and promising horizons in this evolving field.

About the Speaker

Harpreet Sahota is a hacker-in-residence and machine learning engineer with a passion for deep learning and generative AI. He’s got a deep interest in RAG, Agents, and Multimodal AI.

Implementing a Practical Vision-Based Android AI Agent

In this talk I will share with you practical details of designing and implementing Android AI agents, using deki.

From theory we will move to practice and the usage of these agents in industry/production.

For end users - remote usage of Android phones or for automation of standard tasks. Such as:

  • "Write my friend 'some_name' in WhatsApp that I'll be 15 minutes late"
  • "Open Twitter in the browser and write a post about 'something'"
  • "Read my latest notifications and say if there are any important ones"
  • "Write a linkedin post about 'something'"

And for professionals - to enable agentic testing, a new type of test that only became possible because of the popularization of LLMs and AI agents that use them as a reasoning core.

About the Speaker

Rasul Osmanbayli is a senior Android developer at Kapital Bank, Baku/Azerbaijan. It is the largest private bank in Azerbaijan. He created deki, an Image Description model that was used as a foundation for an Android AI agent that achieved high results on 2 different benchmarks: Android World and Android Control.

He previously worked in Istanbul/Türkiye for various companies as an Android and Backend developer. He is also a MS at Istanbul Aydin University in Istanbul/Türkiye.

August 7 - Understanding Visual Agents

Join us for a virtual event to hear talks from experts on the current state of visual agents.

When

Aug 7, 2025 at 9 AM Pacific

Where

Virtual. Register for the Zoom.

Foundational capabilities and models for generalist agents for computers

As we move toward a future where language agents can operate software, browse the web, and automate tasks across digital environments, a pressing challenge emerges: how do we build foundational models that can act as generalist agents for computers? In this talk, we explore the design of such agents—ones that combine vision, language, and action to understand complex interfaces and carry out user-intent accurately.

We present OmniACT as a case study, a benchmark that grounds this vision by pairing natural language prompts with UI screenshots and executable scripts for both desktop and web environments. Through OmniACT, we examine the performance of today’s top language and multimodal models, highlight the limitations in current agent behavior, and discuss research directions needed to close the gap toward truly capable, general-purpose digital agents.

About the Speaker

Raghav Kapoor is a machine learning at Adobe, where he works on the Brand Services team, contributing to cutting-edge projects in brand intelligence. His work blends research with machine learning, reflecting his deep expertise in both areas. Prior to joining Adobe, Raghav earned his Master’s degree from Carnegie Mellon University, where his research focused on multimodal machine learning and web-based agents. He also brings industry experience from his experience as a strategist at Goldman Sachs India.

BEARCUBS: Evaluating Web Agents' Real-World Information-Seeking Abilities

The talk focuses on the challenges of evaluating AI agents in dynamic web settings, the design and implementation of the BEARCUBS benchmark, and insights gained from human and agent performance comparisons. In the talk, we will discuss the significant performance gap between human users and current state-of-the-art agents, highlighting areas for future improvement in AI web navigation and information retrieval capabilities.

About the Speaker

Yixiao Song is a Ph.D. candidate in Computer Science at the University of Massachusetts Amherst. Her research focuses on enhancing the evaluation of natural language processing systems, particularly in assessing factuality and reliability in AI-generated content. Her work encompasses the development of tools and benchmarks such as VeriScore, an automatic metric for evaluating the factuality of long-form text generation, and BEARCUBS, a benchmark for assessing AI agents' ability to identify factual information from web content.

Visual Agents: What it takes to build an agent that can navigate GUIs like humans

We’ll examine conceptual frameworks, potential applications, and future directions of technologies that can “see” and “act” with increasing independence. The discussion will touch on both current limitations and promising horizons in this evolving field.

About the Speaker

Harpreet Sahota is a hacker-in-residence and machine learning engineer with a passion for deep learning and generative AI. He’s got a deep interest in RAG, Agents, and Multimodal AI.

Implementing a Practical Vision-Based Android AI Agent

In this talk I will share with you practical details of designing and implementing Android AI agents, using deki.

From theory we will move to practice and the usage of these agents in industry/production.

For end users - remote usage of Android phones or for automation of standard tasks. Such as:

  • "Write my friend 'some_name' in WhatsApp that I'll be 15 minutes late"
  • "Open Twitter in the browser and write a post about 'something'"
  • "Read my latest notifications and say if there are any important ones"
  • "Write a linkedin post about 'something'"

And for professionals - to enable agentic testing, a new type of test that only became possible because of the popularization of LLMs and AI agents that use them as a reasoning core.

About the Speaker

Rasul Osmanbayli is a senior Android developer at Kapital Bank, Baku/Azerbaijan. It is the largest private bank in Azerbaijan. He created deki, an Image Description model that was used as a foundation for an Android AI agent that achieved high results on 2 different benchmarks: Android World and Android Control.

He previously worked in Istanbul/Türkiye for various companies as an Android and Backend developer. He is also a MS at Istanbul Aydin University in Istanbul/Türkiye.

August 7 - Understanding Visual Agents

Join us for a virtual event to hear talks from experts on the current state of visual agents.

When

Aug 7, 2025 at 9 AM Pacific

Where

Virtual. Register for the Zoom.

Foundational capabilities and models for generalist agents for computers

As we move toward a future where language agents can operate software, browse the web, and automate tasks across digital environments, a pressing challenge emerges: how do we build foundational models that can act as generalist agents for computers? In this talk, we explore the design of such agents—ones that combine vision, language, and action to understand complex interfaces and carry out user-intent accurately.

We present OmniACT as a case study, a benchmark that grounds this vision by pairing natural language prompts with UI screenshots and executable scripts for both desktop and web environments. Through OmniACT, we examine the performance of today’s top language and multimodal models, highlight the limitations in current agent behavior, and discuss research directions needed to close the gap toward truly capable, general-purpose digital agents.

About the Speaker

Raghav Kapoor is a machine learning at Adobe, where he works on the Brand Services team, contributing to cutting-edge projects in brand intelligence. His work blends research with machine learning, reflecting his deep expertise in both areas. Prior to joining Adobe, Raghav earned his Master’s degree from Carnegie Mellon University, where his research focused on multimodal machine learning and web-based agents. He also brings industry experience from his experience as a strategist at Goldman Sachs India.

BEARCUBS: Evaluating Web Agents' Real-World Information-Seeking Abilities

The talk focuses on the challenges of evaluating AI agents in dynamic web settings, the design and implementation of the BEARCUBS benchmark, and insights gained from human and agent performance comparisons. In the talk, we will discuss the significant performance gap between human users and current state-of-the-art agents, highlighting areas for future improvement in AI web navigation and information retrieval capabilities.

About the Speaker

Yixiao Song is a Ph.D. candidate in Computer Science at the University of Massachusetts Amherst. Her research focuses on enhancing the evaluation of natural language processing systems, particularly in assessing factuality and reliability in AI-generated content. Her work encompasses the development of tools and benchmarks such as VeriScore, an automatic metric for evaluating the factuality of long-form text generation, and BEARCUBS, a benchmark for assessing AI agents' ability to identify factual information from web content.

Visual Agents: What it takes to build an agent that can navigate GUIs like humans

We’ll examine conceptual frameworks, potential applications, and future directions of technologies that can “see” and “act” with increasing independence. The discussion will touch on both current limitations and promising horizons in this evolving field.

About the Speaker

Harpreet Sahota is a hacker-in-residence and machine learning engineer with a passion for deep learning and generative AI. He’s got a deep interest in RAG, Agents, and Multimodal AI.

Implementing a Practical Vision-Based Android AI Agent

In this talk I will share with you practical details of designing and implementing Android AI agents, using deki.

From theory we will move to practice and the usage of these agents in industry/production.

For end users - remote usage of Android phones or for automation of standard tasks. Such as:

  • "Write my friend 'some_name' in WhatsApp that I'll be 15 minutes late"
  • "Open Twitter in the browser and write a post about 'something'"
  • "Read my latest notifications and say if there are any important ones"
  • "Write a linkedin post about 'something'"

And for professionals - to enable agentic testing, a new type of test that only became possible because of the popularization of LLMs and AI agents that use them as a reasoning core.

About the Speaker

Rasul Osmanbayli is a senior Android developer at Kapital Bank, Baku/Azerbaijan. It is the largest private bank in Azerbaijan. He created deki, an Image Description model that was used as a foundation for an Android AI agent that achieved high results on 2 different benchmarks: Android World and Android Control.

He previously worked in Istanbul/Türkiye for various companies as an Android and Backend developer. He is also a MS at Istanbul Aydin University in Istanbul/Türkiye.

August 7 - Understanding Visual Agents

Join us for a virtual event to hear talks from experts on the current state of visual agents.

When

Aug 7, 2025 at 9 AM Pacific

Where

Virtual. Register for the Zoom.

Foundational capabilities and models for generalist agents for computers

As we move toward a future where language agents can operate software, browse the web, and automate tasks across digital environments, a pressing challenge emerges: how do we build foundational models that can act as generalist agents for computers? In this talk, we explore the design of such agents—ones that combine vision, language, and action to understand complex interfaces and carry out user-intent accurately.

We present OmniACT as a case study, a benchmark that grounds this vision by pairing natural language prompts with UI screenshots and executable scripts for both desktop and web environments. Through OmniACT, we examine the performance of today’s top language and multimodal models, highlight the limitations in current agent behavior, and discuss research directions needed to close the gap toward truly capable, general-purpose digital agents.

About the Speaker

Raghav Kapoor is a machine learning at Adobe, where he works on the Brand Services team, contributing to cutting-edge projects in brand intelligence. His work blends research with machine learning, reflecting his deep expertise in both areas. Prior to joining Adobe, Raghav earned his Master’s degree from Carnegie Mellon University, where his research focused on multimodal machine learning and web-based agents. He also brings industry experience from his experience as a strategist at Goldman Sachs India.

BEARCUBS: Evaluating Web Agents' Real-World Information-Seeking Abilities

The talk focuses on the challenges of evaluating AI agents in dynamic web settings, the design and implementation of the BEARCUBS benchmark, and insights gained from human and agent performance comparisons. In the talk, we will discuss the significant performance gap between human users and current state-of-the-art agents, highlighting areas for future improvement in AI web navigation and information retrieval capabilities.

About the Speaker

Yixiao Song is a Ph.D. candidate in Computer Science at the University of Massachusetts Amherst. Her research focuses on enhancing the evaluation of natural language processing systems, particularly in assessing factuality and reliability in AI-generated content. Her work encompasses the development of tools and benchmarks such as VeriScore, an automatic metric for evaluating the factuality of long-form text generation, and BEARCUBS, a benchmark for assessing AI agents' ability to identify factual information from web content.

Visual Agents: What it takes to build an agent that can navigate GUIs like humans

We’ll examine conceptual frameworks, potential applications, and future directions of technologies that can “see” and “act” with increasing independence. The discussion will touch on both current limitations and promising horizons in this evolving field.

About the Speaker

Harpreet Sahota is a hacker-in-residence and machine learning engineer with a passion for deep learning and generative AI. He’s got a deep interest in RAG, Agents, and Multimodal AI.

Implementing a Practical Vision-Based Android AI Agent

In this talk I will share with you practical details of designing and implementing Android AI agents, using deki.

From theory we will move to practice and the usage of these agents in industry/production.

For end users - remote usage of Android phones or for automation of standard tasks. Such as:

  • "Write my friend 'some_name' in WhatsApp that I'll be 15 minutes late"
  • "Open Twitter in the browser and write a post about 'something'"
  • "Read my latest notifications and say if there are any important ones"
  • "Write a linkedin post about 'something'"

And for professionals - to enable agentic testing, a new type of test that only became possible because of the popularization of LLMs and AI agents that use them as a reasoning core.

About the Speaker

Rasul Osmanbayli is a senior Android developer at Kapital Bank, Baku/Azerbaijan. It is the largest private bank in Azerbaijan. He created deki, an Image Description model that was used as a foundation for an Android AI agent that achieved high results on 2 different benchmarks: Android World and Android Control.

He previously worked in Istanbul/Türkiye for various companies as an Android and Backend developer. He is also a MS at Istanbul Aydin University in Istanbul/Türkiye.

August 7 - Understanding Visual Agents

Join us for a virtual event to hear talks from experts on the current state of visual agents.

When

Aug 7, 2025 at 9 AM Pacific

Where

Virtual. Register for the Zoom.

Foundational capabilities and models for generalist agents for computers

As we move toward a future where language agents can operate software, browse the web, and automate tasks across digital environments, a pressing challenge emerges: how do we build foundational models that can act as generalist agents for computers? In this talk, we explore the design of such agents—ones that combine vision, language, and action to understand complex interfaces and carry out user-intent accurately.

We present OmniACT as a case study, a benchmark that grounds this vision by pairing natural language prompts with UI screenshots and executable scripts for both desktop and web environments. Through OmniACT, we examine the performance of today’s top language and multimodal models, highlight the limitations in current agent behavior, and discuss research directions needed to close the gap toward truly capable, general-purpose digital agents.

About the Speaker

Raghav Kapoor is a machine learning at Adobe, where he works on the Brand Services team, contributing to cutting-edge projects in brand intelligence. His work blends research with machine learning, reflecting his deep expertise in both areas. Prior to joining Adobe, Raghav earned his Master’s degree from Carnegie Mellon University, where his research focused on multimodal machine learning and web-based agents. He also brings industry experience from his experience as a strategist at Goldman Sachs India.

BEARCUBS: Evaluating Web Agents' Real-World Information-Seeking Abilities

The talk focuses on the challenges of evaluating AI agents in dynamic web settings, the design and implementation of the BEARCUBS benchmark, and insights gained from human and agent performance comparisons. In the talk, we will discuss the significant performance gap between human users and current state-of-the-art agents, highlighting areas for future improvement in AI web navigation and information retrieval capabilities.

About the Speaker

Yixiao Song is a Ph.D. candidate in Computer Science at the University of Massachusetts Amherst. Her research focuses on enhancing the evaluation of natural language processing systems, particularly in assessing factuality and reliability in AI-generated content. Her work encompasses the development of tools and benchmarks such as VeriScore, an automatic metric for evaluating the factuality of long-form text generation, and BEARCUBS, a benchmark for assessing AI agents' ability to identify factual information from web content.

Visual Agents: What it takes to build an agent that can navigate GUIs like humans

We’ll examine conceptual frameworks, potential applications, and future directions of technologies that can “see” and “act” with increasing independence. The discussion will touch on both current limitations and promising horizons in this evolving field.

About the Speaker

Harpreet Sahota is a hacker-in-residence and machine learning engineer with a passion for deep learning and generative AI. He has a deep interest in RAG, agents, and multimodal AI.

Implementing a Practical Vision-Based Android AI Agent

In this talk, I will share practical details of designing and implementing Android AI agents using deki.

We will move from theory to practice and look at how these agents are used in industry and production.

For end users, this enables remote usage of Android phones and automation of standard tasks, such as:

  • "Write my friend 'some_name' in WhatsApp that I'll be 15 minutes late"
  • "Open Twitter in the browser and write a post about 'something'"
  • "Read my latest notifications and say if there are any important ones"
  • "Write a linkedin post about 'something'"

And for professionals, it enables agentic testing, a new type of testing that only became possible with the popularization of LLMs and of AI agents that use them as a reasoning core. A minimal agent loop is sketched below.
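For illustration, here is a minimal sketch of such an agent loop, assuming a deki-style screen description model and an LLM that picks the next action. `describe_screen` and `llm_decide` are hypothetical stand-ins passed in by the caller, not deki's actual API.

```python
# Minimal sketch of a vision-based Android agent loop:
# screenshot -> screen description (a deki-like model) -> LLM picks an action
# -> execute via adb. `describe_screen` and `llm_decide` are assumptions.
import subprocess
from typing import Callable

def adb_shell(cmd: str) -> None:
    """Run an adb shell command on the connected device."""
    subprocess.run(["adb", "shell"] + cmd.split(), check=True)

def capture_screenshot(path: str = "screen.png") -> str:
    """Pull the current screen as a PNG via adb."""
    with open(path, "wb") as f:
        subprocess.run(["adb", "exec-out", "screencap", "-p"], stdout=f, check=True)
    return path

def run_agent(task: str,
              describe_screen: Callable[[str], str],   # image -> UI description
              llm_decide: Callable[[str, str], dict],  # (task, description) -> action
              max_steps: int = 20) -> None:
    for _ in range(max_steps):
        description = describe_screen(capture_screenshot())
        action = llm_decide(task, description)  # e.g. {"type": "tap", "x": 120, "y": 540}
        if action["type"] == "done":
            return
        if action["type"] == "tap":
            adb_shell(f"input tap {action['x']} {action['y']}")
        elif action["type"] == "type":
            adb_shell(f"input text {action['text']}")
```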

About the Speaker

Rasul Osmanbayli is a senior Android developer at Kapital Bank in Baku, Azerbaijan, the largest private bank in the country. He created deki, an image description model that served as the foundation for an Android AI agent that achieved strong results on two different benchmarks: Android World and Android Control.

He previously worked for various companies in Istanbul, Türkiye as an Android and backend developer. He is also an MS student at Istanbul Aydin University in Istanbul, Türkiye.

August 7 - Understanding Visual Agents
Steve Lucas – CEO @ Boomi, Richie – host @ DataCamp

The relationship between humans and AI in the workplace is rapidly evolving beyond simple automation. As companies deploy thousands of AI agents to handle everything from expense approvals to customer success management, a new paradigm is emerging, one where humans become orchestrators rather than operators. But how do you determine which processes should be handled by AI and which require human judgment? What governance structures need to be in place before deploying AI at scale? With the potential to automate up to 80% of business processes, organizations must carefully consider not just the technology, but the human element of AI-driven transformation.

Steve Lucas is the Chairman and CEO of Boomi, marking his third tenure as CEO. With nearly 30 years of enterprise software leadership, he has held senior roles at leading cloud organizations including Marketo, iCIMS, Adobe, SAP, Salesforce, and BusinessObjects. He led Marketo through its multi-billion-dollar acquisition by Adobe and drove strategic growth at iCIMS, delivering significant investments and transformation. A proven leader in scaling software companies, Steve is also the author of the national bestseller Digital Impact and holds a business degree from the University of Colorado.

In the episode, Richie and Steve explore the importance of choosing the right tech stack for your business, the challenges of managing complex systems, the role of AI in transforming business processes, and the need for effective AI governance. They also discuss the future of AI-driven enterprises and much more.

Links Mentioned in the Show:

  • Boomi
  • Steve’s Book - Digital Impact: The Human Element of AI-Driven Transformation
  • What is the OSI Model?
  • Connect with Steve
  • Skill Track: AI Business Fundamentals
  • Related Episode: New Models for Digital Transformation with Alison McCauley, Chief Advocacy Officer at Think with AI & Founder of Unblocked Future
  • Rewatch RADAR AI

AI/ML Cloud Computing SAP
DataFramed

This is a virtual event.

Register for the Zoom

Towards a Multimodal AI Agent that Can See, Talk and Act

The development of multimodal AI agents marks a pivotal step toward creating systems capable of understanding, reasoning, and interacting with the world in human-like ways. Building such agents requires models that not only comprehend multi-sensory observations but also act adaptively to achieve goals within their environments. In this talk, I will present my research journey toward this grand goal across three key dimensions.

First, I will explore how to bridge the gap between core vision understanding and multimodal learning through unified frameworks at various granularities. Next, I will discuss connecting vision-language models with large language models (LLMs) to create intelligent conversational systems. Finally, I will delve into recent advancements that extend multimodal LLMs into vision-language-action models, forming the foundation for general-purpose robotics policies. To conclude, I will highlight ongoing efforts to develop agentic systems that integrate perception with action, enabling them to not only understand observations but also take meaningful actions in a single system.

Together, these lead to an aspiration of building the next generation of multimodal AI agents capable of seeing, talking, and acting across diverse scenarios in both digital and physical worlds.

About the Speaker

Jianwei Yang is a Principal Researcher at Microsoft Research (MSR), Redmond. His research focuses on the intersection of vision and multimodal learning, with an emphasis on bridging core vision tasks with language, building general-purpose and promptable multimodal models, and enabling these models to take meaningful actions in both virtual and physical environments.

ConceptAttention: Interpreting the Representations of Diffusion Transformers

Recently, diffusion transformers have taken over as the state-of-the-art model class for both image and video generation. However, similar to many existing deep learning architectures, their high-dimensional hidden representations are difficult to understand and interpret. This lack of interpretability is a barrier to their controllability and safe deployment.

We introduce ConceptAttention, an approach to interpreting the representations of diffusion transformers. Our method allows users to create rich saliency maps depicting the location and intensity of textual concepts. Our approach exposes how a diffusion model “sees” a generated image and notably requires no additional training. ConceptAttention improves upon widely used approaches like cross-attention maps for isolating the location of visual concepts, and it even generalizes to real-world (not just generated) images and to video generation models!

Our work serves to improve the community’s understanding of how diffusion models represent data and has numerous potential applications, like image editing.
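As a rough illustration of the general idea (a generic attention-style similarity map, not the authors' algorithm), the sketch below scores image patch features against a single concept embedding and reshapes the scores into a spatial saliency map.

```python
# Conceptual sketch: score each image patch feature against a concept
# embedding and reshape the scores into a spatial saliency map.
# Illustrative only; not the ConceptAttention implementation.
import torch
import torch.nn.functional as F

def concept_saliency(patch_feats: torch.Tensor,  # (num_patches, dim), e.g. from a DiT layer
                     concept_emb: torch.Tensor,  # (dim,) embedding of one textual concept
                     grid: int) -> torch.Tensor: # patches assumed to form a grid x grid layout
    scores = patch_feats @ concept_emb              # dot-product similarity per patch
    saliency = scores.softmax(dim=0).reshape(grid, grid)
    # Upsample the coarse patch grid for visualization over the image.
    return F.interpolate(saliency[None, None], scale_factor=16, mode="bilinear")[0, 0]
```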

About the Speaker

Alec Helbling is a PhD student at Georgia Tech. His research focuses on improving the interpretability and controllability of generative models, particularly for image generation. His work is application-focused, and he has interned at a variety of industrial research labs, including Adobe Firefly, IBM Research, and the NASA Jet Propulsion Laboratory. He also has a passion for creating explanatory videos about interesting machine learning and mathematical concepts.

RelationField: Relate Anything in Radiance Fields

Neural radiance fields have recently emerged as a 3D scene representation and have been extended by distilling open-vocabulary features from vision-language models. Current methods focus on object-centric tasks, leaving semantic relationships largely unexplored. We propose RelationField, the first method to extract inter-object relationships directly from neural radiance fields, using pairs of rays for implicit relationship queries. RelationField distills relationship knowledge from multimodal LLMs. Evaluated on open-vocabulary 3D scene graph generation and relationship-guided instance segmentation, RelationField achieves state-of-the-art performance.
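The interface implied by the abstract might look roughly like the toy sketch below: a field maps a pair of rays to a relationship embedding, which is then matched against open-vocabulary relationship prompts. This is an assumed interface for illustration only, not RelationField's implementation.

```python
# Toy sketch of a ray-pair relationship query. The module, dimensions, and
# prompt matching are assumptions made for illustration.
import torch
from torch import nn
import torch.nn.functional as F

class RelationFieldSketch(nn.Module):
    def __init__(self, ray_dim: int = 6, emb_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * ray_dim, 256), nn.ReLU(),
                                 nn.Linear(256, emb_dim))

    def forward(self, ray_subject: torch.Tensor, ray_object: torch.Tensor) -> torch.Tensor:
        # A relationship query is a pair of rays (subject, object).
        return self.mlp(torch.cat([ray_subject, ray_object], dim=-1))

def top_relationship(field, ray_s, ray_o, prompt_embs: torch.Tensor, prompts: list[str]) -> str:
    """Pick the open-vocabulary relationship prompt that best matches the query."""
    q = F.normalize(field(ray_s, ray_o), dim=-1)
    sims = q @ F.normalize(prompt_embs, dim=-1).T
    return prompts[int(sims.argmax())]
```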

About the Speaker

Sebastian Koch is a PhD student at Ulm University and the Bosch Center for Artificial Intelligence. He is supervised by Timo Ropinski from Ulm University. His main research interest lies at the intersection of computer vision and robotics. The goal of his PhD is to develop 3D scene representations of the real world that are valuable for robots to navigate and solve tasks within their environment.

RGB-X Model Development: Exploring Four Channel ML Workflows

Machine learning is rapidly becoming multimodal. With many computer vision models expanding into adjacent areas such as 3D, one area that has also quietly been advancing rapidly is RGB-X data: an extra channel such as infrared, depth, or normals alongside RGB. In this talk we will cover some of the leading models in this exploding field of Visual AI and show some best practices for working with these complex data formats!
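One common RGB-X pattern (an illustrative sketch, not necessarily what the talk covers) is to stack an extra channel such as depth onto RGB and widen a pretrained backbone's first convolution so it accepts four input channels instead of three.

```python
# Illustrative four-channel (RGB-X) workflow: concatenate depth onto RGB and
# inflate a pretrained ResNet's first conv layer to 4 input channels.
import torch
import torch.nn as nn
from torchvision import models

def make_rgbx_backbone() -> nn.Module:
    model = models.resnet50(weights="IMAGENET1K_V2")
    old = model.conv1  # Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
    new = nn.Conv2d(4, old.out_channels, old.kernel_size,
                    stride=old.stride, padding=old.padding, bias=False)
    with torch.no_grad():
        new.weight[:, :3] = old.weight                              # keep pretrained RGB filters
        new.weight[:, 3:] = old.weight.mean(dim=1, keepdim=True)    # init X channel from their mean
    model.conv1 = new
    return model

rgb = torch.rand(1, 3, 224, 224)
depth = torch.rand(1, 1, 224, 224)
features = make_rgbx_backbone()(torch.cat([rgb, depth], dim=1))
```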

About the Speaker

Daniel Gural is a seasoned Machine Learning Evangelist with a strong passion for empowering data scientists and ML engineers to unlock the full potential of their data. At Voxel51, he takes a leading role in bridging the gap between practitioners and the tools they need to achieve exceptional outcomes. His extensive experience teaching and developing in the ML field has fueled his commitment to democratizing high-quality AI workflows for a wider audience.

April 24, 2025 - AI, Machine Learning and Computer Vision Meetup
