talk-data.com
Activities & events
August 7 - Understanding Visual Agents
2025-08-07 · 16:00

Join us for a virtual event to hear talks from experts on the current state of visual agents.

When: Aug 7, 2025 at 9 AM Pacific
Where: Virtual. Register for the Zoom.

Foundational capabilities and models for generalist agents for computers

As we move toward a future where language agents can operate software, browse the web, and automate tasks across digital environments, a pressing challenge emerges: how do we build foundational models that can act as generalist agents for computers? In this talk, we explore the design of such agents—ones that combine vision, language, and action to understand complex interfaces and carry out user intent accurately. We present OmniACT as a case study: a benchmark that grounds this vision by pairing natural language prompts with UI screenshots and executable scripts for both desktop and web environments. Through OmniACT, we examine the performance of today’s top language and multimodal models, highlight the limitations of current agent behavior, and discuss the research directions needed to close the gap toward truly capable, general-purpose digital agents.

About the Speaker: Raghav Kapoor is a machine learning engineer at Adobe, where he works on the Brand Services team, contributing to cutting-edge projects in brand intelligence. His work blends research with applied machine learning, reflecting his deep expertise in both areas. Prior to joining Adobe, Raghav earned his Master’s degree from Carnegie Mellon University, where his research focused on multimodal machine learning and web-based agents. He also brings industry experience from his time as a strategist at Goldman Sachs India.

BEARCUBS: Evaluating Web Agents' Real-World Information-Seeking Abilities

This talk focuses on the challenges of evaluating AI agents in dynamic web settings, the design and implementation of the BEARCUBS benchmark, and insights gained from comparing human and agent performance. We will discuss the significant performance gap between human users and current state-of-the-art agents, highlighting areas for future improvement in AI web navigation and information retrieval capabilities.

About the Speaker: Yixiao Song is a Ph.D. candidate in Computer Science at the University of Massachusetts Amherst. Her research focuses on enhancing the evaluation of natural language processing systems, particularly in assessing the factuality and reliability of AI-generated content. Her work includes tools and benchmarks such as VeriScore, an automatic metric for evaluating the factuality of long-form text generation, and BEARCUBS, a benchmark for assessing AI agents' ability to identify factual information from web content.

Visual Agents: What it takes to build an agent that can navigate GUIs like humans

We’ll examine conceptual frameworks, potential applications, and future directions of technologies that can “see” and “act” with increasing independence. The discussion will touch on both current limitations and promising horizons in this evolving field.

About the Speaker: Harpreet Sahota is a hacker-in-residence and machine learning engineer with a passion for deep learning and generative AI, and a deep interest in RAG, agents, and multimodal AI.

Implementing a Practical Vision-Based Android AI Agent

In this talk I will share practical details of designing and implementing Android AI agents using deki. From theory we will move to practice and the use of these agents in industry and production.

For end users - remote usage of Android phones or automation of standard tasks, such as:

And for professionals - agentic testing, a new type of test that only became possible because of the popularization of LLMs and the AI agents that use them as a reasoning core.

About the Speaker: Rasul Osmanbayli is a senior Android developer at Kapital Bank (Baku, Azerbaijan), the largest private bank in Azerbaijan. He created deki, an image description model used as the foundation for an Android AI agent that achieved strong results on two different benchmarks: Android World and Android Control. He previously worked in Istanbul, Türkiye as an Android and backend developer for various companies, and is also an MS student at Istanbul Aydin University in Istanbul, Türkiye.
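The talks above all revolve around the same perceive-reason-act pattern: a visual model reads the screen, a language model decides the next step, and an executor carries it out. The Python sketch below is purely illustrative of that loop; it is not the OmniACT or deki implementation, and every helper (capture_screenshot, describe_ui, choose_action, execute) is a hypothetical placeholder.

```python
# Minimal sketch of a screenshot -> reason -> act loop for a vision-based GUI agent.
# All helpers are hypothetical placeholders, not the deki or OmniACT APIs.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # e.g. "tap", "type", "done"
    target: str = ""   # description of the UI element to act on
    text: str = ""     # text to type, if any

def capture_screenshot() -> bytes:
    """Placeholder: grab the current screen (e.g. via adb or a desktop API)."""
    return b""

def describe_ui(screenshot: bytes) -> str:
    """Placeholder: an image-description model summarizes visible UI elements."""
    return "Home screen: [Settings icon], [Chrome icon], [Search bar]"

def choose_action(goal: str, ui_description: str, history: list[str]) -> Action:
    """Placeholder: an LLM picks the next action from the goal and UI description."""
    return Action(kind="done")

def execute(action: Action) -> None:
    """Placeholder: translate the chosen action into a real tap or keystroke."""
    print(f"executing {action.kind} on '{action.target}'")

def run_agent(goal: str, max_steps: int = 10) -> None:
    history: list[str] = []
    for _ in range(max_steps):
        shot = capture_screenshot()
        ui = describe_ui(shot)
        action = choose_action(goal, ui, history)
        if action.kind == "done":
            break
        execute(action)
        history.append(f"{action.kind}:{action.target}")

run_agent("Open Settings and enable dark mode")
```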
Industry Roundup #5: AI Agents Hype vs. Reality, Meta’s $15B Stake in Scale AI, and the First Fully AI-Generated NBA Ad
2025-07-03 · 10:00 · DataFramed
Guest: Martijn, COO @ DataCamp

Welcome to DataFramed Industry Roundups! In this series of episodes, we sit down to discuss the latest and greatest in data & AI. In this episode, with special guest DataCamp COO Martijn, we touch on the hype and reality of AI agents in business, the McKinsey vs. Ethan Mollick debate on simple vs. complex agents, Meta's $15B stake in Scale AI and what it means for data and talent, Apple’s rumored $20B bid for Perplexity amid its AI struggles, the EU’s push to treat AI skills like reading and math, the first fully AI-generated NBA ad and what it means for creative industries, a new benchmark for deep research tools, and much more.

Links mentioned in the show:
- Meta bought Scale AI
- Apple rumoured to be trying to acquire Perplexity for $20Bn
- McKinsey's Seizing the Agentic AI Advantage report
- The first fully AI-generated NBA ad
- EU Generative AI Outlook report
- Mary Meeker's Trends in AI report
- Deep research benchmark
- Rewatch RADAR AI
- New to DataCamp? Learn on the go using the DataCamp mobile app
- Empower your business with world-class data and AI skills with DataCamp for business
Agentic AI: Using Agents for Deep Research
2025-05-29 · 23:00

Artificial intelligence is rapidly reshaping how we engage with information and knowledge. Recent advancements in AI, such as OpenAI’s Deep Research and Google Gemini, have sparked widespread excitement—and uncertainty—by promising to accomplish in minutes what traditionally takes weeks or months of human effort. With premium AI tools often priced at hundreds of dollars per month, many AI enthusiasts and professionals are exploring powerful, affordable alternatives in open-source ecosystems. But beyond accessibility, critical questions remain: Can AI genuinely match or exceed human judgment, intuition, and nuanced decision-making? How can individuals across all fields leverage AI to enhance their workflows and productivity rather than compete with these intelligent systems?

Join Dr. Daniel Barulli in an engaging event designed to demystify and illuminate the rapidly evolving AI landscape. Dr. Barulli will showcase cutting-edge tools and illustrate the transformative concept of agentic architecture, which amplifies the capabilities of AI models through strategic task automation and advanced reasoning techniques. In this interactive session, you will:

Whether you're new to AI, a seasoned professional, or simply curious about the future of technology, this talk promises valuable insights and practical strategies for harnessing AI to elevate your capabilities.
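For readers wondering what "agentic architecture" looks like in practice, here is a minimal, hypothetical Python sketch of the plan-gather-synthesize loop behind deep-research agents. The llm() and web_search() helpers are stand-in stubs, not the API of OpenAI's Deep Research, Gemini, or any specific open-source tool.

```python
# Illustrative sketch of an agentic "deep research" loop: plan, gather, synthesize.
# llm() and web_search() are hypothetical stubs, not any vendor's actual API.

def llm(prompt: str) -> str:
    """Placeholder for a call to a language model."""
    return ""

def web_search(query: str, k: int = 5) -> list[str]:
    """Placeholder for a search tool returning text snippets."""
    return []

def deep_research(question: str, max_rounds: int = 3) -> str:
    notes: list[str] = []
    for _ in range(max_rounds):
        # 1. Plan: ask the model what to look up next, given the notes so far.
        query = llm(f"Question: {question}\nNotes: {notes}\nNext search query:")
        # 2. Gather: run the search tool and keep the evidence.
        notes.extend(web_search(query))
        # 3. Decide whether the evidence already answers the question.
        if llm(f"Can the question be answered from these notes? {notes}").lower().startswith("yes"):
            break
    # 4. Synthesize a sourced answer from the accumulated notes.
    return llm(f"Write a sourced answer to '{question}' using only: {notes}")

print(deep_research("How do open-source deep-research agents compare on cost?"))
```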
April 24, 2025 - AI, Machine Learning and Computer Vision Meetup
2025-04-24 · 17:00

This is a virtual event.

Towards a Multimodal AI Agent that Can See, Talk and Act

The development of multimodal AI agents marks a pivotal step toward creating systems capable of understanding, reasoning, and interacting with the world in human-like ways. Building such agents requires models that not only comprehend multi-sensory observations but also act adaptively to achieve goals within their environments. In this talk, I will present my research journey toward this grand goal across three key dimensions. First, I will explore how to bridge the gap between core vision understanding and multimodal learning through unified frameworks at various granularities. Next, I will discuss connecting vision-language models with large language models (LLMs) to create intelligent conversational systems. Finally, I will delve into recent advancements that extend multimodal LLMs into vision-language-action models, forming the foundation for general-purpose robotics policies. To conclude, I will highlight ongoing efforts to develop agentic systems that integrate perception with action, enabling them not only to understand observations but also to take meaningful actions in a single system. Together, these efforts point toward the next generation of multimodal AI agents capable of seeing, talking, and acting across diverse scenarios in both digital and physical worlds.

About the Speaker: Jianwei Yang is a Principal Researcher at Microsoft Research (MSR), Redmond. His research focuses on the intersection of vision and multimodal learning, with an emphasis on bridging core vision tasks with language, building general-purpose and promptable multimodal models, and enabling these models to take meaningful actions in both virtual and physical environments.

ConceptAttention: Interpreting the Representations of Diffusion Transformers

Diffusion transformers have recently taken over as the state-of-the-art model class for both image and video generation. However, like many deep learning architectures, their high-dimensional hidden representations are difficult to understand and interpret, and this lack of interpretability is a barrier to their controllability and safe deployment. We introduce ConceptAttention, an approach to interpreting the representations of diffusion transformers. Our method allows users to create rich saliency maps depicting the location and intensity of textual concepts. It exposes how a diffusion model “sees” a generated image and notably requires no additional training. ConceptAttention improves upon widely used approaches such as cross-attention maps for isolating the location of visual concepts, and even generalizes to real-world (not just generated) images and to video generation models. This work improves the community’s understanding of how diffusion models represent data and has numerous potential applications, such as image editing.

About the Speaker: Alec Helbling is a PhD student at Georgia Tech. His research focuses on improving the interpretability and controllability of generative models, particularly for image generation. His work is application focused, and he has interned at a variety of industrial research labs, including Adobe Firefly, IBM Research, and NASA Jet Propulsion Lab. He also has a passion for creating explanatory videos about machine learning and mathematical concepts.

RelationField: Relate Anything in Radiance Fields

Neural radiance fields recently emerged as a 3D scene representation, extended by distilling open-vocabulary features from vision-language models. Current methods focus on object-centric tasks, leaving semantic relationships largely unexplored. We propose RelationField, the first method to extract inter-object relationships directly from neural radiance fields, using pairs of rays for implicit relationship queries. RelationField distills relationship knowledge from multimodal LLMs. Evaluated on open-vocabulary 3D scene graph generation and relationship-guided instance segmentation, RelationField achieves state-of-the-art performance.

About the Speaker: Sebastian Koch is a PhD student at Ulm University and the Bosch Center for Artificial Intelligence, supervised by Timo Ropinski from Ulm University. His main research interest lies at the intersection of computer vision and robotics; the goal of his PhD is to develop 3D scene representations of the real world that help robots navigate and solve tasks within their environment.

RGB-X Model Development: Exploring Four Channel ML Workflows

Machine learning is rapidly becoming multimodal. As computer vision models expand into multimodal and 3D settings, one area that has also quietly been advancing rapidly is RGB-X data, such as infrared, depth, or normals. In this talk we will cover some of the leading models in this exploding field of Visual AI and show some best practices for working with these complex data formats.

About the Speaker: Daniel Gural is a seasoned Machine Learning Evangelist with a strong passion for empowering data scientists and ML engineers to unlock the full potential of their data. Currently serving as a member of the Voxel51 team, he takes a leading role in bridging the gap between practitioners and the tools they need, enabling them to achieve exceptional outcomes. Daniel’s extensive experience in teaching and developing within the ML field has fueled his commitment to democratizing high-quality AI workflows for a wider audience.
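As a generic illustration of the four-channel workflows mentioned in the RGB-X talk (not the speaker's actual code), the sketch below stacks an RGB image with an aligned depth map into a 4-channel tensor and widens a standard convolutional stem to accept it, initializing the extra channel from the mean of the RGB weights, a common heuristic.

```python
# Generic sketch: build a 4-channel RGB-D input and adapt a conv stem to accept it.
# Illustrative only; not the workflow presented in the talk.
import torch
import torch.nn as nn

rgb = torch.rand(1, 3, 224, 224)    # RGB image, values in [0, 1]
depth = torch.rand(1, 1, 224, 224)  # aligned depth map, normalized to [0, 1]
rgbx = torch.cat([rgb, depth], dim=1)  # shape: (1, 4, 224, 224)

# A standard 3-channel stem, e.g. the first layer of a ResNet-style backbone.
stem = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)

# Widen it to 4 input channels, reusing the RGB weights and initializing the
# extra channel with their mean across the input-channel dimension.
stem4 = nn.Conv2d(4, 64, kernel_size=7, stride=2, padding=3, bias=False)
with torch.no_grad():
    stem4.weight[:, :3] = stem.weight
    stem4.weight[:, 3:] = stem.weight.mean(dim=1, keepdim=True)

features = stem4(rgbx)
print(features.shape)  # torch.Size([1, 64, 112, 112])
```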
April 24, 2025 - AI, Machine Learning and Computer Vision Meetup
|
|
April 24, 2025 - AI, Machine Learning and Computer Vision Meetup
2025-04-24 · 17:00
This is a virtual event. Towards a Multimodal AI Agent that Can See, Talk and Act The development of multimodal AI agents marks a pivotal step toward creating systems capable of understanding, reasoning, and interacting with the world in human-like ways. Building such agents requires models that not only comprehend multi-sensory observations but also act adaptively to achieve goals within their environments. In this talk, I will present my research journey toward this grand goal across three key dimensions. First, I will explore how to bridge the gap between core vision understanding and multimodal learning through unified frameworks at various granularities. Next, I will discuss connecting vision-language models with large language models (LLMs) to create intelligent conversational systems. Finally, I will delve into recent advancements that extend multimodal LLMs into vision-language-action models, forming the foundation for general-purpose robotics policies. To conclude, I will highlight ongoing efforts to develop agentic systems that integrate perception with action, enabling them to not only understand observations but also take meaningful actions in a single system. Together, these lead to an aspiration of building the next generation of multimodal AI agents capable of seeing, talking, and acting across diverse scenarios in both digital and physical worlds. About the Speaker Jianwei Yang is a Principal Researcher at Microsoft Research (MSR), Redmond. His research focuses on the intersection of vision and multimodal learning, with an emphasis on bridging core vision tasks with language, building general-purpose and promptable multimodal models, and enabling these models to take meaningful actions in both virtual and physical environments. ConceptAttention: Interpreting the Representations of Diffusion Transformers Recently, diffusion transformers have taken over as the state-of-the-art model class for both image and video generation. However, similar to many existing deep learning architectures, their high-dimensional hidden representations are difficult to understand and interpret. This lack of interpretability is a barrier to their controllability and safe deployment. We introduce ConceptAttention, an approach to interpreting the representations of diffusion transformers. Our method allows users to create rich saliency maps depicting the location and intensity of textual concepts. Our approach exposes how a diffusion model “sees” a generated image and notably requires no additional training. ConceptAttention improves upon widely used approaches like cross attention maps for isolating the location of visual concepts and even generalizes to real world (not just generated) images and video generation models! Our work serves to improve the community’s understanding of how diffusion models represent data and has numerous potential applications, like image editing. About the Speaker Alec Helbling is a PhD student at Georgia Tech. His research focuses on improving the interpretability and controllability of generative models, particularly for image generation. His research is more application focused, and he has have interned at a variety of industrial research labs like Adobe Firefly, IBM Research, and NASA Jet Propulsion Lab. He also has a passion for creating explanatory videos of interesting machine learning and mathematical concepts. 
RelationField: Relate Anything in Radiance Fields Neural radiance fields recently emerged as a 3D scene representation extended by distilling open-vocabulary features from vision-language models. Current methods focus on object-centric tasks, leaving semantic relationships largely unexplored. We propose RelationField, the first method extracting inter-object relationships directly from neural radiance fields using pairs of rays for implicit relationship queries. RelationField distills relationship knowledge from multi-modal LLMs. Evaluated on open-vocabulary 3D scene graph generation and relationship-guided instance segmentation, RelationField achieves state-of-the-art performance. About the Speaker Sebastian Koch is a PhD student at Ulm University and Bosch Center for Artificial Intelligence. He is supervised by Timo Ropinski from Ulm University. His main research interest lies at the intersection of computer vision and robotics. The goal of his PhD is to develop 3D scene representations of the real world that are valuable for robots to navigate and solve tasks within their environment RGB-X Model Development: Exploring Four Channel ML Workflows Machine Learning is rapidly becoming multimodal. With many models in Computer Vision expanding to areas like vision and 3D, one area that has also quietly been advancing rapidly is RGB-X data, such as infrared, depth, or normals. In this talk we will cover some of the leading models in this exploding field of Visual AI and show some best practices on how to work with these complex data formats! About the Speaker Daniel Gural is a seasoned Machine Learning Evangelist with a strong passion for empowering Data Scientists and ML Engineers to unlock the full potential of their data. Currently serving as a valuable member of Voxel51, he takes a leading role in efforts to bridge the gap between practitioners and the necessary tools, enabling them to achieve exceptional outcomes. Daniel’s extensive experience in teaching and developing within the ML field has fueled his commitment to democratizing high-quality AI workflows for a wider audience. |
April 24, 2025 - AI, Machine Learning and Computer Vision Meetup
|
|
April 24, 2025 - AI, Machine Learning and Computer Vision Meetup
2025-04-24 · 17:00
This is a virtual event. Towards a Multimodal AI Agent that Can See, Talk and Act The development of multimodal AI agents marks a pivotal step toward creating systems capable of understanding, reasoning, and interacting with the world in human-like ways. Building such agents requires models that not only comprehend multi-sensory observations but also act adaptively to achieve goals within their environments. In this talk, I will present my research journey toward this grand goal across three key dimensions. First, I will explore how to bridge the gap between core vision understanding and multimodal learning through unified frameworks at various granularities. Next, I will discuss connecting vision-language models with large language models (LLMs) to create intelligent conversational systems. Finally, I will delve into recent advancements that extend multimodal LLMs into vision-language-action models, forming the foundation for general-purpose robotics policies. To conclude, I will highlight ongoing efforts to develop agentic systems that integrate perception with action, enabling them to not only understand observations but also take meaningful actions in a single system. Together, these lead to an aspiration of building the next generation of multimodal AI agents capable of seeing, talking, and acting across diverse scenarios in both digital and physical worlds. About the Speaker Jianwei Yang is a Principal Researcher at Microsoft Research (MSR), Redmond. His research focuses on the intersection of vision and multimodal learning, with an emphasis on bridging core vision tasks with language, building general-purpose and promptable multimodal models, and enabling these models to take meaningful actions in both virtual and physical environments. ConceptAttention: Interpreting the Representations of Diffusion Transformers Recently, diffusion transformers have taken over as the state-of-the-art model class for both image and video generation. However, similar to many existing deep learning architectures, their high-dimensional hidden representations are difficult to understand and interpret. This lack of interpretability is a barrier to their controllability and safe deployment. We introduce ConceptAttention, an approach to interpreting the representations of diffusion transformers. Our method allows users to create rich saliency maps depicting the location and intensity of textual concepts. Our approach exposes how a diffusion model “sees” a generated image and notably requires no additional training. ConceptAttention improves upon widely used approaches like cross attention maps for isolating the location of visual concepts and even generalizes to real world (not just generated) images and video generation models! Our work serves to improve the community’s understanding of how diffusion models represent data and has numerous potential applications, like image editing. About the Speaker Alec Helbling is a PhD student at Georgia Tech. His research focuses on improving the interpretability and controllability of generative models, particularly for image generation. His research is more application focused, and he has have interned at a variety of industrial research labs like Adobe Firefly, IBM Research, and NASA Jet Propulsion Lab. He also has a passion for creating explanatory videos of interesting machine learning and mathematical concepts. 
RelationField: Relate Anything in Radiance Fields Neural radiance fields recently emerged as a 3D scene representation extended by distilling open-vocabulary features from vision-language models. Current methods focus on object-centric tasks, leaving semantic relationships largely unexplored. We propose RelationField, the first method extracting inter-object relationships directly from neural radiance fields using pairs of rays for implicit relationship queries. RelationField distills relationship knowledge from multi-modal LLMs. Evaluated on open-vocabulary 3D scene graph generation and relationship-guided instance segmentation, RelationField achieves state-of-the-art performance. About the Speaker Sebastian Koch is a PhD student at Ulm University and Bosch Center for Artificial Intelligence. He is supervised by Timo Ropinski from Ulm University. His main research interest lies at the intersection of computer vision and robotics. The goal of his PhD is to develop 3D scene representations of the real world that are valuable for robots to navigate and solve tasks within their environment. RGB-X Model Development: Exploring Four Channel ML Workflows Machine Learning is rapidly becoming multimodal. With many models in Computer Vision expanding to areas like vision and 3D, one area that has also quietly been advancing rapidly is RGB-X data, such as infrared, depth, or normals. In this talk we will cover some of the leading models in this exploding field of Visual AI and show some best practices on how to work with these complex data formats! About the Speaker Daniel Gural is a seasoned Machine Learning Evangelist with a strong passion for empowering Data Scientists and ML Engineers to unlock the full potential of their data. Currently serving as a valuable member of Voxel51, he takes a leading role in efforts to bridge the gap between practitioners and the necessary tools, enabling them to achieve exceptional outcomes. Daniel’s extensive experience in teaching and developing within the ML field has fueled his commitment to democratizing high-quality AI workflows for a wider audience. |
April 24, 2025 - AI, Machine Learning and Computer Vision Meetup
|
|
April 24, 2025 - AI, Machine Learning and Computer Vision Meetup
2025-04-24 · 17:00
This is a virtual event. Towards a Multimodal AI Agent that Can See, Talk and Act The development of multimodal AI agents marks a pivotal step toward creating systems capable of understanding, reasoning, and interacting with the world in human-like ways. Building such agents requires models that not only comprehend multi-sensory observations but also act adaptively to achieve goals within their environments. In this talk, I will present my research journey toward this grand goal across three key dimensions. First, I will explore how to bridge the gap between core vision understanding and multimodal learning through unified frameworks at various granularities. Next, I will discuss connecting vision-language models with large language models (LLMs) to create intelligent conversational systems. Finally, I will delve into recent advancements that extend multimodal LLMs into vision-language-action models, forming the foundation for general-purpose robotics policies. To conclude, I will highlight ongoing efforts to develop agentic systems that integrate perception with action, enabling them to not only understand observations but also take meaningful actions in a single system. Together, these lead to an aspiration of building the next generation of multimodal AI agents capable of seeing, talking, and acting across diverse scenarios in both digital and physical worlds. About the Speaker Jianwei Yang is a Principal Researcher at Microsoft Research (MSR), Redmond. His research focuses on the intersection of vision and multimodal learning, with an emphasis on bridging core vision tasks with language, building general-purpose and promptable multimodal models, and enabling these models to take meaningful actions in both virtual and physical environments. ConceptAttention: Interpreting the Representations of Diffusion Transformers Recently, diffusion transformers have taken over as the state-of-the-art model class for both image and video generation. However, similar to many existing deep learning architectures, their high-dimensional hidden representations are difficult to understand and interpret. This lack of interpretability is a barrier to their controllability and safe deployment. We introduce ConceptAttention, an approach to interpreting the representations of diffusion transformers. Our method allows users to create rich saliency maps depicting the location and intensity of textual concepts. Our approach exposes how a diffusion model “sees” a generated image and notably requires no additional training. ConceptAttention improves upon widely used approaches like cross attention maps for isolating the location of visual concepts and even generalizes to real world (not just generated) images and video generation models! Our work serves to improve the community’s understanding of how diffusion models represent data and has numerous potential applications, like image editing. About the Speaker Alec Helbling is a PhD student at Georgia Tech. His research focuses on improving the interpretability and controllability of generative models, particularly for image generation. His research is more application focused, and he has have interned at a variety of industrial research labs like Adobe Firefly, IBM Research, and NASA Jet Propulsion Lab. He also has a passion for creating explanatory videos of interesting machine learning and mathematical concepts. 
RelationField: Relate Anything in Radiance Fields Neural radiance fields recently emerged as a 3D scene representation extended by distilling open-vocabulary features from vision-language models. Current methods focus on object-centric tasks, leaving semantic relationships largely unexplored. We propose RelationField, the first method extracting inter-object relationships directly from neural radiance fields using pairs of rays for implicit relationship queries. RelationField distills relationship knowledge from multi-modal LLMs. Evaluated on open-vocabulary 3D scene graph generation and relationship-guided instance segmentation, RelationField achieves state-of-the-art performance. About the Speaker Sebastian Koch is a PhD student at Ulm University and Bosch Center for Artificial Intelligence. He is supervised by Timo Ropinski from Ulm University. His main research interest lies at the intersection of computer vision and robotics. The goal of his PhD is to develop 3D scene representations of the real world that are valuable for robots to navigate and solve tasks within their environment RGB-X Model Development: Exploring Four Channel ML Workflows Machine Learning is rapidly becoming multimodal. With many models in Computer Vision expanding to areas like vision and 3D, one area that has also quietly been advancing rapidly is RGB-X data, such as infrared, depth, or normals. In this talk we will cover some of the leading models in this exploding field of Visual AI and show some best practices on how to work with these complex data formats! About the Speaker Daniel Gural is a seasoned Machine Learning Evangelist with a strong passion for empowering Data Scientists and ML Engineers to unlock the full potential of their data. Currently serving as a valuable member of Voxel51, he takes a leading role in efforts to bridge the gap between practitioners and the necessary tools, enabling them to achieve exceptional outcomes. Daniel’s extensive experience in teaching and developing within the ML field has fueled his commitment to democratizing high-quality AI workflows for a wider audience. |
April 24, 2025 - AI, Machine Learning and Computer Vision Meetup
|
|
April 24, 2025 - AI, Machine Learning and Computer Vision Meetup
2025-04-24 · 17:00
This is a virtual event.
Towards a Multimodal AI Agent that Can See, Talk and Act
The development of multimodal AI agents marks a pivotal step toward creating systems capable of understanding, reasoning, and interacting with the world in human-like ways. Building such agents requires models that not only comprehend multi-sensory observations but also act adaptively to achieve goals within their environments. In this talk, I will present my research journey toward this grand goal across three key dimensions. First, I will explore how to bridge the gap between core vision understanding and multimodal learning through unified frameworks at various granularities. Next, I will discuss connecting vision-language models with large language models (LLMs) to create intelligent conversational systems. Finally, I will delve into recent advancements that extend multimodal LLMs into vision-language-action models, forming the foundation for general-purpose robotics policies. To conclude, I will highlight ongoing efforts to develop agentic systems that integrate perception with action, enabling them to not only understand observations but also take meaningful actions in a single system. Together, these lead to an aspiration of building the next generation of multimodal AI agents capable of seeing, talking, and acting across diverse scenarios in both digital and physical worlds. About the Speaker Jianwei Yang is a Principal Researcher at Microsoft Research (MSR), Redmond. His research focuses on the intersection of vision and multimodal learning, with an emphasis on bridging core vision tasks with language, building general-purpose and promptable multimodal models, and enabling these models to take meaningful actions in both virtual and physical environments.
ConceptAttention: Interpreting the Representations of Diffusion Transformers
Recently, diffusion transformers have taken over as the state-of-the-art model class for both image and video generation. However, similar to many existing deep learning architectures, their high-dimensional hidden representations are difficult to understand and interpret. This lack of interpretability is a barrier to their controllability and safe deployment. We introduce ConceptAttention, an approach to interpreting the representations of diffusion transformers. Our method allows users to create rich saliency maps depicting the location and intensity of textual concepts. Our approach exposes how a diffusion model “sees” a generated image and notably requires no additional training. ConceptAttention improves upon widely used approaches like cross-attention maps for isolating the location of visual concepts, and it even generalizes to real-world (not just generated) images and to video generation models! Our work serves to improve the community’s understanding of how diffusion models represent data and has numerous potential applications, like image editing. About the Speaker Alec Helbling is a PhD student at Georgia Tech. His research focuses on improving the interpretability and controllability of generative models, particularly for image generation. His research is application-focused, and he has interned at a variety of industrial research labs, including Adobe Firefly, IBM Research, and NASA Jet Propulsion Lab. He also has a passion for creating explanatory videos of interesting machine learning and mathematical concepts.
RelationField: Relate Anything in Radiance Fields
Neural radiance fields recently emerged as a 3D scene representation extended by distilling open-vocabulary features from vision-language models. Current methods focus on object-centric tasks, leaving semantic relationships largely unexplored. We propose RelationField, the first method extracting inter-object relationships directly from neural radiance fields, using pairs of rays for implicit relationship queries. RelationField distills relationship knowledge from multi-modal LLMs. Evaluated on open-vocabulary 3D scene graph generation and relationship-guided instance segmentation, RelationField achieves state-of-the-art performance. About the Speaker Sebastian Koch is a PhD student at Ulm University and the Bosch Center for Artificial Intelligence, supervised by Timo Ropinski from Ulm University. His main research interest lies at the intersection of computer vision and robotics. The goal of his PhD is to develop 3D scene representations of the real world that are valuable for robots to navigate and solve tasks within their environment.
RGB-X Model Development: Exploring Four Channel ML Workflows
Machine Learning is rapidly becoming multimodal. With many Computer Vision models expanding into new areas such as 3D, one area that has also been quietly advancing is RGB-X data, where the extra channel may be infrared, depth, or surface normals. In this talk we will cover some of the leading models in this exploding field of Visual AI and share best practices for working with these complex data formats! (An illustrative four-channel code sketch follows this listing.) About the Speaker Daniel Gural is a seasoned Machine Learning Evangelist with a strong passion for empowering Data Scientists and ML Engineers to unlock the full potential of their data. Currently serving as a valuable member of Voxel51, he takes a leading role in efforts to bridge the gap between practitioners and the tools they need, enabling them to achieve exceptional outcomes. Daniel’s extensive experience in teaching and developing within the ML field has fueled his commitment to democratizing high-quality AI workflows for a wider audience. |
April 24, 2025 - AI, Machine Learning and Computer Vision Meetup
|
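The RGB-X talk above works with four-channel inputs such as RGB plus depth, infrared, or surface normals. As a hedged illustration of one common way to handle such data (not code from the talk), the sketch below widens the first convolution of a pretrained torchvision ResNet-18 so it accepts an extra depth channel; the choice of backbone and of depth as the fourth channel are assumptions made only for this example.

```python
# Minimal sketch (not from the talk): adapting a pretrained RGB backbone
# to RGB-D (four-channel) input by widening its first convolution.
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

model = resnet18(weights=ResNet18_Weights.DEFAULT)

old_conv = model.conv1  # Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
new_conv = nn.Conv2d(
    in_channels=4,
    out_channels=old_conv.out_channels,
    kernel_size=old_conv.kernel_size,
    stride=old_conv.stride,
    padding=old_conv.padding,
    bias=False,
)

with torch.no_grad():
    # Reuse the pretrained RGB filters and initialize the extra (assumed depth)
    # channel with their mean so early activations stay in a familiar range.
    new_conv.weight[:, :3] = old_conv.weight
    new_conv.weight[:, 3:] = old_conv.weight.mean(dim=1, keepdim=True)

model.conv1 = new_conv

rgbd = torch.randn(2, 4, 224, 224)  # toy batch of RGB + depth images
logits = model(rgbd)
print(logits.shape)  # torch.Size([2, 1000])
```

Initializing the new channel's filters from the mean of the pretrained RGB filters is a common heuristic; fine-tuning then adapts all four channels together.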
|
March 20 - AI, Machine Learning and Computer Vision Meetup
2025-03-20 · 15:30
This is a virtual event.
Vision Language Models Are Few-Shot Audio Spectrogram Classifiers
Current audio language models lag behind text-based LLMs and Vision Language Models (VLMs) in reasoning capabilities. Incorporating audio information into VLMs could help us leverage their advanced language reasoning capabilities for audio input. To explore this, the talk will cover how VLMs (such as GPT-4o and Claude 3.5 Sonnet) can recognize audio content from spectrograms and how this approach could enhance audio understanding within VLMs. (See the illustrative spectrogram-classification sketch after this listing.) About the Speaker Satvik Dixit is a master’s student at Carnegie Mellon University, advised by Professors Bhiksha Raj and Chris Donahue. His research interests are Audio/Speech Processing and Multimodal Learning, with a focus on audio understanding and generation tasks. More details can be found at: https://satvik-dixit.github.io/
Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG
In this talk, we will explore Agentic Retrieval-Augmented Generation, or Agentic RAG, a groundbreaking method that enhances Large Language Models (LLMs) by combining intelligent retrieval with autonomous agents. We will discover how Agentic RAG leverages advanced agentic behaviors such as reflection, planning, tool use, and multi-agent collaboration to dynamically refine retrieval strategies and adapt workflows, significantly improving real-time responsiveness and complex task management. About the Speaker Aditi Singh is an Assistant College Lecturer in the Department of Computer Science at Cleveland State University, Cleveland, Ohio. She earned her M.S. and Ph.D. in Computer Science from Kent State University. She was awarded a prestigious Gold Medal for academic excellence during her undergraduate studies. Her research interests include Artificial Intelligence, Large Language Models (LLMs), and Generative AI. Dr. Singh has published over 25 research papers in these fields.
Active Data Curation Effectively Distills Large-Scale Multimodal Models
Knowledge distillation (KD) is the de facto standard for compressing large-scale models into smaller ones. Prior works have explored ever more complex KD strategies involving different objective functions, teacher ensembles, and weight inheritance. In this talk, I will describe an alternative yet simple approach: active data curation as effective distillation for contrastive multimodal pretraining. Our simple online batch selection method, ACID, outperforms strong KD baselines across various model, data, and compute configurations. Further, we find that such an active data curation strategy is in fact complementary to standard KD and can be effectively combined with it to train highly performant, inference-efficient models. Our simple and scalable pretraining framework, ACED, achieves state-of-the-art results across 27 zero-shot classification and retrieval tasks with up to 11% fewer inference FLOPs. We further demonstrate that our ACED models yield strong vision encoders for training generative multimodal models in the LiT-Decoder setting, outperforming larger vision encoders on image-captioning and visual question-answering tasks. About the Speaker Vishaal Udandarao is a third-year ELLIS PhD student, jointly working with Matthias Bethge at the University of Tübingen and Samuel Albanie at Google DeepMind.
He completed his undergraduate degree in computer science at IIIT Delhi from 2016 to 2020 and his master’s in machine learning at the University of Cambridge in 2021.
Dataset Safari: Adventures from 2024’s Top Computer Vision Conferences
Datasets are the lifeblood of machine learning, driving innovation and enabling breakthrough applications in computer vision and AI. This talk presents a curated exploration of the most compelling visual datasets unveiled at CVPR, ECCV, and NeurIPS 2024, with a unique twist: we’ll explore them live using FiftyOne, the open-source tool for dataset curation and analysis. Using FiftyOne’s powerful visualization and analysis capabilities, we’ll take a deep dive into these collections, examining their unique characteristics through interactive sessions. We’ll demonstrate how to explore, visualize, and analyze these datasets in practice (see the FiftyOne sketch after this listing).
Whether you’re a researcher, practitioner, or dataset enthusiast, this session will provide hands-on insights into both the datasets shaping our field and practical tools for dataset exploration. Join us for a live demonstration of how modern dataset analysis tools can unlock deeper understanding of the data driving AI forward. About the Speaker Harpreet Sahota is a hacker-in-residence and machine learning engineer with a passion for deep learning and generative AI. He’s got a deep interest in RAG, Agents, and Multimodal AI. |
March 20 - AI, Machine Learning and Computer Vision Meetup
|
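The spectrogram-classification talk above describes prompting a general-purpose VLM with rendered spectrograms. The sketch below is a rough, hedged outline of that kind of pipeline, not the speaker's code: it renders a mel spectrogram with librosa and then asks a vision-capable chat model to pick a label. The audio filename, the candidate labels, and the model name are placeholder assumptions, and the OpenAI client call follows the commonly documented image-input pattern.

```python
# Hedged sketch (not the speaker's code): classify audio by showing a VLM
# a rendered mel spectrogram. Assumes a local audio file and an API key.
import base64

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
from openai import OpenAI

# 1. Render a mel spectrogram of the clip to a PNG.
y, sr = librosa.load("clip.wav", sr=16000)  # placeholder audio file
mel = librosa.feature.melspectrogram(y=y, sr=sr)
mel_db = librosa.power_to_db(mel, ref=np.max)
fig, ax = plt.subplots(figsize=(6, 3))
librosa.display.specshow(mel_db, sr=sr, x_axis="time", y_axis="mel", ax=ax)
fig.savefig("spectrogram.png", bbox_inches="tight")

# 2. Ask a vision-language model to classify the spectrogram.
with open("spectrogram.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",  # assumed model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "This is a mel spectrogram of an audio clip. Which of "
                     "these classes does it most likely contain: speech, "
                     "music, dog bark, or siren? Answer with one label."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

In a few-shot variant, one could additionally include a handful of labeled example spectrograms as earlier image messages before the query, which is the setting the talk title refers to.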
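The Dataset Safari session above centers on exploring datasets live in FiftyOne. As a small, hedged example of that workflow (not the presenter's demo), the snippet below loads FiftyOne's bundled "quickstart" zoo dataset, scores samples by visual uniqueness, and opens the app on the most unusual ones; the specific CVPR/ECCV/NeurIPS 2024 datasets from the talk are not assumed to be available as zoo datasets here.

```python
# Minimal sketch of interactive dataset exploration with FiftyOne.
# Uses the small "quickstart" zoo dataset as a stand-in for the
# conference datasets discussed in the talk.
import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.brain as fob

dataset = foz.load_zoo_dataset("quickstart")

# Score samples by visual uniqueness and inspect the most unusual ones first.
fob.compute_uniqueness(dataset)
unusual = dataset.sort_by("uniqueness", reverse=True).limit(25)

# Launch the app to browse images, labels, and computed fields interactively.
session = fo.launch_app(unusual)
session.wait()
```

The same pattern of load, compute a field, sort or filter, then launch the app applies equally to custom datasets imported from disk.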