
Join our virtual Meetup to hear talks from experts on cutting-edge topics at the intersection of Visual AI and video use cases.

Time and Location

Feb 11, 2026 | 9 - 11 AM Pacific | Online. Register for the Zoom!

VIDEOP2R: Video Understanding from Perception to Reasoning

Reinforcement fine-tuning (RFT), a two-stage framework consisting of supervised fine-tuning (SFT) followed by reinforcement learning (RL), has shown promising results in improving the reasoning ability of large language models (LLMs). Yet extending RFT to large video language models (LVLMs) remains challenging. We propose VideoP2R, a novel process-aware video RFT framework that enhances video reasoning by modeling perception and reasoning as distinct processes. In the SFT stage, we develop a three-step pipeline to generate VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought (CoT) dataset for perception and reasoning.

In the RL stage, we introduce a novel process-aware group relative policy optimization (PA-GRPO) algorithm that supplies separate rewards for perception and reasoning. Extensive experiments show that VideoP2R achieves state-of-the-art (SotA) performance on six out of seven video reasoning and understanding benchmarks. Ablation studies further confirm the effectiveness of our process-aware modeling and PA-GRPO, and demonstrate that the model's perception output is information-sufficient for downstream reasoning.
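
The split reward is the heart of PA-GRPO. As a rough illustration only (the paper's actual reward functions, verifiers, and weighting are not reproduced here; `w_p` and `w_r` are assumed knobs), a group-relative advantage computed per process might look like:

```python
import numpy as np

def group_relative_advantage(rewards):
    """GRPO-style advantage: normalize each rollout's reward within its group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def pa_grpo_advantage(perception_rewards, reasoning_rewards, w_p=0.5, w_r=0.5):
    # Score perception and reasoning separately, then combine the
    # group-relative advantages (the weighting scheme is an assumption)
    adv_p = group_relative_advantage(perception_rewards)
    adv_r = group_relative_advantage(reasoning_rewards)
    return w_p * adv_p + w_r * adv_r

# Four rollouts for one video-question pair: a rollout that perceived
# correctly but reasoned poorly still earns partial credit
print(pa_grpo_advantage([1.0, 0.0, 1.0, 0.5], [0.0, 0.0, 1.0, 0.5]))
```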

About the Speaker

Yifan Jiang is a third-year Ph.D. student at the Information Sciences Institute at the University of Southern California (USC-ISI), advised by Dr. Jay Pujara, focusing on natural language processing, commonsense reasoning, and multimodal large language models.

Layer-Aware Video Composition via Split-then-Merge

Split-then-Merge (StM) is a novel generative framework that overcomes data scarcity in video composition by splitting unlabeled videos into separate foreground and background layers for self-supervised learning. Using a transformation-aware training pipeline with multi-layer fusion, the model learns to realistically compose dynamic subjects into diverse scenes without relying on expensive annotated datasets. This presentation will cover the problem of video composition and the details of StM, an approach that tackles the problem from a generative AI perspective. We will conclude by demonstrating how StM works and how it outperforms state-of-the-art methods in both quantitative benchmarks and qualitative evaluations.
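
To make the self-supervision loop concrete, here is a minimal sketch of the split-then-merge idea (all interfaces are assumptions: `masks` from an off-the-shelf segmenter, and `augment` and `compose_model` as placeholders; the real StM pipeline and fusion architecture differ):

```python
import torch
import torch.nn.functional as F

def split_then_merge_step(video, masks, compose_model, augment):
    # video: (B, T, C, H, W) unlabeled clip; masks: (B, T, 1, H, W)
    foreground = video * masks          # dynamic-subject layer
    background = video * (1 - masks)    # scene layer
    # Transformation-aware training: perturb the foreground so the
    # model cannot simply paste pixels back into place
    fg_aug = augment(foreground)
    # Multi-layer fusion: re-compose a realistic video from the layers
    composite = compose_model(fg_aug, background)
    # Reconstructing the original clip provides free supervision,
    # sidestepping the need for annotated composition datasets
    return F.l1_loss(composite, video)
```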

About the Speaker

Ozgur Kara is a fourth-year Computer Science PhD student at the University of Illinois Urbana-Champaign (UIUC), advised by Founder Professor James M. Rehg. His research builds the next generation of video AI by tackling three core challenges: efficiency, controllability, and safety.

Video Reasoning for Worker Safety

Ensuring worker safety in industrial environments requires more than object detection or motion tracking; it demands a genuine understanding of human actions, context, and risk. This talk demonstrates how NVIDIA Cosmos Reason, a multimodal video-reasoning model, interprets workplace scenarios with sophisticated temporal and semantic awareness, identifying nuanced safe and unsafe behaviors that conventional vision systems frequently overlook.

By integrating Cosmos Reason with FiftyOne, users achieve both automated safety assessments and transparent, interpretable explanations revealing why specific actions are deemed hazardous. Using a curated worker-safety dataset of authentic factory-floor footage, we show how video reasoning enhances audits, training, and compliance workflows while minimizing dependence on extensive labeled datasets. The resulting system demonstrates the potential of explainable multimodal AI to enable safer, more informed decision-making across manufacturing, logistics, construction, healthcare, and other sectors where understanding human behavior is essential.
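
As a sketch of what such an integration can look like in practice (assumptions: `ask_cosmos_reason` stands in for however you serve the model, e.g. a hosted endpoint, and is not a FiftyOne or NVIDIA API; the dataset path is a placeholder):

```python
import fiftyone as fo
from fiftyone import ViewField as F

def ask_cosmos_reason(video_path, prompt):
    """Placeholder: call your Cosmos Reason deployment here and parse
    its answer into a (label, explanation) pair."""
    raise NotImplementedError

dataset = fo.Dataset.from_videos_dir("/data/factory_floor_clips")

PROMPT = (
    "Does the worker in this clip behave safely? Answer 'safe' or "
    "'unsafe' and explain which actions create or mitigate risk."
)

for sample in dataset:
    label, explanation = ask_cosmos_reason(sample.filepath, PROMPT)
    sample["safety"] = fo.Classification(label=label)
    sample["safety_explanation"] = explanation  # keep the "why" queryable
    sample.save()

# Audit only the flagged clips, with explanations next to the video
unsafe = dataset.match(F("safety.label") == "unsafe")
session = fo.launch_app(unsafe)
```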

About the Speaker

Paula Ramos has a PhD in Computer Vision and Machine Learning, with more than 20 years of experience in the technological field. She has been developing novel integrated engineering technologies, mainly in Computer Vision, robotics, and Machine Learning applied to agriculture, since the early 2000s in Colombia.

Video Intelligence Is Going Agentic

Video content has become ubiquitous in our digital world, yet the tools for working with video have remained largely unchanged for decades. This talk explores how the convergence of foundation models and agent architectures is fundamentally transforming video interaction and creation. We'll examine how video-native foundation models, multimodal interfaces, and agent transparency are reshaping enterprise media workflows through a deep dive into Jockey, a pioneering video agent system.

About the Speaker

James Le currently leads the developer experience function at TwelveLabs, a startup building foundation models for video understanding. He previously worked in the MLOps space and ran a blog/podcast on the Data & AI infrastructure ecosystem.

Feb 11 - Visual AI for Video Use Cases

AI in Robotics Meetup 2026-02-04 · 16:00

We're holding our biggest-ever meetup on February 4th! We've gotten A LOT of requests from the AI on the Amstel community for a meetup on the current state of AI in robotics. So it's finally happening: we're doing the AI in robotics meetup, at the biggest venue we've ever used!

This will be a unique meetup. During the networking period before and after the panel discussion, we’ll have space for Dutch robotics startups to showcase their robotics projects. Get ready to check out some robots in action!

We'll announce the panel soon. This meetup on AI in robotics will be held at the AMS Institute, a scientific institution and graduate school founded by TU Delft, Wageningen University, MIT, and the City of Amsterdam. It has a 400-person theater in Amsterdam. We're getting ready for a big event!

For those who are new to this meetup, AI on the Amstel is a monthly technical meetup that features knowledgeable speakers discussing "in the weeds" AI topics and challenges. This meetup caters to engineers, product managers, and founders in AI.

Doors open at 17:00, and the panel will start around 18:00. Finally, since we get asked about this a lot, it's fine if you arrive late! We know work can be busy!

AI in Robotics Meetup

❄️ DSF WinterFest 2025: Global Online Summit ❄️

Join the global data celebration!

Monday 24th to Friday 28th November 2025 | Online | 2-3 sessions per day | Theme: Innovating with Data

DSF WinterFest is back, and this year, it’s going global! Join our 50,000-strong community for a week of world-class talks, tutorials, and panels exploring how data, AI, and analytics are reshaping the world. Expect inspiring content, expert insights, and the cosy, welcoming DSF atmosphere we are known for, all from the comfort of your own space!

Why join?

  • 🌍 A global stage with speakers and attendees from every corner of the world
  • 🎟️ One ticket for the full week. Register once and access every session
  • 💻 Easy access from anywhere. Join live or catch replays in your own time
  • ☕ Cosy community vibe. No travel, no stress, just data and connection

➡️ REGISTER HERE FOR FREE! ⬅️

🎟️ Tickets:

Choose your experience and secure your spot today:

  • Free Pass - Watch live and enjoy replays until 30 November 2025
  • Upgrade at Checkout - Get extended replay access until May 2026

Register on our website to receive your joining links, add sessions to your calendar, and tune in live from anywhere in the world.

Please note: Clicking “Attend” on Meetup does not register you for this summit. You must register via our website to receive your links.

🎁 Competition:

We’re spreading festive cheer! One lucky attendee will win a £300 Amazon gift voucher (or equivalent in your currency). Find out more here.

❄️❄️❄️

Session details:

💡 Context-Aware Vision Systems Using Knowledge Graphs
🗓️ Tuesday 25th November, 15:00 GMT
🗣️ Niyati P, Software/ML Lead

Traditional computer vision systems often rely solely on pixel-level features and deep learning to interpret images, limiting their ability to understand complex scenes or generalize across contexts. This talk introduces a new paradigm: context-aware vision systems powered by knowledge graphs. By embedding structured semantic knowledge into vision pipelines, machines can infer not just what is seen, but why it matters. We explore how knowledge graphs provide contextual cues—such as object relationships, scene hierarchies, and domain-specific constraints—that enhance visual perception and reasoning. From improving object recognition in ambiguous environments to enabling semantic scene parsing and zero-shot learning, this approach bridges the gap between raw visual data and high-level cognition. The session will highlight recent breakthroughs, implementation strategies using Graph Neural Networks (GNNs), and applications in domains like autonomous driving, medical imaging, and industrial robotics. Attendees will leave with insights on designing vision systems that not only see—but understand.
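
One way to picture the fusion of detector features with graph structure is a toy message-passing layer in plain PyTorch (a sketch only; the speaker's architecture is not public here, and the relation set below is invented for illustration):

```python
import torch
import torch.nn as nn

class SceneGraphGNN(nn.Module):
    """Refine per-object features using relations from a knowledge graph."""
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(2 * dim, dim)
        self.update = nn.GRUCell(dim, dim)

    def forward(self, node_feats, edges):
        # node_feats: (N, D) embeddings for N detected objects
        # edges: (E, 2) directed (src, dst) pairs drawn from KG relations,
        # e.g. "ladder near person" as an edge between their nodes
        src, dst = edges[:, 0], edges[:, 1]
        messages = self.msg(torch.cat([node_feats[src], node_feats[dst]], -1))
        # Sum incoming messages per node, then gate them into the features
        agg = torch.zeros_like(node_feats).index_add_(0, dst, messages)
        return self.update(agg, node_feats)  # context-refined features

# Three detected objects, two knowledge-graph relations
feats = torch.randn(3, 64)
edges = torch.tensor([[0, 1], [2, 1]])
print(SceneGraphGNN(64)(feats, edges).shape)  # torch.Size([3, 64])
```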

➡️ REGISTER HERE FOR FREE! ⬅️

❄️❄️❄️

🔗 How to join:

Once registered, you’ll receive your unique joining link by email, plus handy reminders one week, one day, and one hour before each session. Don't forget to add the sessions you are attending to your calendar. If you can’t make it live, don’t worry, your ticket includes replay access until 30 November 2025 (or May 2026 with the upgrade).

📘 Reminders:

Time zones: All sessions are listed in GMT - please check your local time when registering.

Recordings: Access replays until 30 November 2025 with a free pass, or until May 2026 with an upgraded ticket.

Please note: Clicking “Attend” on Meetup does not register you for this summit. You must register via our website to receive your links.

➡️ REGISTER HERE FOR FREE! ⬅️

Join the Celebration ❄️ Five days. Global speakers. Cutting-edge insights. Free to join live - replays included. Upgrade for extended access. Register now and be part of the global data community shaping the future. #DSFWinterFest

Context-Aware Vision Systems Using Knowledge Graphs

Are you a cybersecurity professional looking to connect with like-minded professionals, share experiences, and make friends? Look no further! Join us for a special edition of the Berlin Cybersecurity Social hosted in collaboration with the Venture Café Berlin and the AI Ethics Action Hub for a fantastic evening of networking.

Agenda:

  • 5:00 PM - 5:15 PM: Welcome
  • 5:15 PM – 5:50 PM: Lightning Talk: AI Threat Modeling: How to Bring the Right Mindset to Detect and Prevent AI Risk - Iryna Schwindt. In this talk, we'll explore the full spectrum of AI risks—not just security-related ones—and why understanding the application context is critical. You'll learn:
    - The types of AI risks (not only security risks)
    - Why AI application context matters
    - How to identify potential threats and apply effective controls and guardrails
    - How to cultivate the right mindset to detect and prevent AI risks
  • 5:50 PM - 6:30 PM: Panel: AI Meets Cybersecurity: Building Smarter, Safer Systems at Scale - Jose Quesada, Diana Waithanji, Ali Yazdani, Pranav Vattaparambil As AI rapidly integrates into every layer of digital infrastructure, the stakes for cybersecurity have never been higher. This panel brings together experts from across the security spectrum—ranging from DevSecOps and enterprise risk to cybersecurity strategy—to explore how AI is transforming threat detection, governance, and secure system design. We’ll dive into real-world use cases, emerging risks, and what it takes to build scalable, intelligent, and secure systems in an increasingly AI-driven world.
  • 6:30 PM - 8:00 PM: Breakout Session: Cybersecurity in the Age of AI: Ethics & Human-Centered Future *Featuring Azer Aliyev (speakinprivate.com), Gunay Kazimzade (Mercedes-Benz Consulting), and Justin Shenk (AI Salon Berlin), this fast-paced session brings together innovators, researchers, and tech leaders to explore how to build AI systems that protect privacy, bolster trust, and keep humans at the heart of digital transformation.

*This session is organised by the AI Ethics Action Hub

About the Speakers:

Iryna Schwindt is a cybersecurity engineer currently at Vodafone and a co-author on the OWASP AI Exchange (https://owaspai.org/) project, contributing to the EU AI Act security standard and AI red teaming.

Jose Quesada is the founder and director of Data Science Retreat (DSR), an advanced ML bootcamp that has helped over 300 professionals land data science roles. With a PhD and 20+ years in machine learning, Jose brings a unique blend of technical depth and creative flair—he’s also a former photorealism artist. He has advised on impactful projects ranging from malaria diagnostics to sustainability-focused robotics.

Diana Waithanji is a Cybersecurity Engineer at SAP SE, with experience working across Europe and Africa. She is an advocate for data privacy as a fundamental human right and serves on two technical committees at the Kenya Bureau of Standards. Diana is also a board member at Nivishe Foundation, where she supports youth mental health through safe spaces. Her work bridges global standards, social impact, and cutting-edge security practices.

Ali Yazdani is a seasoned security professional with over a decade of experience spanning offensive security and secure development practices. Starting his career as a penetration tester, he now specializes in building scalable DevSecOps programs and embedding security into engineering workflows. Ali brings deep technical knowledge and a pragmatic approach to security culture. His mission is to empower teams to build safer software at scale, and he is currently a founder at Scandog.io.

Pranav Vattaparambil is Chief Security Officer at Unosecur (https://www.unosecur.com/) as well as a security and product strategist with deep expertise in fintech. Formerly VP of Cybersecurity at the EU’s largest Banking-as-a-Service company, he also advises multiple startups on navigating security, risk, and go-to-market strategy. Pranav bridges the gap between technical execution and business impact, especially in regulated industries like banking and crypto. His focus is on helping companies build secure, scalable products from day one.

About Venture Café Berlin: Venture Café Berlin connects a community of innovators and entrepreneurs with free high-impact programming and events. Venture Café is a part of the CIC network, whose mission is to fix the world through innovation.

About Berlin Cybersecurity Social: This meetup is open to cybersecurity professionals of all levels, from beginners to experts. Whether you're a seasoned pro or just starting your journey in the field, this event is the perfect opportunity to connect with others who share your passion for cybersecurity.

About the AI Ethics Action Hub: A global, interdisciplinary collective dedicated to advancing ethical, inclusive, and accountable AI. We believe technology should be designed to respect human dignity, planetary well-being, and intergenerational justice.

Berlin Cybersecurity Social #18: AI & Cybersecurity Sessions

When and Where

July 17, 2025 | 10:00 – 11:30 AM Pacific

Virtually over Zoom. Sign up!

Using VLMs to Navigate the Sea of Data

At SEA.AI, we aim to make ocean navigation safer by enhancing situational awareness with AI. To develop our technology, we process huge amounts of maritime video from onboard cameras. In this talk, we'll show how we use Vision-Language Models (VLMs) to streamline our data workflows, from semantic search using embeddings to automatically surfacing rare or high-interest events like whale spouts or drifting containers. The goal: smarter data curation with minimal manual effort.
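
The embedding-search ingredient can be reproduced with off-the-shelf components. Here is a minimal sketch using CLIP from Hugging Face (SEA.AI's production models and pipeline are not public; the frame paths and query below are placeholders):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_frames(paths):
    inputs = processor(images=[Image.open(p) for p in paths],
                       return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def search(query, frame_paths, k=5):
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        text = torch.nn.functional.normalize(
            model.get_text_features(**inputs), dim=-1)
    sims = (embed_frames(frame_paths) @ text.T).squeeze(-1)
    top = sims.topk(min(k, len(frame_paths)))
    return [(frame_paths[int(i)], float(s))
            for s, i in zip(top.values, top.indices)]

# Surface rare, high-interest events from unlabeled footage
hits = search("whale spout near the horizon",
              ["frame_0001.jpg", "frame_0002.jpg"])
```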

About the Speaker

Daniel Fortunato, an AI Researcher at SEA.AI, is dedicated to enhancing efficiency through data workflow optimizations. Daniel’s background includes a Master’s degree in Electrical Engineering, providing a robust framework for developing innovative AI solutions. Beyond the lab, he is an enthusiastic amateur padel player and surfer.

SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation

Referring Video Object Segmentation (RVOS) involves segmenting objects in video based on natural language descriptions. SAMWISE builds on Segment Anything 2 (SAM2) to support RVOS in streaming settings, without fine-tuning and without relying on external large Vision-Language Models. We introduce a novel adapter that injects temporal cues and multi-modal reasoning directly into the feature extraction process, enabling both language understanding and motion modeling. We also unveil a phenomenon we denote tracking bias, where SAM2 may persistently follow an object that only loosely matches the query, and propose a learnable module to mitigate it. SAMWISE achieves state-of-the-art performance across multiple benchmarks with less than 5M additional parameters.
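
To illustrate what injecting temporal cues and multi-modal reasoning into feature extraction can mean, here is a toy adapter in PyTorch (a concept sketch only; the actual SAMWISE adapter design, and where it hooks into SAM2, differ):

```python
import torch
import torch.nn as nn

class TemporalTextAdapter(nn.Module):
    """Small trainable module mixing frozen per-frame features with
    language and temporal context; the backbone stays frozen."""
    def __init__(self, dim, n_heads=8):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.time_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, frame_tokens, text_tokens):
        # frame_tokens: (T, P, D) frozen per-frame features;
        # text_tokens: (1, L, D) encoded referring expression
        text = text_tokens.expand(frame_tokens.shape[0], -1, -1)
        # 1) language cue: patch tokens attend to the text query
        x, _ = self.text_attn(frame_tokens, text, text)
        x = self.norm1(frame_tokens + x)
        # 2) temporal cue: each patch position attends across frames
        xt = x.transpose(0, 1)                    # (P, T, D)
        t, _ = self.time_attn(xt, xt, xt)
        return self.norm2(x + t.transpose(0, 1))  # residual, lightweight

# 8 frames x 196 patches, a 7-token query, 256-dim features
out = TemporalTextAdapter(256)(torch.randn(8, 196, 256),
                               torch.randn(1, 7, 256))
print(out.shape)  # torch.Size([8, 196, 256])
```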

About the Speaker

Claudia Cuttano is a PhD student at Politecnico di Torino (VANDAL Lab), currently on a research visit at TU Darmstadt, where she works with Prof. Stefan Roth in the Visual Inference Lab. Her research focuses on semantic segmentation, with particular emphasis on multi-modal understanding and the use of foundation models for pixel-level tasks.

Building Efficient and Reliable Workflows for Object Detection

Training complex AI models at scale requires orchestrating multiple steps into a reproducible workflow and understanding how to optimize resource utilization. Modern MLOps practices help streamline these processes, improving the efficiency and reliability of your AI pipelines.
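
The core habit is making each step re-runnable and cache-aware. A tiny content-addressed sketch in plain Python (illustrative only; real pipelines would use a workflow orchestrator, and the step names, paths, and metrics below are invented):

```python
import hashlib, json, pathlib

CACHE = pathlib.Path("artifacts")

def step(name, fn, deps, params):
    """Re-run a step only when its inputs or parameters change."""
    key = hashlib.sha256(
        json.dumps([name, deps, params], sort_keys=True).encode()
    ).hexdigest()[:12]
    out = CACHE / f"{name}-{key}.json"
    if out.exists():                      # cache hit: skip recompute
        return json.loads(out.read_text())
    result = fn(**params)
    CACHE.mkdir(exist_ok=True)
    out.write_text(json.dumps(result))
    return result

# Chain detection-training steps; changing a param invalidates downstream
data = step("ingest", lambda src: {"n": 10_000, "src": src},
            deps=[], params={"src": "s3://bucket/raw"})
model = step("train", lambda epochs, n: {"mAP": 0.41, "epochs": epochs},
             deps=[data], params={"epochs": 10, "n": data["n"]})
```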

About the Speaker

Sage Elliott is an AI Engineer with a background in computer vision, LLM evaluation, MLOps, IoT, and Robotics. He’s taught thousands of people at live workshops. You can usually find him in Seattle biking around to parks or reading in cafes, catching up on the latest read for AI Book Club.

Your Data Is Lying to You: How Semantic Search Helps You Find the Truth in Visual Datasets

High-performing models start with high-quality data—but finding noisy, mislabeled, or edge-case samples across massive datasets remains a significant bottleneck. In this session, we’ll explore a scalable approach to curating and refining large-scale visual datasets using semantic search powered by transformer-based embeddings. By leveraging similarity search and multimodal representation learning, you’ll learn to surface hidden patterns, detect inconsistencies, and uncover edge cases. We’ll also discuss how these techniques can be integrated into data lakes and large-scale pipelines to streamline model debugging, dataset optimization, and the development of more robust foundation models in computer vision. Join us to discover how semantic search reshapes how we build and refine AI systems.
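
One concrete way to wire this up is with FiftyOne Brain's similarity indexing (a sketch; the talk's exact tooling and dataset are not specified, and the `quickstart` dataset and text query below are stand-ins):

```python
import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz

# Index the dataset with CLIP embeddings; a multimodal model means the
# same index supports both image and natural-language queries
dataset = foz.load_zoo_dataset("quickstart")
fob.compute_similarity(dataset, model="clip-vit-base32-torch",
                       brain_key="img_sim")

# Find near-duplicates / neighbors of a suspicious sample
dupes = dataset.sort_by_similarity(dataset.first().id, k=25,
                                   brain_key="img_sim")

# Mine edge cases with a text query, then review them in the App
edge_cases = dataset.sort_by_similarity("pedestrian at night in heavy fog",
                                        k=50, brain_key="img_sim")
session = fo.launch_app(edge_cases)
```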

About the Speaker

Paula Ramos has a PhD in Computer Vision and Machine Learning, with more than 20 years of experience in the technological field. She has been developing novel integrated engineering technologies, mainly in Computer Vision, robotics, and Machine Learning applied to agriculture, since the early 2000s in Colombia. During her PhD and Postdoc research, she deployed multiple low-cost, smart edge and IoT computing technologies that can be operated by users without expertise in computer vision systems, such as farmers. The central objective of Paula's research has been to develop intelligent systems/machines that can understand and recreate the visual world around us to solve real-world needs, such as those in the agricultural industry.

July 17 - AI, ML and Computer Vision Meetup

When and Where

July 17\, 2025 \| 10:00 – 11:30 AM Pacific

Virtually over Zoom. Sign up!

Using VLMs to Navigate the Sea of Data

At SEA.AI, we aim to make ocean navigation safer by enhancing situational awareness with AI. To develop our technology, we process huge amounts of maritime video from onboard cameras. In this talk, we’ll show how we use Vision-Language Models (VLMs) to streamline our data workflows; from semantic search using embeddings to automatically surfacing rare or high-interest events like whale spouts or drifting containers. The goal: smarter data curation with minimal manual effort.

About the Speaker

Daniel Fortunato, an AI Researcher at SEA.AI, is dedicated to enhancing efficiency through data workflow optimizations. Daniel’s background includes a Master’s degree in Electrical Engineering, providing a robust framework for developing innovative AI solutions. Beyond the lab, he is an enthusiastic amateur padel player and surfer.

SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation

Referring Video Object Segmentation (RVOS) involves segmenting objects in video based on natural language descriptions. SAMWISE builds on Segment Anything 2 (SAM2) to support RVOS in streaming settings, without fine-tuning and without relying on external large Vision-Language Models. We introduce a novel adapter that injects temporal cues and multi-modal reasoning directly into the feature extraction process, enabling both language understanding and motion modeling. We also unveil a phenomenon we denote tracking bias, where SAM2 may persistently follow an object that only loosely matches the query, and propose a learnable module to mitigate it. SAMWISE achieves state-of-the-art performance across multiple benchmarks with less than 5M additional parameters.

About the Speaker

Claudia Cuttano is a PhD student at Politecnico di Torino (VANDAL Lab), currently on a research visit at TU Darmstadt, where she works with Prof. Stefan Roth in the Visual Inference Lab. Her research focuses on semantic segmentation, with particular emphasis on multi-modal understanding and the use of foundation models for pixel-level tasks.

Building Efficient and Reliable Workflows for Object Detection

Training complex AI models at scale requires orchestrating multiple steps into a reproducible workflow and understanding how to optimize resource utilization for efficient pipelines. Modern MLOps practices help streamline these processes, improving the efficiency and reliability of your AI pipelines.
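As a concrete (if toy) illustration of a reproducible multi-step workflow, the standard-library sketch below caches each step keyed by its name and parameters, so only changed steps re-run; production pipelines would typically delegate this to an orchestrator:

```python
# A minimal sketch of step caching for reproducible pipelines, using only the
# standard library. Step names, sources, and outputs are hypothetical.
import hashlib
import json
import pickle
from pathlib import Path

CACHE = Path("pipeline_cache")
CACHE.mkdir(exist_ok=True)

def step(fn):
    """Re-run a pipeline step only when its name or parameters change."""
    def wrapper(**params):
        key = hashlib.sha256(
            json.dumps({"step": fn.__name__, "params": params}, sort_keys=True).encode()
        ).hexdigest()
        path = CACHE / f"{key}.pkl"
        if path.exists():
            return pickle.loads(path.read_bytes())
        result = fn(**params)
        path.write_bytes(pickle.dumps(result))
        return result
    return wrapper

@step
def prepare_data(source: str):
    return f"records from {source}"

@step
def train_detector(data_key: str, epochs: int):
    return {"weights": f"trained {epochs} epochs on {data_key}"}

data = prepare_data(source="s3://bucket/raw")          # hypothetical source
model = train_detector(data_key=str(data), epochs=10)  # served from cache on re-run
print(model)
```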

About the Speaker

Sage Elliott is an AI Engineer with a background in computer vision, LLM evaluation, MLOps, IoT, and Robotics. He’s taught thousands of people at live workshops. You can usually find him in Seattle biking around to parks or reading in cafes, catching up on the latest read for AI Book Club.

Talk Title: "Navigating the PhD Journey"

Description: In this talk, Maryam will share how her PhD research on robotics and brain-computer interfaces shaped her academic path as an interdisciplinary researcher who combines approaches from fields such as engineering, neuroscience, and human-computer interaction to address real-world problems. She will present specific projects that exemplify this interdisciplinary approach, highlighting the challenges and lessons learned along the way.

Speaker/Bio: Dr. ing. Maryam Alimardani | Associate Professor @ Vrije Universiteit Amsterdam | https://www.linkedin.com/in/maryam-alimardani-a25a5b64/ & https://scholar.google.co.jp/citations?user=QGYjBQoAAAAJ&hl=en

Maryam Alimardani is an associate professor at Vrije Universiteit Amsterdam in the Netherlands, specializing in brain-computer interfacing (BCI) and human-computer interaction (HCI). She earned her PhD from the Intelligent Robotics Laboratory at Osaka University, Japan. Her research focuses on developing BCI systems that facilitate personalized interactions with technology, particularly for adaptive training and learning purposes. By utilizing EEG brain signals and advanced AI models, she designs intelligent interfaces capable of monitoring users’ mental states—such as attention and workload—and delivering tailored feedback. Her work has integrated BCI systems into various interactive technologies, including virtual reality (VR) simulations and social robots, enhancing user control and overall experience.

Maryam is not only a successful academic and top-level AI researcher: her TED-Ed talk (“Are you a body with a mind or a mind with a body?”) has over 3 million views and is among the most viewed videos on this topic on YouTube.

https://youtu.be/ILDy6kYU-xQ?si=b0RQq8Tnyr6YDqoQ

https://research.vu.nl/en/persons/maryam-alimardani
https://sites.google.com/site/maryamalimardany/home
https://www.researchgate.net/lab/Maryam-Alimardani-Lab-2
https://ieeexplore.ieee.org/author/37085
https://independent.academia.edu/MaryamAlimardani2s
https://www.catalyzex.com/author/Maryam%20Alimardani
https://www.youtube.com/watch?v=YjEwbAMrfH0
https://www.catalyzex.com/paper/predicting-workload-in-virtual-flight

LinkedIn: https://www.linkedin.com/in/maryam-alimardani-a25a5b64/

Moderator and Host of the event: Radovan Kavický, President & Principal Data Scientist @ GapData Institute; former AI & Data Science Evangelist @ AIslovakIA - National platform for AI development in Slovakia

Fun Fact: Radovan previously interviewed Maryam at an IEEE conference (DISA in Košice). If interested, you can find the published interview here: Vedkyňa skúma, ako sa dá vidieť do mozgu. Pýtali sme sa, čo s tým robiť (“A scientist explores how we can see into the brain. We asked what to do with it”).

Registration:

You can register via the Meetup.com group event (https://www.meetup.com/pydata-slovakia-bratislava/events/307510355/) or via Eventbrite (https://www.eventbrite.com/e/pydata-slovakia-meetup-30-maryam-alimardani-navigating-the-phd-journey-tickets-1341011775319?aff=oddtdtcreator). You can also find the event on Facebook (https://www.facebook.com/events/1823801225067220) and on LinkedIn (https://www.linkedin.com/events/7322307170060935168/about/).

[Disclaimer: if you only mark “going” on the Facebook event, we can’t guarantee your seat.]

Language of the event: English


PyData Bratislava [Python Data Enthusiasts and Users, Data Scientists & Statisticians of all levels from Slovakia]

PyData is a group for users and developers of data analysis tools to share ideas and learn from each other. We gather to discuss how best to apply Python tools, as well as R and Julia, to meet the evolving challenges in data management, processing, analytics, and visualization. PyData is organized by NumFOCUS.org, a 501(c)3 non-profit in the United States.

The PyData Code of Conduct governs this meetup. To discuss any issues or concerns relating to the code of conduct or the behavior of anyone at a PyData meetup, please contact the organizer or NumFOCUS Executive Director Leah Silen (+1 512-222-5449; [email protected]).

Our Facebook group: https://www.facebook.com/groups/1813599648877946/

Our Twitter account: https://twitter.com/PyDataBA

Our LinkedIn group: https://www.linkedin.com/groups/13506080


Organizers: GapData Institute (GDI, https://www.gapdata.org/) is a nonprofit, nonpartisan research institution harnessing the power of data & the wisdom of economics for the public good.

|| Data. Think. Change. ||

NumFOCUS (http://www.numfocus.org/) is a 501(c)(3) nonprofit that supports and promotes world-class, innovative, open source scientific computing. The mission of NumFOCUS is to promote sustainable high-level programming languages, open code development, and reproducible scientific research.

PyData Slovakia & Bratislava Meetup #30 [Maryam Alimardani: Navigating the PhD]

When and Where

CountGD: Multi-Modal Open-World Counting

We propose CountGD, the first open-world counting model that can count any object specified by text only, visual examples only, or both together. CountGD extends the Grounding DINO architecture and adds components to enable specifying the object with visual examples. This new capability – being able to specify the target object through multiple modalities (text and exemplars) – leads to an improvement in counting accuracy. CountGD is powering multiple products and has been applied to problems across different domains, including counting large populations of penguins to monitor the influence of climate change, counting buildings from satellite images, and counting seals for conservation.
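CountGD itself isn't shown here; as a stand-in for text-specified counting, the sketch below runs OWL-ViT, an open-vocabulary detector available in Hugging Face transformers, and tallies detections above a confidence threshold. The model choice, image path, and threshold are assumptions:

```python
# Counting as "detect, threshold, tally" with an open-vocabulary detector.
# This illustrates text-specified counting in general, not CountGD's method.
import torch
from PIL import Image
from transformers import OwlViTForObjectDetection, OwlViTProcessor

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("colony.jpg")  # hypothetical aerial image
inputs = processor(text=[["a photo of a penguin"]], images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
detections = processor.post_process_object_detection(
    outputs, threshold=0.2, target_sizes=target_sizes
)[0]
print("count:", len(detections["boxes"]))
```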

About the Speaker

Niki Amini-Naieni is a DPhil student at the Visual Geometry Group (VGG), Oxford, supervised by Andrew Zisserman, focusing on developing foundation model capabilities for visual understanding of the open world. In the past, Niki has consulted with Amazon and other companies in robotics and computer vision, interned at SpaceX, and studied computer science and engineering at Cornell.

GorillaWatch: Advancing Gorilla Re-Identification and Population Monitoring with AI

Accurate monitoring of endangered gorilla populations is critical for conservation efforts in the field, where scientists currently rely on labor-intensive manual video labeling methods. The GorillaWatch project applies visual AI to provide robust re-identification of individual gorillas and to generate local population estimates from wildlife encounters.
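A common re-identification recipe, shown here as a generic sketch rather than the GorillaWatch implementation, is to match a query embedding against a gallery of known individuals and flag low-similarity queries as potential new animals. The embed function and names below are hypothetical placeholders:

```python
# Illustrative gallery-matching re-identification, not the GorillaWatch code.
import numpy as np

def embed(image) -> np.ndarray:
    """Hypothetical encoder mapping an image to an L2-normalized vector;
    in practice this would be a metric-learned network."""
    vec = np.random.default_rng(hash(image) % 2**32).standard_normal(128)
    return vec / np.linalg.norm(vec)

# Gallery of known individuals: name -> embedding of a reference image (toy names).
gallery = {name: embed(name) for name in ["Kivu", "Nyiragongo", "Virunga"]}

def identify(image, threshold: float = 0.6) -> str:
    """Assign the closest gallery identity, or flag a potential new individual."""
    query = embed(image)
    names = list(gallery)
    sims = np.array([query @ gallery[n] for n in names])  # cosine on unit vectors
    best = int(sims.argmax())
    return names[best] if sims[best] >= threshold else "unknown (possible new individual)"

print(identify("sighting_0421.jpg"))
```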

About the Speaker

Maximilian von Klinski is a Computer Science student at the Hasso-Plattner-Institut and is currently working on the GorillaWatch project alongside seven fellow students.

This Gets Under Your Skin – The Art of Skin Type Classification

Skin analysis is deceptively hard: inconsistent portrait quality, lighting variations, and the presence of sunscreen or makeup often obscure what’s truly “under the skin.” In this talk, I’ll share how we built an AI pipeline for skin type classification that tackles these real-world challenges with a combination of vision models. The architecture includes image quality control, facial segmentation, and a final classifier trained on curated dermatological features.
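The staged design the abstract describes (quality control, then facial segmentation, then classification) can be captured as a pipeline in which any stage may reject a sample. Every function below is a hypothetical placeholder, not Thea Care's models:

```python
# Sketch of a staged inference pipeline with per-stage rejection; all stage
# implementations are placeholders for real vision models.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Verdict:
    skin_type: Optional[str] = None
    rejected_because: Optional[str] = None

def quality_ok(image) -> bool:
    return True  # placeholder: check blur, exposure, resolution

def segment_face(image):
    return image  # placeholder: crop to facial skin regions

def classify_skin(face) -> str:
    return "combination"  # placeholder: classifier over dermatological features

def analyze(image) -> Verdict:
    if not quality_ok(image):
        return Verdict(rejected_because="low image quality")
    face = segment_face(image)
    if face is None:
        return Verdict(rejected_because="no face found")
    return Verdict(skin_type=classify_skin(face))

print(analyze("portrait.jpg"))
```

Making rejection an explicit, inspectable outcome (rather than forcing a classification) is what lets a pipeline like this cope with sunscreen, makeup, and poor lighting in real portraits.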

About the Speaker

Markus Hinsche is the co-founder and CTO of Thea Care, where he builds AI-powered skincare solutions at the intersection of health, beauty, and longevity. He holds a Master’s in Software Engineering from the Hasso Plattner Institute and brings a deep background in AI and product development.

A Spot Pattern Is like a Fingerprint: Jaguar Identification Project

The Jaguar Identification Project is a citizen science initiative actively engaging the public in conservation efforts in Porto Jofre, Brazil. This project increases awareness and provides an interesting and challenging dataset that requires the use of fine-grained visual classification algorithms. We use this rich dataset for dual purposes: teaching data-centric visual AI and directly contributing to conservation efforts for this vulnerable species.

Learn more: Jaguar Identification Project | Jaguar Conservation NGO in Brazil | Porto Jofre – Poconé, State of Mato Grosso, Brazil

About the Speaker

Antonio Rueda-Toicen, an AI Engineer in Berlin, has extensive experience in deploying machine learning models and has taught over 300 professionals. He is currently a Research Scientist at the Hasso Plattner Institute. Since 2019, he has organized the Berlin Computer Vision Group and taught at Berlin’s Data Science Retreat. He specializes in computer vision, cloud technologies, and machine learning. Antonio is also a certified instructor of deep learning and diffusion models in NVIDIA’s Deep Learning Institute.

May 22 - AI, ML and Computer Vision Meetup