talk-data.com talk-data.com

Filter by Source

Select conferences and events

People (156 results)

See all 156 →

Activities & events

Title & Speakers Event

Join our virtual Meetup to hear talks from experts on cutting-edge topics at the intersection of Visual AI and video use cases.

Time and Location

Feb 11, 2026 9 - 11 AM Pacific Online. Register for the Zoom!

VIDEOP2R: Video Understanding from Perception to Reasoning

Reinforcement fine-tuning (RFT), a two-stage framework consisting of supervised fine-tuning (SFT) and reinforcement learning (RL) has shown promising results on improving reasoning ability of large language models (LLMs). Yet extending RFT to large video language models (LVLMs) remains challenging. We propose VideoP2R, a novel process-aware video RFT framework that enhances video reasoning by modeling perception and reasoning as distinct processes. In the SFT stage, we develop a three-step pipeline to generate VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought (CoT) dataset for perception and reasoning.

In the RL stage, we introduce a novel process-aware group relative policy optimization (PA-GRPO) algorithm that supplies separate rewards for perception and reasoning. Extensive experiments show that VideoP2R achieves state-of-the-art (SotA) performance on six out of seven video reasoning and understanding benchmarks. Ablation studies further confirm the effectiveness of our process-aware modeling and PA-GRPO and demonstrate that model's perception output is information-sufficient for downstream reasoning.

About the Speaker

Yifan Jiang is a third-year Ph.D. student in the Information Science Institute at the University of Southern California (USC-ISI), advised by Dr. Jay Pujara, focusing on natural language processing, commonsense reasoning and multimodality large language models.

Layer-Aware Video Composition via Split-then-Merge

Split-then-Merge (StM) is a novel generative framework that overcomes data scarcity in video composition by splitting unlabeled videos into separate foreground and background layers for self-supervised learning. By utilizing a transformation-aware training pipeline with multi-layer fusion, the model learns to realistically compose dynamic subjects into diverse scenes without relying on expensive annotated datasets. This presentation will cover the problem of video composition and the details of StM, an approach looking at this problem from a generative AI perspective. We will conclude by demonstrating how StM is working, and outperforming state-of-the-art methods in both quantitative benchmarks and qualitative evaluations.

About the Speaker

Ozgur Kara is a 4th year Computer Science PhD student at the University of Illinois Urbana-Champaign (UIUC), advised by Founder Professor James M. Rehg. His research builds the next generation of video AI by tackling three core challenges: efficiency, controllability, and safety.

Video Reasoning for Worker Safety

Ensuring worker safety in industrial environments requires more than object detection or motion tracking; it demands a genuine understanding of human actions, context, and risk. This talk demonstrates how NVIDIA Cosmos Reason, a multimodal video-reasoning model, interprets workplace scenarios with sophisticated temporal and semantic awareness, identifying nuanced safe and unsafe behaviors that conventional vision systems frequently overlook.

By integrating Cosmos Reason with FiftyOne, users achieve both automated safety assessments and transparent, interpretable explanations revealing why specific actions are deemed hazardous. Using a curated worker-safety dataset of authentic factory-floor footage, we show how video reasoning enhances audits, training, and compliance workflows while minimizing dependence on extensive labeled datasets. The resulting system demonstrates the potential of explainable multimodal AI to enable safer, more informed decision-making across manufacturing, logistics, construction, healthcare, and other sectors where understanding human behavior is essential.

About the Speaker

Paula Ramos has a PhD in Computer Vision and Machine Learning, with more than 20 years of experience in the technological field. She has been developing novel integrated engineering technologies, mainly in Computer Vision, robotics, and Machine Learning applied to agriculture, since the early 2000s in Colombia.

Video Intelligence Is Going Agentic

Video content has become ubiquitous in our digital world, yet the tools for working with video have remained largely unchanged for decades. This talk explores how the convergence of foundation models and agent architectures is fundamentally transforming video interaction and creation. We'll examine how video-native foundation models, multimodal interfaces, and agent transparency are reshaping enterprise media workflows through a deep dive into Jockey, a pioneering video agent system.

About the Speaker

James Le currently leads the developer experience function at TwelveLabs - a startup building foundation models for video understanding. He previously operated in the MLOps space and ran a blog/podcast on the Data & AI infrastructure ecosystem.

Feb 11 - Visual AI for Video Use Cases

Join our virtual Meetup to hear talks from experts on cutting-edge topics at the intersection of Visual AI and video use cases.

Time and Location

Feb 11, 2026 9 - 11 AM Pacific Online. Register for the Zoom!

VIDEOP2R: Video Understanding from Perception to Reasoning

Reinforcement fine-tuning (RFT), a two-stage framework consisting of supervised fine-tuning (SFT) and reinforcement learning (RL) has shown promising results on improving reasoning ability of large language models (LLMs). Yet extending RFT to large video language models (LVLMs) remains challenging. We propose VideoP2R, a novel process-aware video RFT framework that enhances video reasoning by modeling perception and reasoning as distinct processes. In the SFT stage, we develop a three-step pipeline to generate VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought (CoT) dataset for perception and reasoning.

In the RL stage, we introduce a novel process-aware group relative policy optimization (PA-GRPO) algorithm that supplies separate rewards for perception and reasoning. Extensive experiments show that VideoP2R achieves state-of-the-art (SotA) performance on six out of seven video reasoning and understanding benchmarks. Ablation studies further confirm the effectiveness of our process-aware modeling and PA-GRPO and demonstrate that model's perception output is information-sufficient for downstream reasoning.

About the Speaker

Yifan Jiang is a third-year Ph.D. student in the Information Science Institute at the University of Southern California (USC-ISI), advised by Dr. Jay Pujara, focusing on natural language processing, commonsense reasoning and multimodality large language models.

Layer-Aware Video Composition via Split-then-Merge

Split-then-Merge (StM) is a novel generative framework that overcomes data scarcity in video composition by splitting unlabeled videos into separate foreground and background layers for self-supervised learning. By utilizing a transformation-aware training pipeline with multi-layer fusion, the model learns to realistically compose dynamic subjects into diverse scenes without relying on expensive annotated datasets. This presentation will cover the problem of video composition and the details of StM, an approach looking at this problem from a generative AI perspective. We will conclude by demonstrating how StM is working, and outperforming state-of-the-art methods in both quantitative benchmarks and qualitative evaluations.

About the Speaker

Ozgur Kara is a 4th year Computer Science PhD student at the University of Illinois Urbana-Champaign (UIUC), advised by Founder Professor James M. Rehg. His research builds the next generation of video AI by tackling three core challenges: efficiency, controllability, and safety.

Video Reasoning for Worker Safety

Ensuring worker safety in industrial environments requires more than object detection or motion tracking; it demands a genuine understanding of human actions, context, and risk. This talk demonstrates how NVIDIA Cosmos Reason, a multimodal video-reasoning model, interprets workplace scenarios with sophisticated temporal and semantic awareness, identifying nuanced safe and unsafe behaviors that conventional vision systems frequently overlook.

By integrating Cosmos Reason with FiftyOne, users achieve both automated safety assessments and transparent, interpretable explanations revealing why specific actions are deemed hazardous. Using a curated worker-safety dataset of authentic factory-floor footage, we show how video reasoning enhances audits, training, and compliance workflows while minimizing dependence on extensive labeled datasets. The resulting system demonstrates the potential of explainable multimodal AI to enable safer, more informed decision-making across manufacturing, logistics, construction, healthcare, and other sectors where understanding human behavior is essential.

About the Speaker

Paula Ramos has a PhD in Computer Vision and Machine Learning, with more than 20 years of experience in the technological field. She has been developing novel integrated engineering technologies, mainly in Computer Vision, robotics, and Machine Learning applied to agriculture, since the early 2000s in Colombia.

Video Intelligence Is Going Agentic

Video content has become ubiquitous in our digital world, yet the tools for working with video have remained largely unchanged for decades. This talk explores how the convergence of foundation models and agent architectures is fundamentally transforming video interaction and creation. We'll examine how video-native foundation models, multimodal interfaces, and agent transparency are reshaping enterprise media workflows through a deep dive into Jockey, a pioneering video agent system.

About the Speaker

James Le currently leads the developer experience function at TwelveLabs - a startup building foundation models for video understanding. He previously operated in the MLOps space and ran a blog/podcast on the Data & AI infrastructure ecosystem.

Feb 11 - Visual AI for Video Use Cases

Join our virtual Meetup to hear talks from experts on cutting-edge topics at the intersection of Visual AI and video use cases.

Time and Location

Feb 11, 2026 9 - 11 AM Pacific Online. Register for the Zoom!

VIDEOP2R: Video Understanding from Perception to Reasoning

Reinforcement fine-tuning (RFT), a two-stage framework consisting of supervised fine-tuning (SFT) and reinforcement learning (RL) has shown promising results on improving reasoning ability of large language models (LLMs). Yet extending RFT to large video language models (LVLMs) remains challenging. We propose VideoP2R, a novel process-aware video RFT framework that enhances video reasoning by modeling perception and reasoning as distinct processes. In the SFT stage, we develop a three-step pipeline to generate VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought (CoT) dataset for perception and reasoning.

In the RL stage, we introduce a novel process-aware group relative policy optimization (PA-GRPO) algorithm that supplies separate rewards for perception and reasoning. Extensive experiments show that VideoP2R achieves state-of-the-art (SotA) performance on six out of seven video reasoning and understanding benchmarks. Ablation studies further confirm the effectiveness of our process-aware modeling and PA-GRPO and demonstrate that model's perception output is information-sufficient for downstream reasoning.

About the Speaker

Yifan Jiang is a third-year Ph.D. student in the Information Science Institute at the University of Southern California (USC-ISI), advised by Dr. Jay Pujara, focusing on natural language processing, commonsense reasoning and multimodality large language models.

Layer-Aware Video Composition via Split-then-Merge

Split-then-Merge (StM) is a novel generative framework that overcomes data scarcity in video composition by splitting unlabeled videos into separate foreground and background layers for self-supervised learning. By utilizing a transformation-aware training pipeline with multi-layer fusion, the model learns to realistically compose dynamic subjects into diverse scenes without relying on expensive annotated datasets. This presentation will cover the problem of video composition and the details of StM, an approach looking at this problem from a generative AI perspective. We will conclude by demonstrating how StM is working, and outperforming state-of-the-art methods in both quantitative benchmarks and qualitative evaluations.

About the Speaker

Ozgur Kara is a 4th year Computer Science PhD student at the University of Illinois Urbana-Champaign (UIUC), advised by Founder Professor James M. Rehg. His research builds the next generation of video AI by tackling three core challenges: efficiency, controllability, and safety.

Video Reasoning for Worker Safety

Ensuring worker safety in industrial environments requires more than object detection or motion tracking; it demands a genuine understanding of human actions, context, and risk. This talk demonstrates how NVIDIA Cosmos Reason, a multimodal video-reasoning model, interprets workplace scenarios with sophisticated temporal and semantic awareness, identifying nuanced safe and unsafe behaviors that conventional vision systems frequently overlook.

By integrating Cosmos Reason with FiftyOne, users achieve both automated safety assessments and transparent, interpretable explanations revealing why specific actions are deemed hazardous. Using a curated worker-safety dataset of authentic factory-floor footage, we show how video reasoning enhances audits, training, and compliance workflows while minimizing dependence on extensive labeled datasets. The resulting system demonstrates the potential of explainable multimodal AI to enable safer, more informed decision-making across manufacturing, logistics, construction, healthcare, and other sectors where understanding human behavior is essential.

About the Speaker

Paula Ramos has a PhD in Computer Vision and Machine Learning, with more than 20 years of experience in the technological field. She has been developing novel integrated engineering technologies, mainly in Computer Vision, robotics, and Machine Learning applied to agriculture, since the early 2000s in Colombia.

Video Intelligence Is Going Agentic

Video content has become ubiquitous in our digital world, yet the tools for working with video have remained largely unchanged for decades. This talk explores how the convergence of foundation models and agent architectures is fundamentally transforming video interaction and creation. We'll examine how video-native foundation models, multimodal interfaces, and agent transparency are reshaping enterprise media workflows through a deep dive into Jockey, a pioneering video agent system.

About the Speaker

James Le currently leads the developer experience function at TwelveLabs - a startup building foundation models for video understanding. He previously operated in the MLOps space and ran a blog/podcast on the Data & AI infrastructure ecosystem.

Feb 11 - Visual AI for Video Use Cases

Join our virtual Meetup to hear talks from experts on cutting-edge topics at the intersection of Visual AI and video use cases.

Time and Location

Feb 11, 2026 9 - 11 AM Pacific Online. Register for the Zoom!

VIDEOP2R: Video Understanding from Perception to Reasoning

Reinforcement fine-tuning (RFT), a two-stage framework consisting of supervised fine-tuning (SFT) and reinforcement learning (RL) has shown promising results on improving reasoning ability of large language models (LLMs). Yet extending RFT to large video language models (LVLMs) remains challenging. We propose VideoP2R, a novel process-aware video RFT framework that enhances video reasoning by modeling perception and reasoning as distinct processes. In the SFT stage, we develop a three-step pipeline to generate VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought (CoT) dataset for perception and reasoning.

In the RL stage, we introduce a novel process-aware group relative policy optimization (PA-GRPO) algorithm that supplies separate rewards for perception and reasoning. Extensive experiments show that VideoP2R achieves state-of-the-art (SotA) performance on six out of seven video reasoning and understanding benchmarks. Ablation studies further confirm the effectiveness of our process-aware modeling and PA-GRPO and demonstrate that model's perception output is information-sufficient for downstream reasoning.

About the Speaker

Yifan Jiang is a third-year Ph.D. student in the Information Science Institute at the University of Southern California (USC-ISI), advised by Dr. Jay Pujara, focusing on natural language processing, commonsense reasoning and multimodality large language models.

Layer-Aware Video Composition via Split-then-Merge

Split-then-Merge (StM) is a novel generative framework that overcomes data scarcity in video composition by splitting unlabeled videos into separate foreground and background layers for self-supervised learning. By utilizing a transformation-aware training pipeline with multi-layer fusion, the model learns to realistically compose dynamic subjects into diverse scenes without relying on expensive annotated datasets. This presentation will cover the problem of video composition and the details of StM, an approach looking at this problem from a generative AI perspective. We will conclude by demonstrating how StM is working, and outperforming state-of-the-art methods in both quantitative benchmarks and qualitative evaluations.

About the Speaker

Ozgur Kara is a 4th year Computer Science PhD student at the University of Illinois Urbana-Champaign (UIUC), advised by Founder Professor James M. Rehg. His research builds the next generation of video AI by tackling three core challenges: efficiency, controllability, and safety.

Video Reasoning for Worker Safety

Ensuring worker safety in industrial environments requires more than object detection or motion tracking; it demands a genuine understanding of human actions, context, and risk. This talk demonstrates how NVIDIA Cosmos Reason, a multimodal video-reasoning model, interprets workplace scenarios with sophisticated temporal and semantic awareness, identifying nuanced safe and unsafe behaviors that conventional vision systems frequently overlook.

By integrating Cosmos Reason with FiftyOne, users achieve both automated safety assessments and transparent, interpretable explanations revealing why specific actions are deemed hazardous. Using a curated worker-safety dataset of authentic factory-floor footage, we show how video reasoning enhances audits, training, and compliance workflows while minimizing dependence on extensive labeled datasets. The resulting system demonstrates the potential of explainable multimodal AI to enable safer, more informed decision-making across manufacturing, logistics, construction, healthcare, and other sectors where understanding human behavior is essential.

About the Speaker

Paula Ramos has a PhD in Computer Vision and Machine Learning, with more than 20 years of experience in the technological field. She has been developing novel integrated engineering technologies, mainly in Computer Vision, robotics, and Machine Learning applied to agriculture, since the early 2000s in Colombia.

Video Intelligence Is Going Agentic

Video content has become ubiquitous in our digital world, yet the tools for working with video have remained largely unchanged for decades. This talk explores how the convergence of foundation models and agent architectures is fundamentally transforming video interaction and creation. We'll examine how video-native foundation models, multimodal interfaces, and agent transparency are reshaping enterprise media workflows through a deep dive into Jockey, a pioneering video agent system.

About the Speaker

James Le currently leads the developer experience function at TwelveLabs - a startup building foundation models for video understanding. He previously operated in the MLOps space and ran a blog/podcast on the Data & AI infrastructure ecosystem.

Feb 11 - Visual AI for Video Use Cases

Join our virtual Meetup to hear talks from experts on cutting-edge topics at the intersection of Visual AI and video use cases.

Time and Location

Feb 11, 2026 9 - 11 AM Pacific Online. Register for the Zoom!

VIDEOP2R: Video Understanding from Perception to Reasoning

Reinforcement fine-tuning (RFT), a two-stage framework consisting of supervised fine-tuning (SFT) and reinforcement learning (RL) has shown promising results on improving reasoning ability of large language models (LLMs). Yet extending RFT to large video language models (LVLMs) remains challenging. We propose VideoP2R, a novel process-aware video RFT framework that enhances video reasoning by modeling perception and reasoning as distinct processes. In the SFT stage, we develop a three-step pipeline to generate VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought (CoT) dataset for perception and reasoning.

In the RL stage, we introduce a novel process-aware group relative policy optimization (PA-GRPO) algorithm that supplies separate rewards for perception and reasoning. Extensive experiments show that VideoP2R achieves state-of-the-art (SotA) performance on six out of seven video reasoning and understanding benchmarks. Ablation studies further confirm the effectiveness of our process-aware modeling and PA-GRPO and demonstrate that model's perception output is information-sufficient for downstream reasoning.

About the Speaker

Yifan Jiang is a third-year Ph.D. student in the Information Science Institute at the University of Southern California (USC-ISI), advised by Dr. Jay Pujara, focusing on natural language processing, commonsense reasoning and multimodality large language models.

Layer-Aware Video Composition via Split-then-Merge

Split-then-Merge (StM) is a novel generative framework that overcomes data scarcity in video composition by splitting unlabeled videos into separate foreground and background layers for self-supervised learning. By utilizing a transformation-aware training pipeline with multi-layer fusion, the model learns to realistically compose dynamic subjects into diverse scenes without relying on expensive annotated datasets. This presentation will cover the problem of video composition and the details of StM, an approach looking at this problem from a generative AI perspective. We will conclude by demonstrating how StM is working, and outperforming state-of-the-art methods in both quantitative benchmarks and qualitative evaluations.

About the Speaker

Ozgur Kara is a 4th year Computer Science PhD student at the University of Illinois Urbana-Champaign (UIUC), advised by Founder Professor James M. Rehg. His research builds the next generation of video AI by tackling three core challenges: efficiency, controllability, and safety.

Video Reasoning for Worker Safety

Ensuring worker safety in industrial environments requires more than object detection or motion tracking; it demands a genuine understanding of human actions, context, and risk. This talk demonstrates how NVIDIA Cosmos Reason, a multimodal video-reasoning model, interprets workplace scenarios with sophisticated temporal and semantic awareness, identifying nuanced safe and unsafe behaviors that conventional vision systems frequently overlook.

By integrating Cosmos Reason with FiftyOne, users achieve both automated safety assessments and transparent, interpretable explanations revealing why specific actions are deemed hazardous. Using a curated worker-safety dataset of authentic factory-floor footage, we show how video reasoning enhances audits, training, and compliance workflows while minimizing dependence on extensive labeled datasets. The resulting system demonstrates the potential of explainable multimodal AI to enable safer, more informed decision-making across manufacturing, logistics, construction, healthcare, and other sectors where understanding human behavior is essential.

About the Speaker

Paula Ramos has a PhD in Computer Vision and Machine Learning, with more than 20 years of experience in the technological field. She has been developing novel integrated engineering technologies, mainly in Computer Vision, robotics, and Machine Learning applied to agriculture, since the early 2000s in Colombia.

Video Intelligence Is Going Agentic

Video content has become ubiquitous in our digital world, yet the tools for working with video have remained largely unchanged for decades. This talk explores how the convergence of foundation models and agent architectures is fundamentally transforming video interaction and creation. We'll examine how video-native foundation models, multimodal interfaces, and agent transparency are reshaping enterprise media workflows through a deep dive into Jockey, a pioneering video agent system.

About the Speaker

James Le currently leads the developer experience function at TwelveLabs - a startup building foundation models for video understanding. He previously operated in the MLOps space and ran a blog/podcast on the Data & AI infrastructure ecosystem.

Feb 11 - Visual AI for Video Use Cases

Join our virtual Meetup to hear talks from experts on cutting-edge topics at the intersection of Visual AI and video use cases.

Time and Location

Feb 11, 2026 9 - 11 AM Pacific Online. Register for the Zoom!

VIDEOP2R: Video Understanding from Perception to Reasoning

Reinforcement fine-tuning (RFT), a two-stage framework consisting of supervised fine-tuning (SFT) and reinforcement learning (RL) has shown promising results on improving reasoning ability of large language models (LLMs). Yet extending RFT to large video language models (LVLMs) remains challenging. We propose VideoP2R, a novel process-aware video RFT framework that enhances video reasoning by modeling perception and reasoning as distinct processes. In the SFT stage, we develop a three-step pipeline to generate VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought (CoT) dataset for perception and reasoning.

In the RL stage, we introduce a novel process-aware group relative policy optimization (PA-GRPO) algorithm that supplies separate rewards for perception and reasoning. Extensive experiments show that VideoP2R achieves state-of-the-art (SotA) performance on six out of seven video reasoning and understanding benchmarks. Ablation studies further confirm the effectiveness of our process-aware modeling and PA-GRPO and demonstrate that model's perception output is information-sufficient for downstream reasoning.

About the Speaker

Yifan Jiang is a third-year Ph.D. student in the Information Science Institute at the University of Southern California (USC-ISI), advised by Dr. Jay Pujara, focusing on natural language processing, commonsense reasoning and multimodality large language models.

Layer-Aware Video Composition via Split-then-Merge

Split-then-Merge (StM) is a novel generative framework that overcomes data scarcity in video composition by splitting unlabeled videos into separate foreground and background layers for self-supervised learning. By utilizing a transformation-aware training pipeline with multi-layer fusion, the model learns to realistically compose dynamic subjects into diverse scenes without relying on expensive annotated datasets. This presentation will cover the problem of video composition and the details of StM, an approach looking at this problem from a generative AI perspective. We will conclude by demonstrating how StM is working, and outperforming state-of-the-art methods in both quantitative benchmarks and qualitative evaluations.

About the Speaker

Ozgur Kara is a 4th year Computer Science PhD student at the University of Illinois Urbana-Champaign (UIUC), advised by Founder Professor James M. Rehg. His research builds the next generation of video AI by tackling three core challenges: efficiency, controllability, and safety.

Video Reasoning for Worker Safety

Ensuring worker safety in industrial environments requires more than object detection or motion tracking; it demands a genuine understanding of human actions, context, and risk. This talk demonstrates how NVIDIA Cosmos Reason, a multimodal video-reasoning model, interprets workplace scenarios with sophisticated temporal and semantic awareness, identifying nuanced safe and unsafe behaviors that conventional vision systems frequently overlook.

By integrating Cosmos Reason with FiftyOne, users achieve both automated safety assessments and transparent, interpretable explanations revealing why specific actions are deemed hazardous. Using a curated worker-safety dataset of authentic factory-floor footage, we show how video reasoning enhances audits, training, and compliance workflows while minimizing dependence on extensive labeled datasets. The resulting system demonstrates the potential of explainable multimodal AI to enable safer, more informed decision-making across manufacturing, logistics, construction, healthcare, and other sectors where understanding human behavior is essential.

About the Speaker

Paula Ramos has a PhD in Computer Vision and Machine Learning, with more than 20 years of experience in the technological field. She has been developing novel integrated engineering technologies, mainly in Computer Vision, robotics, and Machine Learning applied to agriculture, since the early 2000s in Colombia.

Video Intelligence Is Going Agentic

Video content has become ubiquitous in our digital world, yet the tools for working with video have remained largely unchanged for decades. This talk explores how the convergence of foundation models and agent architectures is fundamentally transforming video interaction and creation. We'll examine how video-native foundation models, multimodal interfaces, and agent transparency are reshaping enterprise media workflows through a deep dive into Jockey, a pioneering video agent system.

About the Speaker

James Le currently leads the developer experience function at TwelveLabs - a startup building foundation models for video understanding. He previously operated in the MLOps space and ran a blog/podcast on the Data & AI infrastructure ecosystem.

Feb 11 - Visual AI for Video Use Cases

Join our virtual Meetup to hear talks from experts on cutting-edge topics at the intersection of Visual AI and video use cases.

Time and Location

Feb 11, 2026 9 - 11 AM Pacific Online. Register for the Zoom!

VIDEOP2R: Video Understanding from Perception to Reasoning

Reinforcement fine-tuning (RFT), a two-stage framework consisting of supervised fine-tuning (SFT) and reinforcement learning (RL) has shown promising results on improving reasoning ability of large language models (LLMs). Yet extending RFT to large video language models (LVLMs) remains challenging. We propose VideoP2R, a novel process-aware video RFT framework that enhances video reasoning by modeling perception and reasoning as distinct processes. In the SFT stage, we develop a three-step pipeline to generate VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought (CoT) dataset for perception and reasoning.

In the RL stage, we introduce a novel process-aware group relative policy optimization (PA-GRPO) algorithm that supplies separate rewards for perception and reasoning. Extensive experiments show that VideoP2R achieves state-of-the-art (SotA) performance on six out of seven video reasoning and understanding benchmarks. Ablation studies further confirm the effectiveness of our process-aware modeling and PA-GRPO and demonstrate that model's perception output is information-sufficient for downstream reasoning.

About the Speaker

Yifan Jiang is a third-year Ph.D. student in the Information Science Institute at the University of Southern California (USC-ISI), advised by Dr. Jay Pujara, focusing on natural language processing, commonsense reasoning and multimodality large language models.

Layer-Aware Video Composition via Split-then-Merge

Split-then-Merge (StM) is a novel generative framework that overcomes data scarcity in video composition by splitting unlabeled videos into separate foreground and background layers for self-supervised learning. By utilizing a transformation-aware training pipeline with multi-layer fusion, the model learns to realistically compose dynamic subjects into diverse scenes without relying on expensive annotated datasets. This presentation will cover the problem of video composition and the details of StM, an approach looking at this problem from a generative AI perspective. We will conclude by demonstrating how StM is working, and outperforming state-of-the-art methods in both quantitative benchmarks and qualitative evaluations.

About the Speaker

Ozgur Kara is a 4th year Computer Science PhD student at the University of Illinois Urbana-Champaign (UIUC), advised by Founder Professor James M. Rehg. His research builds the next generation of video AI by tackling three core challenges: efficiency, controllability, and safety.

Video Reasoning for Worker Safety

Ensuring worker safety in industrial environments requires more than object detection or motion tracking; it demands a genuine understanding of human actions, context, and risk. This talk demonstrates how NVIDIA Cosmos Reason, a multimodal video-reasoning model, interprets workplace scenarios with sophisticated temporal and semantic awareness, identifying nuanced safe and unsafe behaviors that conventional vision systems frequently overlook.

By integrating Cosmos Reason with FiftyOne, users achieve both automated safety assessments and transparent, interpretable explanations revealing why specific actions are deemed hazardous. Using a curated worker-safety dataset of authentic factory-floor footage, we show how video reasoning enhances audits, training, and compliance workflows while minimizing dependence on extensive labeled datasets. The resulting system demonstrates the potential of explainable multimodal AI to enable safer, more informed decision-making across manufacturing, logistics, construction, healthcare, and other sectors where understanding human behavior is essential.

About the Speaker

Paula Ramos has a PhD in Computer Vision and Machine Learning, with more than 20 years of experience in the technological field. She has been developing novel integrated engineering technologies, mainly in Computer Vision, robotics, and Machine Learning applied to agriculture, since the early 2000s in Colombia.

Video Intelligence Is Going Agentic

Video content has become ubiquitous in our digital world, yet the tools for working with video have remained largely unchanged for decades. This talk explores how the convergence of foundation models and agent architectures is fundamentally transforming video interaction and creation. We'll examine how video-native foundation models, multimodal interfaces, and agent transparency are reshaping enterprise media workflows through a deep dive into Jockey, a pioneering video agent system.

About the Speaker

James Le currently leads the developer experience function at TwelveLabs - a startup building foundation models for video understanding. He previously operated in the MLOps space and ran a blog/podcast on the Data & AI infrastructure ecosystem.

Feb 11 - Visual AI for Video Use Cases

Join our virtual Meetup to hear talks from experts on cutting-edge topics at the intersection of Visual AI and video use cases.

Time and Location

Feb 11, 2026 9 - 11 AM Pacific Online. Register for the Zoom!

VIDEOP2R: Video Understanding from Perception to Reasoning

Reinforcement fine-tuning (RFT), a two-stage framework consisting of supervised fine-tuning (SFT) and reinforcement learning (RL) has shown promising results on improving reasoning ability of large language models (LLMs). Yet extending RFT to large video language models (LVLMs) remains challenging. We propose VideoP2R, a novel process-aware video RFT framework that enhances video reasoning by modeling perception and reasoning as distinct processes. In the SFT stage, we develop a three-step pipeline to generate VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought (CoT) dataset for perception and reasoning.

In the RL stage, we introduce a novel process-aware group relative policy optimization (PA-GRPO) algorithm that supplies separate rewards for perception and reasoning. Extensive experiments show that VideoP2R achieves state-of-the-art (SotA) performance on six out of seven video reasoning and understanding benchmarks. Ablation studies further confirm the effectiveness of our process-aware modeling and PA-GRPO and demonstrate that model's perception output is information-sufficient for downstream reasoning.

About the Speaker

Yifan Jiang is a third-year Ph.D. student in the Information Science Institute at the University of Southern California (USC-ISI), advised by Dr. Jay Pujara, focusing on natural language processing, commonsense reasoning and multimodality large language models.

Layer-Aware Video Composition via Split-then-Merge

Split-then-Merge (StM) is a novel generative framework that overcomes data scarcity in video composition by splitting unlabeled videos into separate foreground and background layers for self-supervised learning. By utilizing a transformation-aware training pipeline with multi-layer fusion, the model learns to realistically compose dynamic subjects into diverse scenes without relying on expensive annotated datasets. This presentation will cover the problem of video composition and the details of StM, an approach looking at this problem from a generative AI perspective. We will conclude by demonstrating how StM is working, and outperforming state-of-the-art methods in both quantitative benchmarks and qualitative evaluations.

About the Speaker

Ozgur Kara is a 4th year Computer Science PhD student at the University of Illinois Urbana-Champaign (UIUC), advised by Founder Professor James M. Rehg. His research builds the next generation of video AI by tackling three core challenges: efficiency, controllability, and safety.

Video Reasoning for Worker Safety

Ensuring worker safety in industrial environments requires more than object detection or motion tracking; it demands a genuine understanding of human actions, context, and risk. This talk demonstrates how NVIDIA Cosmos Reason, a multimodal video-reasoning model, interprets workplace scenarios with sophisticated temporal and semantic awareness, identifying nuanced safe and unsafe behaviors that conventional vision systems frequently overlook.

By integrating Cosmos Reason with FiftyOne, users achieve both automated safety assessments and transparent, interpretable explanations revealing why specific actions are deemed hazardous. Using a curated worker-safety dataset of authentic factory-floor footage, we show how video reasoning enhances audits, training, and compliance workflows while minimizing dependence on extensive labeled datasets. The resulting system demonstrates the potential of explainable multimodal AI to enable safer, more informed decision-making across manufacturing, logistics, construction, healthcare, and other sectors where understanding human behavior is essential.

About the Speaker

Paula Ramos has a PhD in Computer Vision and Machine Learning, with more than 20 years of experience in the technological field. She has been developing novel integrated engineering technologies, mainly in Computer Vision, robotics, and Machine Learning applied to agriculture, since the early 2000s in Colombia.

Video Intelligence Is Going Agentic

Video content has become ubiquitous in our digital world, yet the tools for working with video have remained largely unchanged for decades. This talk explores how the convergence of foundation models and agent architectures is fundamentally transforming video interaction and creation. We'll examine how video-native foundation models, multimodal interfaces, and agent transparency are reshaping enterprise media workflows through a deep dive into Jockey, a pioneering video agent system.

About the Speaker

James Le currently leads the developer experience function at TwelveLabs - a startup building foundation models for video understanding. He previously operated in the MLOps space and ran a blog/podcast on the Data & AI infrastructure ecosystem.

Feb 11 - Visual AI for Video Use Cases

Notre soif de puissance de calcul et notre appétit pour les données génèrent une hausse significative des coûts informatiques. Ils contribuent également largement à la hausse des émissions de carbone, qui épuisent nos réserves en électricité et en eau potable. Alors que l’usage du cloud et de l’IA ne cesse de croître, bon nombre d’entre nous mettent en place des solutions pour que cet engouement s’opère tout en préservant notre planète. Rejoignez-nous et vos pairs de la communauté Sustainable Cloud lors de notre prochain Meetup à Paris, organisé par IBM. Un programme chargé qui comprend des intervenants experts et des sessions interactives en petits groupes. Examinons ensemble de près comment nous pouvons utiliser les stratégies FinOps et GreenOps pour rendre l’IT rentable ET durable.

16:30 Accueil 17:00 Bienvenue & Introduction (Mike Bradbury, Antoine Lagier) 17:10 Keynote - Arkema; notre parcours FinOps (Philippe le Blevec) 17:20 Débats éclairs - Samuel Rince\, Algune - Francois Moriamez\, Thales - Carole Crumbach\, Orwell Consulting - James Hall\, GreenPixie - Alex Long\, Apptio 18:00 Tables rondes 18:15 Cocktails & Networking 19:00 Fin

Avis de confidentialité: En vous inscrivant à ce meetup vous consentez au traitement de vos informations personnelles à des fins de gestion et de communication de l'événement. Vos données pourront être partagées avec l'organisateur et les sponsors, uniquement si nécessaire pour le bon déroulement de l'événement. Vous disposez des droits de retrait, d'accès, de correction et de suppression de vos données personnelles. Si vous souhaitez vous désinscrire ou si vous avez des questions concernant l'utilisation de vos données, veuillez nous contacter à [email protected]

FinOps and GreenOps: Pour une Informatique plus Durable et plus Rentable
James Le – Head of Developer Experience @ Twelve Labs

The evolution of video understanding has followed a similar trajectory to language and image understanding - with the rise of large pre-trained foundation models trained on a huge amount of data. Given the surge of multimodal research lately, video foundation models are becoming even more powerful to decipher the rich visual information embedded in videos. This talk will explore diverse use cases of video understanding and provide a glimpse of Twelve Labs offerings.

Chenliang Xu – Associate Professor @ University of Rochester

Recent works find that AI algorithms learn biases from data. Therefore, it is urgent and vital to identify biases in AI algorithms. However, previous bias identification methods overly rely on human experts to conjecture potential biases, which may neglect other underlying biases not realized by humans. Is there an automatic way to assist human experts in finding biases in a broad domain of image classifiers? In this talk, I will introduce solutions.

AI/ML
Luka Posilović – Head of Machine Learning @ Kitro

1/3 of all food gets wasted, with millions of tons of food being thrown away each day. Food does not mean the same thing everywhere in the world, there are thousands of different meals across the world, therefore a lot of different classes to distinguish between. In this talk we’ll see through challenges of food-waste classification and see how foundation models can be useful to this task. We will also explore how we use FiftyOne to test models during development.

AI/ML
Safwan Wshah – Associate Professor @ University of Vermont

Localizing images and objects from visual information stands out as one of the most challenging and dynamic topics in computer vision, owing to its broad applications across different domains. In this talk, we will introduce and delve into several research directions aimed at advancing solutions to these complex problems.

Chenliang Xu – Associate Professor @ University of Rochester

Recent works find that AI algorithms learn biases from data. Therefore, it is urgent and vital to identify biases in AI algorithms. However, previous bias identification methods overly rely on human experts to conjecture potential biases, which may neglect other underlying biases not realized by humans. Is there an automatic way to assist human experts in finding biases in a broad domain of image classifiers? In this talk, I will introduce solutions.

AI/ML
Safwan Wshah – Associate Professor @ University of Vermont

Localizing images and objects from visual information stands out as one of the most challenging and dynamic topics in computer vision, owing to its broad applications across different domains. In this talk, we will introduce and delve into several research directions aimed at advancing solutions to these complex problems.

Luka Posilović – Head of Machine Learning @ Kitro

1/3 of all food gets wasted, with millions of tons of food being thrown away each day. Food does not mean the same thing everywhere in the world, there are thousands of different meals across the world, therefore a lot of different classes to distinguish between. In this talk we’ll see through challenges of food-waste classification and see how foundation models can be useful to this task. We will also explore how we use FiftyOne to test models during development.

AI/ML

When Feb 15, 2024 – 10:00 AM Pacific

Where Virtual / Zoom - https://voxel51.com/computer-vision-events/feb-2024-ai-machine-learning-data-science-meetup/

Agenda

Lightning Talk: The Next Generation of Video Understanding with Twelve Labs

The evolution of video understanding has followed a similar trajectory to language and image understanding - with the rise of large pre-trained foundation models trained on a huge amount of data. Given the surge of multimodal research lately, video foundation models are becoming even more powerful to decipher the rich visual information embedded in videos. This talk will explore diverse use cases of video understanding and provide a glimpse of Twelve Labs offerings.

Speaker: James Le is the Head of Developer Experience at Twelve Labs, a startup building multimodal foundation models for video understanding.

Towards Fair Computer Vision: Discover the Hidden Biases of an Image Classifier

Recent works find that AI algorithms learn biases from data. Therefore, it is urgent and vital to identify biases in AI algorithms. However, previous bias identification methods overly rely on human experts to conjecture potential biases, which may neglect other underlying biases not realized by humans. Is there an automatic way to assist human experts in finding biases in a broad domain of image classifiers? In this talk, I will introduce solutions.

Speaker: Chenliang Xu is an Associate Professor in the Department of Computer Science at the University of Rochester. His research originates in computer vision and tackles interdisciplinary topics, including video understanding, audio-visual learning, vision and language, and methods for trustworthy AI. He has authored over 90 peer-reviewed papers in computer vision, machine learning, multimedia, and AI venues.

Food Waste Classification with AI

1/3 of all food gets wasted, with millions of tons of food being thrown away each day. Food does not mean the same thing everywhere in the world, there are thousands of different meals across the world, therefore a lot of different classes to distinguish between. In this talk we’ll see through challenges of food-waste classification and see how foundation models can be useful to this task. We will also explore how we use FiftyOne to test models during development.

Speaker: Luka Posilović is a computer scientist with a PhD from FER, Zagreb, Croatia, working as a Head of machine learning in Kitro. Him and the team are trying to reduce the global food waste problem by using AI.

Objects and Image Geo-localization from Visual Data

Localizing images and objects from visual information stands out as one of the most challenging and dynamic topics in computer vision, owing to its broad applications across different domains. In this talk, we will introduce and delve into several research directions aimed at advancing solutions to these complex problems.

Speaker: Safwan Wshah is an Associate Professor in the Department of Computer Science at the University of Vermont. His research interests encompass the intersection of machine learning theory and application, with a particular emphasis on geo-localization from visual information. Additionally, he maintains broader interests in deep learning, computer vision, data analytics, and image processing.

Don’t Forget

  • Voxel51 will make a donation on behalf of the Meetup members to the charity that gets the most votes this month.
  • Can’t make the date and time? No problem! Just make sure to register here so we can send you links to the playbacks.
Feb 2024 – AI, Machine Learning & Data Science Meetup

When Feb 15, 2024 – 10:00 AM Pacific

Where Virtual / Zoom - https://voxel51.com/computer-vision-events/feb-2024-ai-machine-learning-data-science-meetup/

Agenda

Lightning Talk: The Next Generation of Video Understanding with Twelve Labs

The evolution of video understanding has followed a similar trajectory to language and image understanding - with the rise of large pre-trained foundation models trained on a huge amount of data. Given the surge of multimodal research lately, video foundation models are becoming even more powerful to decipher the rich visual information embedded in videos. This talk will explore diverse use cases of video understanding and provide a glimpse of Twelve Labs offerings.

Speaker: James Le is the Head of Developer Experience at Twelve Labs, a startup building multimodal foundation models for video understanding.

Towards Fair Computer Vision: Discover the Hidden Biases of an Image Classifier

Recent works find that AI algorithms learn biases from data. Therefore, it is urgent and vital to identify biases in AI algorithms. However, previous bias identification methods overly rely on human experts to conjecture potential biases, which may neglect other underlying biases not realized by humans. Is there an automatic way to assist human experts in finding biases in a broad domain of image classifiers? In this talk, I will introduce solutions.

Speaker: Chenliang Xu is an Associate Professor in the Department of Computer Science at the University of Rochester. His research originates in computer vision and tackles interdisciplinary topics, including video understanding, audio-visual learning, vision and language, and methods for trustworthy AI. He has authored over 90 peer-reviewed papers in computer vision, machine learning, multimedia, and AI venues.

Food Waste Classification with AI

1/3 of all food gets wasted, with millions of tons of food being thrown away each day. Food does not mean the same thing everywhere in the world, there are thousands of different meals across the world, therefore a lot of different classes to distinguish between. In this talk we’ll see through challenges of food-waste classification and see how foundation models can be useful to this task. We will also explore how we use FiftyOne to test models during development.

Speaker: Luka Posilović is a computer scientist with a PhD from FER, Zagreb, Croatia, working as a Head of machine learning in Kitro. Him and the team are trying to reduce the global food waste problem by using AI.

Objects and Image Geo-localization from Visual Data

Localizing images and objects from visual information stands out as one of the most challenging and dynamic topics in computer vision, owing to its broad applications across different domains. In this talk, we will introduce and delve into several research directions aimed at advancing solutions to these complex problems.

Speaker: Safwan Wshah is an Associate Professor in the Department of Computer Science at the University of Vermont. His research interests encompass the intersection of machine learning theory and application, with a particular emphasis on geo-localization from visual information. Additionally, he maintains broader interests in deep learning, computer vision, data analytics, and image processing.

Don’t Forget

  • Voxel51 will make a donation on behalf of the Meetup members to the charity that gets the most votes this month.
  • Can’t make the date and time? No problem! Just make sure to register here so we can send you links to the playbacks.
Feb 2024 – AI, Machine Learning & Data Science Meetup
James Le – Head of Developer Experience @ Twelve Labs

The evolution of video understanding has followed a similar trajectory to language and image understanding - with the rise of large pre-trained foundation models trained on a huge amount of data. Given the surge of multimodal research lately, video foundation models are becoming even more powerful to decipher the rich visual information embedded in videos. This talk will explore diverse use cases of video understanding and provide a glimpse of Twelve Labs offerings.

Feb 2024 – AI, Machine Learning & Data Science Meetup
PyData Bristol - 24th Meetup 2023-06-15 · 17:00

Join us once again for the next PyData Bristol Meetup! The PyData community in Bristol continues to thrive with engaging talks, insightful lightning sessions, and lively networking. A massive thank you to our generous hosts, Cookpad, for supplying the venue, Adlib for the scrumptious pizza, and thirst-quenching refreshments, along with additional sponsorship.

Agenda for the evening: 🚪 6:00 pm - Doors open 🕡 6:30 pm - Talks commence (sharp!)

📚 Two 25-minute talks:

  • Talk 1: Hiring for diversity - Dr Elena Hensinger-Schien
  • Talk 2: DBT - Data Build Tools - James Yarrow

📚 One 5-minute lightning talks:

  • Lightning Talk 1: Signal Processing with CNN - Michael Roberts
  • Managing a million tiles to (sort of) cure cancer - James Leech

📢 Community announcements 🤝 Relaxed networking over beers and soft drinks

Interested in sharing your knowledge or experience at this or a future event? Fill out this form to submit your talk proposal: https://goo.gl/forms/8lsz1WA1986Ahbbs1 We look forward to seeing you there for another fantastic evening of Python, Data Science, and camaraderie!

Talks

1. Introduction to DBT - James Yarrow

DBT or Data Build Tools is a python library with a huge open source community and so much more. DBT focuses on the Transformation in ELT and is used by analytics engineers, data engineers and data scientist. It can be used to managed models in SQL server with great python extensibility and governance features.

2. Inviting diversity when hiring for data jobs - Dr Elena Hensinger-Schien

When looking to build a diverse team, the job description, the interview task and the overall communication experience can be either inviting or off-putting for candidates. This talk will share some best practice, such as inclusive language and a critical assessment of our interview goal and approach, that will be beneficial not only to candidates from underrepresented groups within data, but all candidates.

Lightning Talks

1. Signal Processing with CNN - Michael Roberts

This talk will give a brief overview of a method to apply Convolutional Neural Networks to analysis of audio data through the use of the Discrete Fourier Transform. This approach is used in the current state-of-the-art algorithm for musical segment boundary detection, the subject of my MSc thesis.

2. Managing a million tiles to (sort of) cure cancer - James Leech

Medical computer vision is complicated because biopsy scans are huge. Think of it as a satellite map analysis tool, where you are trying to read the road marking to determine what country you’re in. It’s a subclass of image analysis called multiple instance learning. In this lightning talk I want to introduce the specific challenges associated with medical computer vision, some solutions I’ve learned thus far, and some gaps in current technology you could fill to make yourself rich.

🕖 LOGISTICS Talks kick off at 18:30 sharp; then networking in The Knights Templar from 20:40. If you realise you can't make it, please un-RSVP in good time to free up your place for your fellow community members. Follow @pydatabristol (https://twitter.com/pydatabristol) for updates on this and future events, as well as news from the global PyData community.

📜 CODE OF CONDUCT The PyData Code of Conduct governs this meetup (https://pydata.org/code-of-conduct/). To discuss any issues or concerns relating to the code of conduct or behaviour of anyone at the PyData meetup, please contact the PyData Bristol organisers, or you can submit a report of any potential Code of Conduct violation directly to NumFOCUS (https://numfocus.typeform.com/to/ynjGdT).

PyData Bristol - 24th Meetup