Join our virtual Meetup to hear talks from experts on cutting-edge topics across AI, ML, and computer vision.

Feb 5, 2026, 9 - 11 AM Pacific, online. Register for the Zoom!

Unlocking Visual Anomaly Detection: Navigating Challenges and Pioneering with Vision-Language Models

Visual anomaly detection (VAD) is pivotal for ensuring quality in manufacturing, medical imaging, and safety inspections, yet it continues to face challenges such as data scarcity, domain shifts, and the need for precise localization and reasoning. This seminar explores VAD fundamentals, core challenges, and recent advancements leveraging vision-language models and multimodal large language models (MLLMs). We contrast CLIP-based methods for efficient zero/few-shot detection with MLLM-driven reasoning for explainable, threshold-free outcomes. Drawing from recent studies, we highlight emerging trends, benchmarks, and future directions toward building adaptable, real-world VAD systems. This talk is designed for researchers and practitioners interested in AI-driven inspection and next-generation multimodal approaches.
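As a rough illustration of the CLIP-based zero-shot idea mentioned above, the sketch below scores an image by comparing its embedding with text-prompt embeddings for a "normal" and an "anomalous" state. The vectors are toy values standing in for real model embeddings, and the function name, prompt pairing, and temperature are assumptions made for illustration, not the speaker's method.

```python
import math


def anomaly_score(image_emb, normal_emb, anomalous_emb, temperature=0.07):
    """Softmax over two cosine similarities; returns P(anomalous)."""

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b)

    s_normal = cosine(image_emb, normal_emb) / temperature
    s_anom = cosine(image_emb, anomalous_emb) / temperature
    # Subtract the max before exponentiating for numerical stability.
    m = max(s_normal, s_anom)
    e_normal = math.exp(s_normal - m)
    e_anom = math.exp(s_anom - m)
    return e_anom / (e_normal + e_anom)


# Hypothetical toy embeddings (a real pipeline would get these from a
# vision-language model's image and text encoders).
normal_text = [1.0, 0.0, 0.0]
anomalous_text = [0.0, 1.0, 0.0]
good_part = [0.9, 0.1, 0.0]
scratched_part = [0.2, 0.8, 0.1]

print(anomaly_score(good_part, normal_text, anomalous_text))
print(anomaly_score(scratched_part, normal_text, anomalous_text))
```

A score like this still needs a decision threshold, which is one point of contrast with the MLLM-driven reasoning the abstract describes as threshold-free.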

About the Speaker

Hossein Kashiani is a fourth-year Ph.D. student at Clemson University. His research focuses on developing generalizable and trustworthy AI systems, with publications in top venues such as CVPR, WACV, ICIP, IJCB, and TBIOM. His work spans diverse applications, including anomaly detection, media forensics, biometrics, healthcare, and visual perception.

Data-Centric Lessons To Improve Speech-Language Pretraining

Spoken Question-Answering (SQA) is a core capability for useful and interactive artificial intelligence systems. Recently, several speech-language models (SpeechLMs) have been released with a specific focus on improving their SQA performance. However, a lack of controlled ablations of pretraining data processing and curation makes it challenging to understand what factors account for performance, despite substantial gains from similar studies in other data modalities. In this work, we address this gap by conducting a data-centric exploration for pretraining SpeechLMs.

We focus on three research questions fundamental to speech-language pretraining data:

  • How to process raw web-crawled audio content for speech-text pretraining;
  • How to construct synthetic pretraining datasets to augment web-crawled data;
  • How to interleave (text, audio) segments into training sequences.
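To make the third question concrete, here is a minimal sketch of one possible interleaving scheme: for a list of aligned segments, the training sequence alternates between the text tokens of one segment and the audio tokens of the next. The segment layout, token placeholders, and the strict alternation rule are all assumptions for illustration; the abstract does not say which scheme the authors use.

```python
def interleave(segments, start_with="text"):
    """Alternate modality per aligned segment: text, audio, text, ..."""
    sequence = []
    modality = start_with
    for seg in segments:
        # Keep only one modality for this segment, tagged with its type.
        sequence.append((modality, seg[modality]))
        modality = "audio" if modality == "text" else "text"
    return sequence


# Hypothetical aligned segments; "<a1>" etc. stand in for audio tokens.
segments = [
    {"text": ["hello"], "audio": ["<a1>", "<a2>"]},
    {"text": ["world"], "audio": ["<a3>"]},
    {"text": ["again"], "audio": ["<a4>", "<a5>"]},
]

print(interleave(segments))
```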

We apply the insights from our controlled data-centric ablations to pretrain a 3.8B-parameter SpeechLM, called SpeLangy, that outperforms models that are up to 3x larger by 10.2% absolute performance. We hope our findings highlight the impact of effective data curation for speech-language pretraining and guide future data-centric exploration in SpeechLMs.

About the Speaker

Vishaal Udandarao is a third-year ELLIS PhD student, jointly working with Matthias Bethge at The University of Tuebingen and Samuel Albanie at The University of Cambridge/Google DeepMind. He is also a part of the International Max Planck Research School for Intelligent Systems. He is mainly interested in understanding the generalisation properties of foundation models, both vision-language models (VLMs) and large multi-modal models (LMMs), through the lens of their pre-training and test data distributions. His research is funded by a Google PhD Fellowship in Machine Intelligence.

A Practical Pipeline for Synthetic Data with Nano Banana Pro + FiftyOne

Most computer-vision failures come from the rare cases: the dark corners, odd combinations, and edge conditions we never capture enough of in real datasets. In this session, we walk through a practical end-to-end pipeline for generating targeted synthetic data using Google’s Nano Banana Pro and managing it with FiftyOne. We’ll explore how to translate dataset gaps into generation prompts, create thousands of high-quality synthetic images, automatically enrich them with metadata, and bring everything into FiftyOne for inspection, filtering, and validation. By the end, you’ll understand how to build a repeatable synthetic-first workflow that closes real vision gaps and improves model performance on the scenarios that matter most.
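The step of translating dataset gaps into generation prompts could look roughly like the sketch below. It counts attribute combinations in existing sample metadata, flags the under-represented ones, and turns each into a text prompt. The attribute names, the `min_count` cutoff, and the prompt template are hypothetical, and no image-generation or FiftyOne call is made here; this only illustrates the gap-finding idea.

```python
from collections import Counter
from itertools import product


def find_gaps(samples, attributes, min_count=2):
    """Return attribute combinations seen fewer than min_count times."""
    counts = Counter(tuple(s[a] for a in attributes) for s in samples)
    values = [sorted({s[a] for s in samples}) for a in attributes]
    return [combo for combo in product(*values) if counts[combo] < min_count]


def gap_to_prompt(combo, attributes):
    """Turn one under-represented combination into a generation prompt."""
    details = ", ".join(f"{a}: {v}" for a, v in zip(attributes, combo))
    return f"A realistic street photo ({details})"


# Hypothetical metadata for an existing real dataset.
samples = [
    {"lighting": "day", "weather": "clear"},
    {"lighting": "day", "weather": "clear"},
    {"lighting": "day", "weather": "rain"},
    {"lighting": "night", "weather": "clear"},
    {"lighting": "night", "weather": "clear"},
]
attributes = ["lighting", "weather"]

gaps = find_gaps(samples, attributes)
prompts = [gap_to_prompt(g, attributes) for g in gaps]
for p in prompts:
    print(p)
```

Storing the originating combination alongside each generated image would make it straightforward to filter and validate synthetic samples by gap later on.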

About the Speaker

Adonai Vera - Machine Learning Engineer & DevRel at Voxel51, with over 7 years of experience building computer vision and machine learning models using TensorFlow, Docker, and OpenCV. I started as a software developer, moved into AI, led teams, and served as CTO. Today, I connect code and community to build open, production-ready AI, making technology simple, accessible, and reliable.

Making Computer Vision Models Faster: An Introduction to TensorRT Optimization

Modern computer vision applications demand real-time performance, yet many deep learning models struggle with high latency during deployment. This talk introduces how TensorRT can significantly accelerate inference by applying optimizations such as layer fusion, precision calibration, and efficient memory management. Attendees will learn the core concepts behind TensorRT, how it integrates into existing CV pipelines, and how to measure and benchmark improvements. Through practical examples and performance comparisons, the session will demonstrate how substantial speedups can be achieved with minimal model-accuracy loss. By the end, participants will understand when and how to apply TensorRT to make their CV models production-ready.
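For the "measure and benchmark improvements" part, a minimal latency harness might look like the sketch below. It is framework-agnostic: `dummy_model` is a placeholder workload, not a TensorRT engine, and the warm-up count, run count, and reported statistics are assumptions chosen for illustration.

```python
import statistics
import time


def benchmark(fn, warmup=5, runs=50):
    """Time a callable and report mean, median, and p95 latency in ms."""
    # Warm-up iterations are discarded so one-time setup cost is excluded.
    for _ in range(warmup):
        fn()
    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    latencies_ms.sort()
    return {
        "mean_ms": statistics.mean(latencies_ms),
        "p50_ms": statistics.median(latencies_ms),
        "p95_ms": latencies_ms[int(0.95 * (len(latencies_ms) - 1))],
    }


def dummy_model():
    """Placeholder workload standing in for one inference call."""
    sum(i * i for i in range(10_000))


stats = benchmark(dummy_model)
print(stats)
```

Running the same harness on the baseline and the optimized model gives a like-for-like comparison. For GPU inference, the timed callable would also need to wait for the device to finish before the timer stops, since GPU work is typically launched asynchronously.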

About the Speaker

Tushar Gadhiya is a Technical Lead at Infocusp Innovations, specialising in deep learning, computer vision, graph learning, and agentic AI. His experience spans academic research as a PhD holder and industry work, where he has contributed to multiple patents.

Feb 5 - AI, ML and Computer Vision Meetup

Join our virtual Meetup to hear talks from experts on cutting-edge topics across AI, ML, and computer vision.

Feb 5, 2026 9 - 11 AM Pacific Online. Register for the Zoom!

Unlocking Visual Anomaly Detection: Navigating Challenges and Pioneering with Vision-Language Models

Visual anomaly detection (VAD) is pivotal for ensuring quality in manufacturing, medical imaging, and safety inspections, yet it continues to face challenges such as data scarcity, domain shifts, and the need for precise localization and reasoning. This seminar explores VAD fundamentals, core challenges, and recent advancements leveraging vision-language models and multimodal large language models (MLLMs). We contrast CLIP-based methods for efficient zero/few-shot detection with MLLM-driven reasoning for explainable, threshold-free outcomes. Drawing from recent studies, we highlight emerging trends, benchmarks, and future directions toward building adaptable, real-world VAD systems. This talk is designed for researchers and practitioners interested in AI-driven inspection and next-generation multimodal approaches.

About the Speaker

Hossein Kashiani is a fourth-year Ph.D. student at Clemson University. His research focuses on developing generalizable and trustworthy AI systems, with publications in top venues such as CVPR, WACV, ICIP, IJCB, and TBIOM. His work spans diverse applications, including anomaly detection, media forensics, biometrics, healthcare, and visual perception.

Data-Centric Lessons To Improve Speech-Language Pretraining

Spoken Question-Answering (SQA) is a core capability for useful and interactive artificial intelligence systems. Recently, several speech-language models (SpeechLMs) have been released with a specific focus on improving their SQA performance. However, a lack of controlled ablations of pretraining data processing and curation makes it challenging to understand what factors account for performance, despite substantial gains from similar studies in other data modalities. In this work, we address this gap by conducting a data-centric exploration for pretraining SpeechLMs.

We focus on three research questions fundamental to speech-language pretraining data:

  • How to process raw web-crawled audio content for speech-text pretraining;
  • How to construct synthetic pretraining datasets to augment web-crawled data;
  • How to interleave (text, audio) segments into training sequences.

We apply the insights from our controlled data-centric ablations to pretrain a 3.8B-parameter SpeechLM, called SpeLangy, that outperforms models that are up to 3x larger by 10.2% absolute performance. We hope our findings highlight the impact of effective data curation for speech-language pretraining and guide future data-centric exploration in SpeechLMs.

About the Speaker

Vishaal Udandarao is a third year ELLIS PhD student, jointly working with Matthias Bethge at The University of Tuebingen and Samuel Albanie at The University of Cambridge/Google Deepmind. He is also a part of the International Max Planck Research School for Intelligent Systems. He is mainly interested in understanding the generalisation properties of foundation models, both vision-language models (VLMs) and large multi-modal models (LMMs), through the lens of their pre-training and test data distributions. His research is funded by a Google PhD Fellowship in Machine Intelligence.

A Practical Pipeline for Synthetic Data with Nano Banana Pro + FiftyOne

Most computer-vision failures come from the rare cases, the dark corners, odd combinations, and edge conditions we never capture enough in real datasets. In this session, we walk through a practical end-to-end pipeline for generating targeted synthetic data using Google’s Nano Banana Pro and managing it with FiftyOne. We’ll explore how to translate dataset gaps into generation prompts, create thousands of high-quality synthetic images, automatically enrich them with metadata, and bring everything into FiftyOne for inspection, filtering, and validation. By the end, you’ll understand how to build a repeatable synthetic-first workflow that closes real vision gaps and improves model performance on the scenarios that matter most.

About the Speaker

Adonai Vera - Machine Learning Engineer & DevRel at Voxel51. With over 7 years of experience building computer vision and machine learning models using TensorFlow\, Docker\, and OpenCV. I started as a software developer\, moved into AI\, led teams\, and served as CTO. Today\, I connect code and community to build open\, production-ready AI\, making technology simple\, accessible\, and reliable.

Making Computer Vision Models Faster: An Introduction to TensorRT Optimization

Modern computer vision applications demand real-time performance, yet many deep learning models struggle with high latency during deployment. This talk introduces how TensorRT can significantly accelerate inference by applying optimizations such as layer fusion, precision calibration, and efficient memory management. Attendees will learn the core concepts behind TensorRT, how it integrates into existing CV pipelines, and how to measure and benchmark improvements. Through practical examples and performance comparisons, the session will demonstrate how substantial speedups can be achieved with minimal model-accuracy loss. By the end, participants will understand when and how to apply TensorRT to make their CV models production-ready.

About the Speaker

Tushar Gadhiya is a Technical Lead at Infocusp Innovations, specialising in deep learning, computer vision, graph learning, and agentic AI. My experience spans academic research as a PhD holder and industry work, where I have contributed to multiple patents.

Feb 5 - AI, ML and Computer Vision Meetup

Join our virtual Meetup to hear talks from experts on cutting-edge topics across AI, ML, and computer vision.

Feb 5, 2026 9 - 11 AM Pacific Online. Register for the Zoom!

Unlocking Visual Anomaly Detection: Navigating Challenges and Pioneering with Vision-Language Models

Visual anomaly detection (VAD) is pivotal for ensuring quality in manufacturing, medical imaging, and safety inspections, yet it continues to face challenges such as data scarcity, domain shifts, and the need for precise localization and reasoning. This seminar explores VAD fundamentals, core challenges, and recent advancements leveraging vision-language models and multimodal large language models (MLLMs). We contrast CLIP-based methods for efficient zero/few-shot detection with MLLM-driven reasoning for explainable, threshold-free outcomes. Drawing from recent studies, we highlight emerging trends, benchmarks, and future directions toward building adaptable, real-world VAD systems. This talk is designed for researchers and practitioners interested in AI-driven inspection and next-generation multimodal approaches.

About the Speaker

Hossein Kashiani is a fourth-year Ph.D. student at Clemson University. His research focuses on developing generalizable and trustworthy AI systems, with publications in top venues such as CVPR, WACV, ICIP, IJCB, and TBIOM. His work spans diverse applications, including anomaly detection, media forensics, biometrics, healthcare, and visual perception.

Data-Centric Lessons To Improve Speech-Language Pretraining

Spoken Question-Answering (SQA) is a core capability for useful and interactive artificial intelligence systems. Recently, several speech-language models (SpeechLMs) have been released with a specific focus on improving their SQA performance. However, a lack of controlled ablations of pretraining data processing and curation makes it challenging to understand what factors account for performance, despite substantial gains from similar studies in other data modalities. In this work, we address this gap by conducting a data-centric exploration for pretraining SpeechLMs.

We focus on three research questions fundamental to speech-language pretraining data:

  • How to process raw web-crawled audio content for speech-text pretraining;
  • How to construct synthetic pretraining datasets to augment web-crawled data;
  • How to interleave (text, audio) segments into training sequences.

We apply the insights from our controlled data-centric ablations to pretrain a 3.8B-parameter SpeechLM, called SpeLangy, that outperforms models that are up to 3x larger by 10.2% absolute performance. We hope our findings highlight the impact of effective data curation for speech-language pretraining and guide future data-centric exploration in SpeechLMs.

About the Speaker

Vishaal Udandarao is a third year ELLIS PhD student, jointly working with Matthias Bethge at The University of Tuebingen and Samuel Albanie at The University of Cambridge/Google Deepmind. He is also a part of the International Max Planck Research School for Intelligent Systems. He is mainly interested in understanding the generalisation properties of foundation models, both vision-language models (VLMs) and large multi-modal models (LMMs), through the lens of their pre-training and test data distributions. His research is funded by a Google PhD Fellowship in Machine Intelligence.

A Practical Pipeline for Synthetic Data with Nano Banana Pro + FiftyOne

Most computer-vision failures come from the rare cases, the dark corners, odd combinations, and edge conditions we never capture enough in real datasets. In this session, we walk through a practical end-to-end pipeline for generating targeted synthetic data using Google’s Nano Banana Pro and managing it with FiftyOne. We’ll explore how to translate dataset gaps into generation prompts, create thousands of high-quality synthetic images, automatically enrich them with metadata, and bring everything into FiftyOne for inspection, filtering, and validation. By the end, you’ll understand how to build a repeatable synthetic-first workflow that closes real vision gaps and improves model performance on the scenarios that matter most.

About the Speaker

Adonai Vera - Machine Learning Engineer & DevRel at Voxel51. With over 7 years of experience building computer vision and machine learning models using TensorFlow\, Docker\, and OpenCV. I started as a software developer\, moved into AI\, led teams\, and served as CTO. Today\, I connect code and community to build open\, production-ready AI\, making technology simple\, accessible\, and reliable.

Making Computer Vision Models Faster: An Introduction to TensorRT Optimization

Modern computer vision applications demand real-time performance, yet many deep learning models struggle with high latency during deployment. This talk introduces how TensorRT can significantly accelerate inference by applying optimizations such as layer fusion, precision calibration, and efficient memory management. Attendees will learn the core concepts behind TensorRT, how it integrates into existing CV pipelines, and how to measure and benchmark improvements. Through practical examples and performance comparisons, the session will demonstrate how substantial speedups can be achieved with minimal model-accuracy loss. By the end, participants will understand when and how to apply TensorRT to make their CV models production-ready.

About the Speaker

Tushar Gadhiya is a Technical Lead at Infocusp Innovations, specialising in deep learning, computer vision, graph learning, and agentic AI. My experience spans academic research as a PhD holder and industry work, where I have contributed to multiple patents.

Feb 5 - AI, ML and Computer Vision Meetup

Join our virtual Meetup to hear talks from experts on cutting-edge topics across AI, ML, and computer vision.

Feb 5, 2026 9 - 11 AM Pacific Online. Register for the Zoom!

Unlocking Visual Anomaly Detection: Navigating Challenges and Pioneering with Vision-Language Models

Visual anomaly detection (VAD) is pivotal for ensuring quality in manufacturing, medical imaging, and safety inspections, yet it continues to face challenges such as data scarcity, domain shifts, and the need for precise localization and reasoning. This seminar explores VAD fundamentals, core challenges, and recent advancements leveraging vision-language models and multimodal large language models (MLLMs). We contrast CLIP-based methods for efficient zero/few-shot detection with MLLM-driven reasoning for explainable, threshold-free outcomes. Drawing from recent studies, we highlight emerging trends, benchmarks, and future directions toward building adaptable, real-world VAD systems. This talk is designed for researchers and practitioners interested in AI-driven inspection and next-generation multimodal approaches.

About the Speaker

Hossein Kashiani is a fourth-year Ph.D. student at Clemson University. His research focuses on developing generalizable and trustworthy AI systems, with publications in top venues such as CVPR, WACV, ICIP, IJCB, and TBIOM. His work spans diverse applications, including anomaly detection, media forensics, biometrics, healthcare, and visual perception.

Data-Centric Lessons To Improve Speech-Language Pretraining

Spoken Question-Answering (SQA) is a core capability for useful and interactive artificial intelligence systems. Recently, several speech-language models (SpeechLMs) have been released with a specific focus on improving their SQA performance. However, a lack of controlled ablations of pretraining data processing and curation makes it challenging to understand what factors account for performance, despite substantial gains from similar studies in other data modalities. In this work, we address this gap by conducting a data-centric exploration for pretraining SpeechLMs.

We focus on three research questions fundamental to speech-language pretraining data:

  • How to process raw web-crawled audio content for speech-text pretraining;
  • How to construct synthetic pretraining datasets to augment web-crawled data;
  • How to interleave (text, audio) segments into training sequences.

We apply the insights from our controlled data-centric ablations to pretrain a 3.8B-parameter SpeechLM, called SpeLangy, that outperforms models that are up to 3x larger by 10.2% absolute performance. We hope our findings highlight the impact of effective data curation for speech-language pretraining and guide future data-centric exploration in SpeechLMs.

About the Speaker

Vishaal Udandarao is a third year ELLIS PhD student, jointly working with Matthias Bethge at The University of Tuebingen and Samuel Albanie at The University of Cambridge/Google Deepmind. He is also a part of the International Max Planck Research School for Intelligent Systems. He is mainly interested in understanding the generalisation properties of foundation models, both vision-language models (VLMs) and large multi-modal models (LMMs), through the lens of their pre-training and test data distributions. His research is funded by a Google PhD Fellowship in Machine Intelligence.

A Practical Pipeline for Synthetic Data with Nano Banana Pro + FiftyOne

Most computer-vision failures come from the rare cases, the dark corners, odd combinations, and edge conditions we never capture enough in real datasets. In this session, we walk through a practical end-to-end pipeline for generating targeted synthetic data using Google’s Nano Banana Pro and managing it with FiftyOne. We’ll explore how to translate dataset gaps into generation prompts, create thousands of high-quality synthetic images, automatically enrich them with metadata, and bring everything into FiftyOne for inspection, filtering, and validation. By the end, you’ll understand how to build a repeatable synthetic-first workflow that closes real vision gaps and improves model performance on the scenarios that matter most.

About the Speaker

Adonai Vera - Machine Learning Engineer & DevRel at Voxel51. With over 7 years of experience building computer vision and machine learning models using TensorFlow\, Docker\, and OpenCV. I started as a software developer\, moved into AI\, led teams\, and served as CTO. Today\, I connect code and community to build open\, production-ready AI\, making technology simple\, accessible\, and reliable.

Making Computer Vision Models Faster: An Introduction to TensorRT Optimization

Modern computer vision applications demand real-time performance, yet many deep learning models struggle with high latency during deployment. This talk introduces how TensorRT can significantly accelerate inference by applying optimizations such as layer fusion, precision calibration, and efficient memory management. Attendees will learn the core concepts behind TensorRT, how it integrates into existing CV pipelines, and how to measure and benchmark improvements. Through practical examples and performance comparisons, the session will demonstrate how substantial speedups can be achieved with minimal model-accuracy loss. By the end, participants will understand when and how to apply TensorRT to make their CV models production-ready.

About the Speaker

Tushar Gadhiya is a Technical Lead at Infocusp Innovations, specialising in deep learning, computer vision, graph learning, and agentic AI. My experience spans academic research as a PhD holder and industry work, where I have contributed to multiple patents.

Feb 5 - AI, ML and Computer Vision Meetup

Join our virtual Meetup to hear talks from experts on cutting-edge topics across AI, ML, and computer vision.

Feb 5, 2026 9 - 11 AM Pacific Online. Register for the Zoom!

Unlocking Visual Anomaly Detection: Navigating Challenges and Pioneering with Vision-Language Models

Visual anomaly detection (VAD) is pivotal for ensuring quality in manufacturing, medical imaging, and safety inspections, yet it continues to face challenges such as data scarcity, domain shifts, and the need for precise localization and reasoning. This seminar explores VAD fundamentals, core challenges, and recent advancements leveraging vision-language models and multimodal large language models (MLLMs). We contrast CLIP-based methods for efficient zero/few-shot detection with MLLM-driven reasoning for explainable, threshold-free outcomes. Drawing from recent studies, we highlight emerging trends, benchmarks, and future directions toward building adaptable, real-world VAD systems. This talk is designed for researchers and practitioners interested in AI-driven inspection and next-generation multimodal approaches.

About the Speaker

Hossein Kashiani is a fourth-year Ph.D. student at Clemson University. His research focuses on developing generalizable and trustworthy AI systems, with publications in top venues such as CVPR, WACV, ICIP, IJCB, and TBIOM. His work spans diverse applications, including anomaly detection, media forensics, biometrics, healthcare, and visual perception.

Data-Centric Lessons To Improve Speech-Language Pretraining

Spoken Question-Answering (SQA) is a core capability for useful and interactive artificial intelligence systems. Recently, several speech-language models (SpeechLMs) have been released with a specific focus on improving their SQA performance. However, a lack of controlled ablations of pretraining data processing and curation makes it challenging to understand what factors account for performance, despite substantial gains from similar studies in other data modalities. In this work, we address this gap by conducting a data-centric exploration for pretraining SpeechLMs.

We focus on three research questions fundamental to speech-language pretraining data:

  • How to process raw web-crawled audio content for speech-text pretraining;
  • How to construct synthetic pretraining datasets to augment web-crawled data;
  • How to interleave (text, audio) segments into training sequences.

We apply the insights from our controlled data-centric ablations to pretrain a 3.8B-parameter SpeechLM, called SpeLangy, that outperforms models that are up to 3x larger by 10.2% absolute performance. We hope our findings highlight the impact of effective data curation for speech-language pretraining and guide future data-centric exploration in SpeechLMs.

About the Speaker

Vishaal Udandarao is a third year ELLIS PhD student, jointly working with Matthias Bethge at The University of Tuebingen and Samuel Albanie at The University of Cambridge/Google Deepmind. He is also a part of the International Max Planck Research School for Intelligent Systems. He is mainly interested in understanding the generalisation properties of foundation models, both vision-language models (VLMs) and large multi-modal models (LMMs), through the lens of their pre-training and test data distributions. His research is funded by a Google PhD Fellowship in Machine Intelligence.

A Practical Pipeline for Synthetic Data with Nano Banana Pro + FiftyOne

Most computer-vision failures come from the rare cases, the dark corners, odd combinations, and edge conditions we never capture enough in real datasets. In this session, we walk through a practical end-to-end pipeline for generating targeted synthetic data using Google’s Nano Banana Pro and managing it with FiftyOne. We’ll explore how to translate dataset gaps into generation prompts, create thousands of high-quality synthetic images, automatically enrich them with metadata, and bring everything into FiftyOne for inspection, filtering, and validation. By the end, you’ll understand how to build a repeatable synthetic-first workflow that closes real vision gaps and improves model performance on the scenarios that matter most.

About the Speaker

Adonai Vera - Machine Learning Engineer & DevRel at Voxel51. With over 7 years of experience building computer vision and machine learning models using TensorFlow\, Docker\, and OpenCV. I started as a software developer\, moved into AI\, led teams\, and served as CTO. Today\, I connect code and community to build open\, production-ready AI\, making technology simple\, accessible\, and reliable.

Making Computer Vision Models Faster: An Introduction to TensorRT Optimization

Modern computer vision applications demand real-time performance, yet many deep learning models struggle with high latency during deployment. This talk introduces how TensorRT can significantly accelerate inference by applying optimizations such as layer fusion, precision calibration, and efficient memory management. Attendees will learn the core concepts behind TensorRT, how it integrates into existing CV pipelines, and how to measure and benchmark improvements. Through practical examples and performance comparisons, the session will demonstrate how substantial speedups can be achieved with minimal model-accuracy loss. By the end, participants will understand when and how to apply TensorRT to make their CV models production-ready.

About the Speaker

Tushar Gadhiya is a Technical Lead at Infocusp Innovations, specialising in deep learning, computer vision, graph learning, and agentic AI. My experience spans academic research as a PhD holder and industry work, where I have contributed to multiple patents.

Feb 5 - AI, ML and Computer Vision Meetup

Join our virtual Meetup to hear talks from experts on cutting-edge topics across AI, ML, and computer vision.

Feb 5, 2026 9 - 11 AM Pacific Online. Register for the Zoom!

Unlocking Visual Anomaly Detection: Navigating Challenges and Pioneering with Vision-Language Models

Visual anomaly detection (VAD) is pivotal for ensuring quality in manufacturing, medical imaging, and safety inspections, yet it continues to face challenges such as data scarcity, domain shifts, and the need for precise localization and reasoning. This seminar explores VAD fundamentals, core challenges, and recent advancements leveraging vision-language models and multimodal large language models (MLLMs). We contrast CLIP-based methods for efficient zero/few-shot detection with MLLM-driven reasoning for explainable, threshold-free outcomes. Drawing from recent studies, we highlight emerging trends, benchmarks, and future directions toward building adaptable, real-world VAD systems. This talk is designed for researchers and practitioners interested in AI-driven inspection and next-generation multimodal approaches.

About the Speaker

Hossein Kashiani is a fourth-year Ph.D. student at Clemson University. His research focuses on developing generalizable and trustworthy AI systems, with publications in top venues such as CVPR, WACV, ICIP, IJCB, and TBIOM. His work spans diverse applications, including anomaly detection, media forensics, biometrics, healthcare, and visual perception.

Data-Centric Lessons To Improve Speech-Language Pretraining

Spoken Question-Answering (SQA) is a core capability for useful and interactive artificial intelligence systems. Recently, several speech-language models (SpeechLMs) have been released with a specific focus on improving their SQA performance. However, a lack of controlled ablations of pretraining data processing and curation makes it challenging to understand what factors account for performance, despite substantial gains from similar studies in other data modalities. In this work, we address this gap by conducting a data-centric exploration for pretraining SpeechLMs.

We focus on three research questions fundamental to speech-language pretraining data:

  • How to process raw web-crawled audio content for speech-text pretraining;
  • How to construct synthetic pretraining datasets to augment web-crawled data;
  • How to interleave (text, audio) segments into training sequences.

We apply the insights from our controlled data-centric ablations to pretrain a 3.8B-parameter SpeechLM, called SpeLangy, that outperforms models that are up to 3x larger by 10.2% absolute performance. We hope our findings highlight the impact of effective data curation for speech-language pretraining and guide future data-centric exploration in SpeechLMs.

About the Speaker

Vishaal Udandarao is a third year ELLIS PhD student, jointly working with Matthias Bethge at The University of Tuebingen and Samuel Albanie at The University of Cambridge/Google Deepmind. He is also a part of the International Max Planck Research School for Intelligent Systems. He is mainly interested in understanding the generalisation properties of foundation models, both vision-language models (VLMs) and large multi-modal models (LMMs), through the lens of their pre-training and test data distributions. His research is funded by a Google PhD Fellowship in Machine Intelligence.

A Practical Pipeline for Synthetic Data with Nano Banana Pro + FiftyOne

Most computer-vision failures come from the rare cases, the dark corners, odd combinations, and edge conditions we never capture enough in real datasets. In this session, we walk through a practical end-to-end pipeline for generating targeted synthetic data using Google’s Nano Banana Pro and managing it with FiftyOne. We’ll explore how to translate dataset gaps into generation prompts, create thousands of high-quality synthetic images, automatically enrich them with metadata, and bring everything into FiftyOne for inspection, filtering, and validation. By the end, you’ll understand how to build a repeatable synthetic-first workflow that closes real vision gaps and improves model performance on the scenarios that matter most.

About the Speaker

Adonai Vera - Machine Learning Engineer & DevRel at Voxel51. With over 7 years of experience building computer vision and machine learning models using TensorFlow, Docker, and OpenCV. I started as a software developer, moved into AI, led teams, and served as CTO. Today, I connect code and community to build open, production-ready AI, making technology simple, accessible, and reliable.

Making Computer Vision Models Faster: An Introduction to TensorRT Optimization

Modern computer vision applications demand real-time performance, yet many deep learning models struggle with high latency during deployment. This talk introduces how TensorRT can significantly accelerate inference by applying optimizations such as layer fusion, precision calibration, and efficient memory management. Attendees will learn the core concepts behind TensorRT, how it integrates into existing CV pipelines, and how to measure and benchmark improvements. Through practical examples and performance comparisons, the session will demonstrate how substantial speedups can be achieved with minimal model-accuracy loss. By the end, participants will understand when and how to apply TensorRT to make their CV models production-ready.
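As a rough illustration of the "measure and benchmark improvements" step, the sketch below times two callables and reports the speedup. It uses only the Python standard library and does not call TensorRT; `baseline_model` and `optimized_model` are hypothetical stand-ins for a model's inference call before and after optimization.

```python
import statistics
import time


def benchmark(fn, warmup=5, runs=50):
    """Return (median, p95) latency in milliseconds for calling fn()."""
    for _ in range(warmup):  # warm-up calls are excluded from the timings
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95 = samples[min(len(samples) - 1, int(0.95 * len(samples)))]
    return statistics.median(samples), p95


def speedup(baseline_ms, optimized_ms):
    """Ratio of baseline latency to optimized latency."""
    return baseline_ms / optimized_ms


def baseline_model():  # hypothetical stand-in for the unoptimized model
    return sum(i * i for i in range(20000))


def optimized_model():  # hypothetical stand-in for the optimized engine
    return sum(i * i for i in range(2000))


base_median, base_p95 = benchmark(baseline_model)
opt_median, opt_p95 = benchmark(optimized_model)
print(f"median speedup: {speedup(base_median, opt_median):.1f}x")
```

When timing real GPU inference, the device generally needs to be synchronized before the timer is read, because kernel launches are asynchronous; otherwise the measured latency can be misleadingly low.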

About the Speaker

Tushar Gadhiya is a Technical Lead at Infocusp Innovations, specialising in deep learning, computer vision, graph learning, and agentic AI. With a PhD, Tushar has experience spanning academic research and industry work, including contributions to multiple patents.

Feb 5 - AI, ML and Computer Vision Meetup

Register for the event to reserve your spot!

Date and Time Feb 7, 2025 from 5:30 PM to 8:30 PM

Location The Meetup will take place at MotionLab.Berlin, Bouchéstraße 12/Halle 20 in Berlin

Smart Data Loops: A New Paradigm for AI Development and Anomaly Detection

In the era of autonomous driving, the quality and efficiency of AI development hinge on the ability to manage data intelligently. This talk introduces the Smart Data Loop, a novel paradigm that changes how data is handled by improving out-of-distribution detection and leveraging trigger functions to refine AI models continuously. We will explore how these approaches enhance anomaly detection and streamline AI workflows.

About the Speaker

Dr. Azarm Nowzad holds a PhD in Computer Science and serves as the Technical Project Lead and Product Owner for “Data for AI” at Continental Automotive. She is currently leading the publicly funded project “justbetterDATA”, which focuses on developing efficient and highly accurate data generation methods for AI applications, particularly in the field of autonomous driving. With her expertise in computer vision and AI, she plays a pivotal role in advancing data-driven solutions for next-generation mobility.

All About Agentic AI

Today, the concept of Agentic AI is shaping how we think about intelligent systems. These are AI systems designed to act autonomously, making decisions, completing tasks, and interacting with their environment—beyond traditional AI models. Understanding how to design and develop Agentic AI products is essential for staying ahead in the competitive landscape of AI-driven innovation. In this talk, Dr. Arman Nassirtoussi introduces Agentic AI. He’ll cover how these systems differ from standard AI, the evolving architectures that support them, and why they’re becoming critical.

About the Speaker

Dr. Arman Nassirtoussi earned his PhD in AI over a decade ago, focusing on predictive AI algorithms for intraday financial trading using Natural Language Processing (NLP), sentiment analysis, and text mining of online news. His main publication has quickly received over 1,200 citations on Google Scholar. Arman has led large data engineering, data science, and AI teams at companies like Henkel, Zalando, and T-Systems, helping build infrastructure, platforms, and products with a major focus on personalization and product analytics in e-commerce. Arman has also created a number of startups in multiple countries, and he is currently shaping a new one in the Agentic AI space.

Bridging Minds and Machines: Aligning Human Behavior and Machine Algorithm

As AI systems increasingly support human decision-making, integrating human-centered design principles into ML engineering has become essential. This talk bridges the foundational concepts of Human-Computer Interaction (HCI) with the complex demands of algorithmic decision-making, focusing on bidirectional Human-AI alignment, trust calibration, and Reciprocal Human-Machine Learning (RHML).

We explore the necessity of embedding human behavior and neurocognitive feedback loops into ML pipelines to enable adaptive and trustworthy systems. Addressing overtrust, undertrust, and trust miscalibration, we emphasize aligning ML systems with both high-performance metrics and user behavior, ensuring systems are effective and ethically aligned.

About the Speaker

Anke Borchers is an AI Strategist and Consultant specializing in Machine Learning (ML), Generative AI, and Trustworthy AI. With a background in Industrial and Communication Design and over 15 years of experience in innovation and business strategy, she bridges the gap between human-centered design and advanced AI systems.

Dedicated to crafting tailored solutions for the medical and business sectors, Anke highlights the critical importance of human-centered AI systems. She offers deep expertise in cognitive and machine decision-making, as well as AI Alignment, empowering organizations to develop AI solutions that are high-performing, ethically sound, and optimized to address user needs effectively.

Attention is All We Need: Using Transformers in Vision Tasks

The attention mechanism, initially developed for natural language processing, is now being applied effectively in computer vision. This talk will focus on how attention enables Vision Transformers to capture context and why they are outperforming classical approaches to vision tasks.
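For readers new to the idea, the following is a minimal sketch of scaled dot-product attention in plain Python, assuming the standard formulation softmax(QK^T / sqrt(d))V. It is not taken from the talk; the function names and toy inputs are illustrative. In a vision model, each row would be the embedding of an image patch rather than a word.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Three toy "patch" embeddings attending to each other (self-attention).
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = attention(patches, patches, patches)
```

Because every patch attends to every other patch, each output vector mixes information from the whole image, which is how attention captures context.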

About the Speaker

Kira Kravets is a Machine Learning engineer at Kertos, specializing in LLMs and the development of trustworthy AI systems. With experience in Computer Vision, particularly in the highly demanding medical field, she is passionate about building real-world AI applications with all the limitations and restrictions of production environments.

Feb 7 - Berlin AI, ML and Computer Vision Meetup