talk-data.com
Activities & events
AI Builders Amsterdam :: Pizza, Demos & Networking (paid event)
2026-01-29 · 16:30
🎟️ Get tickets: https://lu.ma/ai-builders 🎟️ ☝️ This is a paid meetup (€10–€20); a Luma ticket is required!

Join our monthly AI meetup for practical demos and technical talks about building with LLMs and other generative AI models.

:: FOR WHO ::
✅ Anyone actively building with generative AI
✅ Devs, product people, data lovers, ML engineers, founders
⚠️ Technical LLM knowledge required!*

:: FORMAT ::
💻 ⚡️ Speed demos (10 min): builders sharing real-world AI solutions, including their breakthrough code, diagrams, and prompts!
🎤 🦄 Pioneer talks (20 min): an inspiring talk or demo from an emerging gen-AI leader in Europe or Silicon Valley
🤝🍕🍻 Fun vibes: lots of time to connect with other builders over some yummy pizza and drinks

:: AGENDA ::
17:30 🤝 Drinks & networking
18:00 🍕 Pizza (be early!)
18:30 🎤 🦄 Pioneer talk (20 min)
--- Break ---
19:30 💻 ⚡️ Demos (4 × 10 min)
20:10 🍻 Drinks & networking
21:00 End

:: FAQ ::
• What's AI Builders? We're a self-organizing nonprofit community of 3,000+ AI nerds in Europe. Yes, we're building our own AI CEO.
• Why do I need to pay? 1) So we know how many people will come (the venue has a maximum capacity, and it reduces food waste). 2) Sponsor money doesn't cover all of our costs yet.
• Can I get a free ticket? Can I volunteer as co-host? Message Cristian (+31636420602) to ask whether we still need co-hosts or to request a free ticket. Co-hosts arrive 1.5 h early and help set up the event or welcome people.
• *I'm not technical. Can I come? Yes, but to enjoy the meetup we recommend learning about these LLM concepts: multimodal models, vector embeddings, RAG, chaining, structured output, function calling, API calls, knowledge graphs, reinforcement learning, fine-tuning, and agents. Additionally: computer vision, diffusion models, DevOps, MLOps.
• Why go to AI meetups?
Event: AI Builders Amsterdam :: Pizza, Demos & Networking (paid event)
Building Agentic AI: Workflows, Fine-Tuning, Optimization, and Deployment
Sinan Ozdemir – author
Transform your business with intelligent AI that drives outcomes. Building reactive AI applications and chatbots is no longer enough; the competitive advantage belongs to those who can build AI that can respond, reason, plan, and execute. Building Agentic AI: Workflows, Fine-Tuning, Optimization, and Deployment takes you beyond basic chatbots to create fully functional, autonomous agents that automate real workflows, enhance human decision-making, and drive measurable business outcomes across high-impact domains like customer support, finance, and research. Whether you're a developer deploying your first model, a data scientist exploring multi-agent systems and distilled LLMs, or a product manager integrating AI workflows and embedding models, this practical handbook provides tried-and-tested blueprints for building production-ready systems. Harness the power of reasoning models for applications like computer use, multimodal systems that work with all kinds of data, and fine-tuning techniques that get the most out of AI. Learn to test, monitor, and optimize agentic systems to keep them reliable and cost-effective at enterprise scale.

• Master the complete agentic AI pipeline
• Design adaptive AI agents with memory, tool use, and collaborative reasoning capabilities
• Build robust RAG workflows using embeddings, vector databases, and LangGraph state management
• Implement comprehensive evaluation frameworks that go beyond accuracy to include precision, recall, and latency metrics
• Deploy multimodal AI systems that seamlessly integrate text, vision, audio, and code generation
• Optimize models for production through fine-tuning, quantization, and speculative decoding
• Navigate the bleeding edge of reasoning LLMs and computer-use capabilities
• Balance cost, speed, accuracy, and privacy in real-world deployment scenarios
• Create hybrid architectures that combine multiple agents for complex enterprise applications

Register your book for convenient access to downloads, updates, and corrections as they become available; see inside the book for details.
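As a flavor of the retrieve-then-generate pattern behind the RAG bullet above, here is a minimal sketch. The `embed()` function is a toy stand-in for a real embedding model, and the in-memory matrix stands in for a vector database; all names and data are illustrative rather than taken from the book.

```python
# Minimal sketch of the retrieve-then-generate loop. embed() is a toy
# bag-of-words hashing embedder standing in for a real embedding model;
# the `index` matrix stands in for a vector database.
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    v = np.zeros(dim)
    for token in text.lower().split():
        v[hash(token) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

corpus = [
    "Refunds are processed within 5 business days.",
    "Our API rate limit is 100 requests per minute.",
    "Support is available 24/7 via chat.",
]
index = np.stack([embed(doc) for doc in corpus])

def retrieve(query: str, k: int = 2) -> list[str]:
    scores = index @ embed(query)          # cosine similarity on unit vectors
    return [corpus[i] for i in np.argsort(scores)[::-1][:k]]

context = "\n".join(retrieve("How fast do refunds arrive?"))
prompt = f"Answer using this context:\n{context}"
print(prompt)  # in an agentic workflow, this prompt would go to an LLM
```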
Event: O'Reilly AI & ML Books
Autocurator: AI-Powered Knowledge Extraction for RAG Systems
2025-11-27 · 19:45
Marco P. Abrate – Visiting AI Engineer @ BCG X
Eleonora Vardè – Lead Data Scientist @ BCG X Milan
In the era of information overload, organizations struggle to harness the vast amount of unstructured data stored across presentations, reports, images, and text documents. That's why we created the "Autocurator", an AI-powered tool designed to automatically extract, structure, and curate knowledge from heterogeneous document repositories to support Retrieval-Augmented Generation (RAG) systems. Autocurator integrates advanced document-parsing pipelines, multimodal AI models, and semantic structuring techniques to convert diverse content, including text, slides, tables, and diagrams, into machine-readable knowledge. This enables downstream RAG systems to query not only text-based insights but also visual and conceptual knowledge that traditionally remained inaccessible. Our system employs a multi-stage pipeline: (1) document ingestion and format normalization, (2) de-duplication of redundant and conflicting information, (3) multimodal content understanding using large language and vision models, (4) entity and relationship extraction with human-in-the-loop validation, and (5) generation of structured outputs optimized for retrieval. We will demonstrate Autocurator's effectiveness on large enterprise document corpora, showing significant gains in retrieval precision and generative quality across several applied AI use cases. By bridging unstructured data and structured knowledge, Autocurator provides a scalable and transparent foundation for next-generation knowledge management and reasoning systems.
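To make the five stages concrete, here is an illustrative skeleton of such a pipeline. This is a reconstruction from the abstract, not BCG X's implementation; every function body is a placeholder for the real parser, multimodal model call, or validation step, and all names are hypothetical.

```python
# Illustrative skeleton of the five-stage pipeline described above; every
# stage body is a placeholder for the real parser, model, or validator.
from dataclasses import dataclass, field

@dataclass
class Document:
    source: str
    text: str
    entities: list = field(default_factory=list)

def ingest(paths):
    """Stage 1: parse heterogeneous files (PDF, PPTX, images) into one format."""
    return [Document(source=p, text=f"<parsed content of {p}>") for p in paths]

def deduplicate(docs):
    """Stage 2: drop redundant or conflicting passages (a real system would
    use near-duplicate detection, e.g. MinHash, rather than exact matching)."""
    seen, unique = set(), []
    for doc in docs:
        if doc.text not in seen:
            seen.add(doc.text)
            unique.append(doc)
    return unique

def understand(docs):
    """Stage 3: placeholder for multimodal LLM/vision calls that turn slides,
    tables, and diagrams into descriptive text."""
    return docs

def extract_entities(docs, review=lambda triples: triples):
    """Stage 4: entity/relation extraction; `review` is the human-in-the-loop hook."""
    for doc in docs:
        doc.entities = review([("<subject>", "<relation>", "<object>")])
    return docs

def to_retrieval_units(docs):
    """Stage 5: emit structured chunks optimized for a RAG retriever."""
    return [{"source": d.source, "text": d.text, "entities": d.entities} for d in docs]

units = to_retrieval_units(extract_entities(understand(deduplicate(ingest(
    ["strategy_deck.pptx", "annual_report.pdf"])))))
```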
Event: Assessing Risk of Extreme Events & Knowledge Extraction for RAG Systems
Oct 30 - AI, ML and Computer Vision Meetup
2025-10-30 · 16:00
Join the virtual meetup to hear talks from experts on cutting-edge topics across AI, ML, and computer vision.
Date, time, and location: Oct 30, 2025, 9 AM Pacific, online. Register for the Zoom!

The Agent Factory: Building a Platform for Enterprise-Wide AI Automation
In this talk we will explore what it takes to build an enterprise-ready AI automation platform at scale. The topics covered will include:

About the Speaker: Virender Bhargav of Flipkart is a seasoned engineering leader whose expertise spans business technology integration, enterprise applications, system design and architecture, and building highly scalable systems. With a deep understanding of technology, he has spearheaded teams, modernized technology landscapes, and managed core platform layers and strategic products. With extensive experience driving innovation at companies like Paytm and Flipkart, his contributions have left a lasting impact on the industry.

Scaling Generative Models at Scale with Ray and PyTorch
Generative image models like Stable Diffusion have opened up exciting possibilities for personalization, creativity, and scalable deployment. However, fine-tuning them in production-grade settings poses challenges: managing compute, hyperparameters, model size, data, and distributed coordination is nontrivial. In this talk, we'll dive deep into how to fine-tune Stable Diffusion models using Ray Train (with Hugging Face Diffusers), including approaches like DreamBooth and LoRA. We'll cover what works (and what doesn't) in scaling out training jobs, handling large data, optimizing for GPU memory and speed, and validating outputs. Attendees will come away with practical insights and patterns they can use to fine-tune generative models in their own work.

About the Speaker: Suman Debnath is a Technical Lead (ML) at Anyscale, where he focuses on distributed training, fine-tuning, and inference optimization at scale on the cloud. His work centers on building and optimizing end-to-end machine learning workflows powered by distributed computing frameworks like Ray, enabling scalable and efficient ML systems. Suman's expertise spans natural language processing (NLP), large language models (LLMs), and retrieval-augmented generation (RAG). Earlier in his career, he developed performance benchmarking and monitoring tools for distributed storage systems. Beyond engineering, Suman is an active community contributor, having spoken at over 100 global conferences and events, including PyCon, PyData, ODSC, AIE, and numerous meetups worldwide.
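For readers curious what the Ray Train plus LoRA pattern from this talk looks like in code, here is a condensed sketch. It assumes `ray[train]`, `torch`, `diffusers`, and `peft` are installed; a real DreamBooth/LoRA run also needs the noise scheduler, the text encoder, and an image dataloader, so this only shows how the pieces are wired together, not the speaker's actual training script.

```python
# Condensed sketch of the Ray Train + LoRA pattern for Stable Diffusion.
# The denoising-loss loop is elided; this shows the wiring only.
import torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from diffusers import UNet2DConditionModel
from peft import LoraConfig, get_peft_model

def train_loop_per_worker(config):
    unet = UNet2DConditionModel.from_pretrained(
        "runwayml/stable-diffusion-v1-5", subfolder="unet"
    )
    # Wrap the attention projections with low-rank adapters; only the small
    # adapter weights are trained, which is what keeps LoRA memory-friendly.
    unet = get_peft_model(
        unet, LoraConfig(r=config["rank"], target_modules=["to_q", "to_k", "to_v"])
    )
    optimizer = torch.optim.AdamW(unet.parameters(), lr=config["lr"])
    # ... denoising-loss training loop over image batches goes here ...

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"rank": 4, "lr": 1e-4},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),  # the scale-out knob
)
# trainer.fit()  # launches the distributed job on a Ray cluster
```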
Privacy-Preserving Computer Vision through Optics Learning
Cameras are now ubiquitous, powering computer vision systems that assist us in everyday tasks and in critical settings such as operating rooms. Yet their widespread use raises serious privacy concerns: traditional cameras are designed to capture high-resolution images, making it easy to identify sensitive attributes such as faces, nudity, or personal objects. Once acquired, such data can be misused if accessed by adversaries. Existing software-based privacy mechanisms, such as blurring or pixelation, often degrade task performance and leave vulnerabilities in the processing pipeline. In this talk, we explore an alternative question: how can we preserve privacy before or during image acquisition? By revisiting the image formation model, we show how camera optics themselves can be learned and optimized to acquire images that are unintelligible to humans yet remain useful for downstream vision tasks like action recognition. We will discuss recent approaches to learning camera lenses that intentionally produce privacy-preserving images, blurry and unrecognizable to the human eye but still effective for machine perception. This paradigm shift opens the door to a new generation of cameras that embed privacy directly into their hardware design.

About the Speaker: Carlos Hinojosa is a postdoctoral researcher at King Abdullah University of Science and Technology (KAUST), working with Prof. Bernard Ghanem. His research interests span computer vision, machine learning, AI safety, and AI for science. He focuses on developing safe, accurate, and efficient vision systems and machine-learning models that can reliably perceive, understand, and act on information while ensuring robustness, protecting privacy, and aligning with societal values.
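The core idea of the talk, optimizing the camera itself jointly with the task network, can be illustrated with a toy PyTorch sketch. The learnable blur kernel below is a stand-in for a real lens/point-spread-function model; the setup is ours for illustration, not the speaker's implementation.

```python
# Toy rendition of "learned optics": the camera is modeled as a learnable
# point-spread function (PSF) and trained jointly with the task network, so
# the captured image degrades human-recognizable detail while the task loss
# keeps it useful for machine perception.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableOptics(nn.Module):
    def __init__(self, kernel_size: int = 11):
        super().__init__()
        # The "lens": a learnable PSF applied at acquisition time.
        self.psf = nn.Parameter(torch.rand(1, 1, kernel_size, kernel_size))

    def forward(self, x):
        # Softmax keeps the kernel positive and energy-preserving.
        psf = F.softmax(self.psf.flatten(), dim=0).view_as(self.psf)
        return F.conv2d(x, psf, padding=self.psf.shape[-1] // 2)

optics = LearnableOptics()
task_net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # toy task head

opt = torch.optim.Adam([*optics.parameters(), *task_net.parameters()], lr=1e-3)
images, labels = torch.rand(8, 1, 28, 28), torch.randint(0, 10, (8,))

# One joint step: PSF and classifier train together. A real system would add
# an explicit privacy loss (e.g. penalizing face recognizability).
loss = F.cross_entropy(task_net(optics(images)), labels)
loss.backward()
opt.step()
```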
It's a (Blind) Match! Towards Vision-Language Correspondence without Parallel Data
Can we match vision and language embeddings without any supervision? According to the platonic representation hypothesis, as model and dataset scales increase, the distances between corresponding representations become similar in both embedding spaces. Our study demonstrates that pairwise distances are often sufficient to enable unsupervised matching, allowing vision-language correspondences to be discovered without any parallel data.

About the Speaker: Dominik Schnaus is a third-year Ph.D. student in the Computer Vision Group at the Technical University of Munich (TUM), supervised by Daniel Cremers. His research centers on multimodal and self-supervised learning, with a special emphasis on understanding similarities across the embedding spaces of different modalities.
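A toy illustration of matching from pairwise distances alone: the snippet below never compares a vision vector to a text vector directly, only each space's internal distance geometry. The sorted-distance-profile heuristic is a deliberate simplification for illustration, not the method from the talk, and recovery is only approximate when the two geometries differ.

```python
# Blind matching from pairwise distances only: no vision vector is ever
# compared to a text vector, just each space's internal geometry.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
concepts = rng.normal(size=(10, 5))          # shared latent "concepts"
vision = concepts @ rng.normal(size=(5, 7))  # two differently-shaped spaces
text = concepts @ rng.normal(size=(5, 9))
perm = rng.permutation(10)                   # hidden correspondence
text = text[perm]

D_v, D_t = cdist(vision, vision), cdist(text, text)

# Each item's sorted distance profile approximately survives the change of
# embedding space, so matching profiles can recover the correspondence.
profile_v = np.sort(D_v, axis=1) / D_v.max()
profile_t = np.sort(D_t, axis=1) / D_t.max()
_, match = linear_sum_assignment(cdist(profile_v, profile_t))

print("recovered:", match)
print("ground truth:", np.argsort(perm))  # may differ where geometry is distorted
```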
Event: Oct 30 - AI, ML and Computer Vision Meetup
Aug 29 - Visual Agents Workshop Part 3: Teaching Machines to See and Click
2025-08-29 · 16:00
Welcome to the three-part Visual Agents Workshop virtual series: your hands-on opportunity to learn about visual agents, how they work, how to develop them, and how to fine-tune them.
Date and time: Aug 29, 2025 at 9 AM Pacific.

Part 3: Teaching Machines to See and Click - Model Fine-Tuning
From foundation models to GUI specialists: foundation models such as Qwen2.5-VL demonstrate impressive visual understanding, but they require specialized training to master GUI interactions. In this final session, you'll transform a general-purpose vision-language model into a GUI specialist that can navigate interfaces with human-like precision. We'll explore modern fine-tuning strategies designed specifically for GUI tasks, from selecting the right architecture to handling the unique challenges of coordinate prediction and multi-step reasoning. You'll implement training pipelines that can handle the diverse formats and platforms in your dataset, evaluate models on metrics that actually matter for GUI automation, and deploy your trained model in a real-world testing environment.

About the Instructor: Harpreet Sahota is a hacker-in-residence and machine learning engineer with a passion for deep learning and generative AI. He has a deep interest in RAG, agents, and multimodal AI.
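As a hint of what "training pipelines that handle diverse formats" can mean in practice, here is a sketch of one plausible GUI-grounding fine-tuning sample: a screenshot plus an instruction, with the target action expressed as normalized click coordinates. The field names are hypothetical; a real pipeline follows the chat template of the specific model being tuned.

```python
# Illustrative shape of a GUI-grounding SFT sample: screenshot + instruction
# in, click action out. Field names are made up for illustration.
import json

def make_sample(screenshot_path, instruction, x, y, width, height):
    return {
        "image": screenshot_path,
        "conversations": [
            {"role": "user", "content": f"<image>\n{instruction}"},
            # Normalizing coordinates to [0, 1] keeps the target resolution-
            # independent, one of the GUI-specific issues the session covers.
            {"role": "assistant",
             "content": json.dumps({"action": "click",
                                    "x": round(x / width, 4),
                                    "y": round(y / height, 4)})},
        ],
    }

sample = make_sample("screens/login.png", "Click the Submit button",
                     x=642, y=913, width=1280, height=1024)
print(json.dumps(sample, indent=2))
```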
Event: Aug 29 - Visual Agents Workshop Part 3: Teaching Machines to See and Click
Aug 15 - Visual Agent Workshop Part 1: Navigating the GUI Agent Landscape
2025-08-15 · 16:00
Welcome to the three-part Visual Agents Workshop virtual series: your hands-on opportunity to learn about visual agents, how they work, how to develop them, and how to fine-tune them.
Date and time: Aug 15, 2025 at 9 AM Pacific.

Part 1: Navigating the GUI Agent Landscape
Understanding the foundation before building: the GUI agent field is evolving rapidly, but success requires an understanding of what came before. In this opening session, we'll map the terrain of GUI agent research, from the early days of MiniWoB's simplified environments to today's complex, multimodal systems tackling real-world applications. You'll discover why standard vision models fail catastrophically on GUI tasks, explore the annotation bottlenecks that make GUI datasets so expensive to create, and understand the platform fragmentation that makes "click a button" mean twenty different things across datasets. We'll dissect the most influential datasets (Mind2Web, AITW, Rico) and models that have shaped the field, examining their strengths, limitations, and the research gaps they reveal. By the end, you'll have a clear picture of where GUI agents excel, where they struggle, and, most importantly, where the opportunities lie for your own contributions.

About the Instructor: Harpreet Sahota is a hacker-in-residence and machine learning engineer with a passion for deep learning and generative AI. He has a deep interest in RAG, agents, and multimodal AI.
Event: Aug 15 - Visual Agent Workshop Part 1: Navigating the GUI Agent Landscape