talk-data.com

Topic: Data Collection (16 tagged)

Activity Trend: peak of 17 activities/qtr, 2020-Q1 to 2026-Q1

Activities: 16 activities · Newest first

Bridging Accessibility and AI: Sign Language Recognition & Inclusive Design with Sheida Rashidi

As AI continues to shape human-computer interaction, there’s a growing opportunity and responsibility to ensure these technologies serve everyone, including people with communication disabilities. In this talk, I will present my ongoing work in developing a real-time American Sign Language (ASL) recognition system, and explore how integrating accessible design principles into AI research can expand both usability and impact.

The core of the talk will cover the Sign Language Recogniser project (available on GitHub), in which I used MediaPipe Studio together with TensorFlow, Keras, and OpenCV to train a model that classifies ASL letters from hand-tracking features.

I’ll share the methodology: data collection, feature extraction via MediaPipe, model training, and demo/testing results. I’ll also discuss challenges encountered, such as dealing with gesture variability, lighting and camera differences, latency constraints, and model generalization.
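For readers who want a concrete starting point, here is a minimal sketch of that pipeline, assuming the 21 hand landmarks MediaPipe returns per frame and a 26-class (A-Z) letter classifier; the project's actual features and architecture may differ.

```python
# Minimal sketch: MediaPipe hand landmarks -> Keras letter classifier.
import cv2
import mediapipe as mp
import numpy as np
from tensorflow import keras

hands = mp.solutions.hands.Hands(static_image_mode=False, max_num_hands=1)

def extract_features(bgr_frame):
    """Flatten the 21 (x, y, z) hand landmarks into a 63-dim feature vector."""
    results = hands.process(cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2RGB))
    if not results.multi_hand_landmarks:
        return None  # no hand detected in this frame
    lm = results.multi_hand_landmarks[0].landmark
    return np.array([[p.x, p.y, p.z] for p in lm]).flatten()

# A small dense classifier over landmark features (hypothetical layer sizes).
model = keras.Sequential([
    keras.layers.Input(shape=(63,)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(26, activation="softmax"),  # one class per ASL letter
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```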

Beyond the technical implementation, I’ll reflect on the broader implications: how accessibility-focused AI projects can promote inclusion, how design decisions affect trust and usability, and how women in AI & data science can lead innovation that is both rigorous and socially meaningful. Attendees will leave with actionable insights for building inclusive AI systems, especially in domains involving rich human modalities such as gesture or sign.

The Elephant in the room between data collection and data science with Katya Kovalenko

Whether you call it wrangling, cleaning, or preprocessing, data prep is often the most expensive and time-consuming part of the analytical pipeline. It may involve converting data into machine-readable formats, integrating across many datasets, or detecting outliers, and it can be a large source of error if done manually. A lack of machine-readable or integrated data limits connectivity across fields and restricts data accessibility, sharing, and reuse, making it a significant contributor to research waste.

For students, it is perhaps the greatest barrier to adopting quantitative tools and advancing their coding and analytical skills. AI tools are available for automating the cleanup and integration, but due to the one-of-a-kind nature of these problems, these approaches still require extensive human collaboration and testing. I review some of the common challenges in data cleanup and integration, approaches for understanding dataset structures, and strategies for developing and testing workflows.
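As a hedged illustration of the kind of cleanup steps discussed above, here is a short pandas sketch over a hypothetical CSV; the file, column names, and thresholds are invented, and every real dataset needs its own tailored workflow.

```python
# Common cleanup steps: standardize names, coerce types, flag outliers.
import pandas as pd

df = pd.read_csv("field_measurements.csv")  # hypothetical input file

# Standardize column names into a machine-readable form.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Coerce types: unparseable dates/numbers become NaT/NaN for later review.
df["date"] = pd.to_datetime(df["date"], errors="coerce")
df["value"] = pd.to_numeric(df["value"], errors="coerce")

# Flag (rather than silently drop) outliers using the 1.5*IQR rule.
q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
df["is_outlier"] = ~df["value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
```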

Benchmarking 2000+ Cloud Servers for GBM Model Training and LLM Inference Speed

Spare Cores is a Python-based, open-source, and vendor-independent ecosystem collecting, generating, and standardizing comprehensive data on cloud server pricing and performance. In our latest project, we started 2000+ server types across five cloud vendors to evaluate their suitability for serving Large Language Models from 135M to 70B parameters. We tested how efficiently models can be loaded into memory or VRAM, and measured inference speed across varying token lengths for both prompt processing and text generation. The published data can help you find the optimal instance type for your LLM serving needs; we will also share our experiences and challenges with the data collection, along with insights into general patterns.
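As a rough sketch of how such throughput numbers can be measured, the harness below times tokens per second for one configuration; `generate` is a purely hypothetical stand-in for whatever inference backend is under test.

```python
# Hypothetical timing harness for LLM inference throughput.
import time

def tokens_per_second(generate, prompt_tokens: list[int],
                      max_new_tokens: int) -> float:
    """Measure text-generation throughput for one server/model configuration."""
    start = time.perf_counter()
    new_tokens = generate(prompt_tokens, max_new_tokens)  # hypothetical callable
    elapsed = time.perf_counter() - start
    return len(new_tokens) / elapsed

# The benchmark sweeps token lengths separately for prompt processing and
# generation, e.g.: for prompt_len in (128, 512, 2048, 8192): ...
```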

Leveling Up Gaming Analytics: How Supercell Evolved Player Experiences With Snowplow and Databricks

In the competitive gaming industry, understanding player behavior is key to delivering engaging experiences. Supercell, creators of Clash of Clans and Brawl Stars, faced challenges with fragmented data and limited visibility into user journeys. To address this, they partnered with Snowplow and Databricks to build a scalable, privacy-compliant data platform for real-time insights. By leveraging Snowplow’s behavioral data collection and Databricks’ Lakehouse architecture, Supercell achieved:

- Cross-platform data unification: a unified view of player actions across web, mobile, and in-game
- Real-time analytics: streaming event data into Delta Lake for dynamic game balancing and engagement
- Scalable infrastructure: supporting terabytes of data during launches and live events
- AI & ML use cases: churn prediction and personalized in-game recommendations

This session explores Supercell’s data journey and AI-driven player engagement strategies.
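The "streaming event data into Delta Lake" pattern above might look roughly like the following PySpark Structured Streaming sketch; the Kafka broker, topic, schema, and paths are hypothetical, not Supercell's actual pipeline.

```python
# Hedged sketch: stream enriched behavioral events into a Delta table.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("player-events").getOrCreate()

event_schema = StructType([
    StructField("player_id", StringType()),
    StructField("event_name", StringType()),
    StructField("platform", StringType()),    # web / mobile / in-game
    StructField("event_time", TimestampType()),
])

events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "enriched-events")            # hypothetical topic
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*"))

(events.writeStream
    .format("delta")
    .option("checkpointLocation", "/chk/player_events")
    .start("/delta/player_events"))
```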

Sponsored by: Oxylabs | Web Scraping and AI: A Quiet but Critical Partnership

Behind every powerful AI system lies a critical foundation: fresh, high-quality web data. This session explores the symbiotic relationship between web scraping and artificial intelligence that's transforming how technical teams build data-intensive applications. We'll showcase how this partnership enables crucial use cases: analyzing trends, forecasting behaviors, and enhancing AI models with real-time information. Technical challenges that once made web scraping prohibitively complex are now being solved through the very AI systems they help create. You'll learn how machine learning revolutionizes web data collection, making previously impossible scraping projects both feasible and maintainable, while dramatically reducing engineering overhead and improving data quality. Join us to explore this quiet but critical partnership that's powering the next generation of AI applications.

Optimize Cost and User Value Through Model Routing AI Agent

Each LLM has unique strengths and weaknesses, and there is no one-size-fits-all solution. Companies strive to balance cost reduction with maximizing the value of their use cases by weighing factors such as latency, multi-modality, API costs, user needs, and prompt complexity. Model routing helps optimize performance and cost while improving scalability and user satisfaction. This session gives an overview of training cost-effective routing models on AI gateway logs, user feedback, and prompt and model features to design an intelligent model-routing AI agent. It covers different strategies for model routing, deployment in Mosaic AI, re-training, and evaluation through A/B testing and end-to-end Databricks workflows. It will also delve into the details of training data collection, feature engineering, prompt formatting, custom loss functions, architectural modifications, addressing cold-start problems, query embedding generation and clustering through VectorDB, and RL policy-based exploration.
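As a deliberately simplified sketch of the routing idea (not the session's actual agent), one could embed each prompt and train a classifier on labeled gateway-log examples to decide which model tier serves it; all names and training data below are hypothetical.

```python
# Toy model router: embed the prompt, predict whether it needs the large model.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical training data: prompts labeled 1 if the cheap model's answer
# was judged insufficient (derived from gateway logs + user feedback).
prompts = ["what is 2+2", "summarize this legal contract and flag the risks"]
needs_big_model = [0, 1]

router = LogisticRegression().fit(encoder.encode(prompts), needs_big_model)

def route(prompt: str) -> str:
    """Return the model tier that should serve this prompt."""
    p = router.predict_proba(encoder.encode([prompt]))[0, 1]
    return "large-model" if p > 0.5 else "small-model"
```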

Sigma Data Apps Product Releases & Roadmap | The Data Apps Conference

Organizations today require more than dashboards—they need applications that combine insights with data collection and action capabilities to drive meaningful change. In this session, Stipo Josipovic (Director of Product) will showcase the key innovations enabling this shift, from expanded write-back capabilities to workflow automation features.

You'll learn about Sigma's growing data app capabilities, including:

- Enhanced write-back features: Redshift and upcoming BigQuery support, bulk data entry, and form-based collection for structured workflows
- Advanced security controls: conditional editing and row-level security for precise data governance
- Intuitive interface components: containers, modals, and tabbed navigation for app-like experiences
- Powerful Actions framework: API integrations, notifications, and automated triggers to drive business processes

This session covers both recently released features and Sigma's upcoming roadmap, including detail views, simplified form-building, and new API actions to integrate with your tech stack. Discover how Sigma helps organizations move beyond analysis to meaningful action.


Automating Data Quality via Shift Left for Real-Time Web Data Feeds at Industrial Scale | Sarah McKenna | Shift Left Data Conference 2025

Real-time web data is one of the hardest data streams to automate with trust: websites don't want to be scraped, change constantly without notice, and employ sophisticated bot-blocking mechanisms to stop automated data collection. At Sequentum we cut our teeth on web data and have built a general-purpose cloud platform for any type of data ingestion and enrichment, one our clients can transparently audit and ultimately trust to deliver their mission-critical data on time and with quality to fuel their business decision-making.
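As a loose illustration of what an automated, shift-left quality gate on a web data feed might check before delivery (field names and thresholds here are hypothetical, not Sequentum's implementation):

```python
# Hypothetical pre-delivery quality gate for one scraped batch.
from datetime import datetime, timedelta, timezone

REQUIRED_FIELDS = {"url", "price", "scraped_at"}  # scraped_at: tz-aware datetime

def validate_batch(records: list[dict], min_rows: int = 1000) -> list[str]:
    """Return a list of quality violations for one delivery batch."""
    errors = []
    if len(records) < min_rows:
        errors.append(f"row count {len(records)} below expected {min_rows}")
    stale_cutoff = datetime.now(timezone.utc) - timedelta(hours=1)
    for i, rec in enumerate(records):
        missing = REQUIRED_FIELDS - rec.keys()
        if missing:
            errors.append(f"record {i}: missing fields {sorted(missing)}")
        elif rec["scraped_at"] < stale_cutoff:
            errors.append(f"record {i}: stale (scraped_at={rec['scraped_at']})")
    return errors  # empty list means the batch may ship
```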

Continuous Data Pipeline for Real time Benchmarking & Data Set Augmentation | Teleskope

ABOUT THE TALK: Building and curating representative datasets is crucial for accurate ML systems, and monitoring metrics post-deployment helps improve the model. Models working over unstructured language data may face data shifts, leading to unpredictable inferences. Open-source APIs and annotation tools streamline annotation and reduce analyst workload.

This talk discusses generating datasets and real-time precision/recall splits to detect data shifts, prioritize data collection, and retrain models.
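A minimal sketch of that monitoring idea: score a recent labeled window and flag a possible data shift when precision or recall drops below a baseline (the threshold here is hypothetical).

```python
# Rolling precision/recall check to flag possible data shifts.
from sklearn.metrics import precision_score, recall_score

BASELINE = 0.90  # hypothetical acceptance threshold

def detect_shift(y_true: list[int], y_pred: list[int]) -> bool:
    """True if the latest labeled window suggests the model has drifted."""
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    return precision < BASELINE or recall < BASELINE

# Windows that trigger here can be prioritized for annotation and retraining.
```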

ABOUT THE SPEAKER: Ivan Aguilar is a data scientist at Teleskope focused on building scalable models for detecting PII/PHI/secrets and other compliance-related entities within customers' clouds. Prior to joining Teleskope, Ivan was an ML engineer at Forge.AI, a Boston-based shop working on information extraction, content extraction, and other NLP-related tasks.

ABOUT DATA COUNCIL: Data Council (https://www.datacouncil.ai/) is a community and conference series that provides data professionals with the learning and networking opportunities they need to grow their careers.


Protecting PII/PHI Data in Data Lake via Column Level Encryption

A data breach is a concern for any company that collects data, including Northwestern Mutual. Every measure is taken to prevent identity theft and fraud for our customers; however, those measures are still not sufficient if the security around them is not updated periodically. Multiple layers of encryption are the most common approach used to avoid breaches, but unauthorized internal access to this sensitive data still poses a threat.

This presentation will walk you through the following steps:

- Designing encryption at the column level
- Protecting PII data that is used as a key for joins
- Enabling authorized users to decrypt data at run time
- Rotating the encryption keys when needed

At Northwestern Mutual, a combination of Fernet and AES encryption libraries, user-defined functions (UDFs), and Databricks secrets was used to develop a process to encrypt PII information. Access was provided only to those with a business need to decrypt it, which helps mitigate the internal threat. This was also done without duplicating data or metadata (views/tables). Our goal is to help you understand how you can build a secure data lake for your organization that eliminates threats of data breach both internally and externally. Associated blog: https://databricks.com/blog/2020/11/20/enforcing-column-level-encryption-and-avoiding-data-duplication-with-pii.html
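The pattern described (and detailed in the linked blog) can be condensed into a sketch like the one below; key handling is simplified to a generated key here, whereas in practice the key would come from a Databricks secret scope.

```python
# Condensed sketch of Fernet-based column-level encryption via PySpark UDFs.
from cryptography.fernet import Fernet
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
key = Fernet.generate_key()  # in practice: dbutils.secrets.get(scope, key_name)

@udf(StringType())
def encrypt_pii(value):
    """Encrypt one column value with the shared Fernet key."""
    return Fernet(key).encrypt(value.encode()).decode() if value else value

@udf(StringType())
def decrypt_pii(token):
    """Decrypt; grant access only to users with a business need."""
    return Fernet(key).decrypt(token.encode()).decode() if token else token

df = spark.createDataFrame([("123-45-6789", "Ada")], ["ssn", "name"])
encrypted = df.withColumn("ssn", encrypt_pii("ssn"))  # in place, no duplication
```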


Unifying Data Science and Business: AI Augmentation/Integration in Production Business Applications

Why is it so hard to integrate machine learning into real business applications? In 2019, Gartner predicted that AI augmentation would solve this problem and create $2.9 trillion of business value and 6.2 billion hours of worker productivity in 2021. A new realm of business science methods, encompassing AI-powered analytics that let people with domain expertise make smarter decisions faster and with more confidence, has also emerged as a solution to this problem. Dr. Harvey will demystify why integration challenges still account for $30.2 billion in annual global losses and discuss what it takes to integrate AI/ML code or algorithms into real business applications, including the effort that goes into making each component (data collection, preparation, training, and serving) production-ready so organizations can use the results of integrated models repeatedly with minimal user intervention. Finally, Dr. Harvey will discuss AISquared’s integration with Databricks and MLflow to accelerate the integration of AI by unifying data science with business. By adding five lines of code to your model, users can leverage AISquared’s model integration API framework, which provides a quick and easy way to integrate models directly into live business applications.


Distributed Machine Learning at Lyft

Data collection, preprocessing, and feature engineering are the fundamental steps in any machine learning pipeline. After feature engineering, being able to parallelize training across multiple low-cost machines reduces both cost and time, and training models in a distributed manner speeds up hyperparameter tuning. How can we unify these stages of the ML pipeline in one distributed training platform? And on Kubernetes, no less?

Our ML platform is based entirely on Kubernetes because of its scalability and the rapid bootstrapping time of resources. In this talk we will demonstrate how Lyft uses Spark on Kubernetes and Fugue (our home-grown unifying compute abstraction layer) to design a holistic, end-to-end ML pipeline system for distributed feature engineering, training, and prediction on our ML platform on top of Spark on K8s. We will also do a deep dive into how we abstract and hide infrastructure complexities so that our data scientists and research scientists can focus only on the business logic for their models through simple pythonic APIs and SQL. We let the users focus on "what to do" while the platform takes care of "how to do it". We will share our challenges, learnings, and the fun we had while implementing it. Using Spark on K8s has helped us achieve large-scale data processing at 90% less cost, at times bringing processing time down from 2 hours to less than 20 minutes.
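As a taste of the "simple pythonic APIs" idea, the sketch below uses the open-source Fugue `transform` function to run the same pandas-style logic locally or on Spark; the data and feature logic are invented for illustration, and Lyft's internal platform wraps considerably more than this.

```python
# Fugue lets the same pandas-style function run on different engines.
import pandas as pd
from fugue import transform
from pyspark.sql import SparkSession

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Business logic only: the platform decides where this runs."""
    df["fare_per_km"] = df["fare"] / df["distance_km"]
    return df

rides = pd.DataFrame({"fare": [12.0, 30.0], "distance_km": [3.0, 11.0]})

# Same function, two engines: no engine runs on pandas locally,
# passing a SparkSession distributes the work on Spark (on K8s at Lyft).
local = transform(rides, add_features, schema="*,fare_per_km:double")
spark = SparkSession.builder.getOrCreate()
distributed = transform(rides, add_features,
                        schema="*,fare_per_km:double", engine=spark)
```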


ANALYTICS IN THE AGE OF THE MODERN DATA STACK

The pace of change in the analytics sector has increased dramatically since 2012, with tons of new tools paving the way to the birth of the Modern Data Stack. The rapid explosion of tools has been met with a rapid explosion of restrictions, challenging the status quo of data collection, processing, and storage. How does that reflect on analytics and its future?

SERVER-SIDE TAGGING: DATA QUALITY OR DATA QUANTITY?

Simo explores the latest and greatest paradigm in Google's marketing stack: server-side tagging in Google Tag Manager. The benefits of moving data collection server-side are obvious – or are they? The same tools and mechanisms that help with data governance and oversight can be abused due to the opaqueness that comes with moving data collection server-side. In this talk, Simo takes an honest look at just what problems server-side tagging seeks to address, and whether it actually manages to do what it’s set out to do.