talk-data.com

Topic: Data Collection (146 tagged activities)

Activity trend: peak of 17 activities/quarter, 2020-Q1 to 2026-Q1

Activities: 146 · Newest first

Continuous Data Pipeline for Real time Benchmarking & Data Set Augmentation | Teleskope

ABOUT THE TALK: Building and curating representative datasets is crucial for accurate ML systems. Monitoring metrics post-deployment helps improve the model. Unstructured language models may face data shifts, leading to unpredictable inferences. Open-source APIs and annotation tools streamline annotation and reduce analyst workload.

This talk discusses generating datasets and real-time precision/recall splits to detect data shifts, prioritize data collection, and retrain models.
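The talk's pipeline itself is not public, but the core idea of watching precision/recall in real time to flag a data shift can be sketched with the standard library alone. The window size and thresholds below are illustrative assumptions, not values from the talk:

```python
from collections import deque

def make_drift_monitor(window=500, precision_floor=0.85, recall_floor=0.85):
    """Track precision and recall over a sliding window of labeled
    predictions and flag a possible data shift when either metric
    falls below its floor."""
    history = deque(maxlen=window)  # (predicted, actual) boolean pairs

    def observe(predicted, actual):
        history.append((predicted, actual))
        tp = sum(1 for p, a in history if p and a)
        fp = sum(1 for p, a in history if p and not a)
        fn = sum(1 for p, a in history if not p and a)
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 1.0
        shifted = precision < precision_floor or recall < recall_floor
        return precision, recall, shifted

    return observe
```

In a real deployment the labeled pairs would come from an annotation tool; a shift signal would then prioritize new data collection and retraining, as the talk describes.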

ABOUT THE SPEAKER: Ivan Aguilar is a data scientist at Teleskope focused on building scalable models for detecting PII/PHI/secrets and other compliance-related entities within customers' clouds. Prior to joining Teleskope, Ivan was an ML engineer at Forge.AI, a Boston-based shop working on information extraction, content extraction, and other NLP-related tasks.

ABOUT DATA COUNCIL: Data Council (https://www.datacouncil.ai/) is a community and conference series that provides data professionals with the learning and networking opportunities they need to grow their careers.

Make sure to subscribe to our channel for the most up-to-date talks from technical professionals on data-related topics, including data infrastructure, data engineering, ML systems, analytics, and AI from top startups and tech companies.

FOLLOW DATA COUNCIL: Twitter: https://twitter.com/DataCouncilAI LinkedIn: https://www.linkedin.com/company/datacouncil-ai/

Building machine learning systems with high predictive accuracy is inherently hard, and embedding these systems into great product experiences is doubly so. To build truly great machine learning products that reach millions of users, organizations need to marry great data science expertise with strong attention to user experience, design thinking, and a deep consideration for the impacts of their predictions on users and stakeholders. So how do you do that?

Today’s guest is Sam Stone, Director of Product Management, Pricing & Data at Opendoor, a real-estate technology company that leverages machine learning to streamline the home buying and selling process. Sam played an integral part in developing AI/ML products related to home pricing, including the Opendoor Valuation Model (OVM), market liquidity forecasting, portfolio optimization, and resale decision tooling. Prior to Opendoor, he was a co-founder and product manager at Ansaro, a SaaS startup using data science and machine learning to help companies improve hiring decisions. Sam holds degrees in Math and International Relations from Stanford and an MBA from Harvard.

Throughout the episode, we spoke about his principles for great ML product design, how to think about data collection for these types of products, how to package outputs from a model within a slick user interface, what interpretability means in the eyes of customers, how to be proactive about monitoring failure points, and much more.

Advances in Business Statistics, Methods and Data Collection

ADVANCES IN BUSINESS STATISTICS, METHODS AND DATA COLLECTION

Advances in Business Statistics, Methods and Data Collection delivers insights into the latest state of play in producing establishment statistics, obtained from businesses, farms and institutions. Presenting materials and reflecting discussions from the 6th International Conference on Establishment Statistics (ICES-VI), this edited volume provides a broad overview of the methodology underlying current establishment statistics across every aspect of the production life cycle, while spotlighting innovative and impactful advancements in the development, conduct, and evaluation of modern establishment statistics programs. Highlights include:

- Practical discussions on agile, timely, and accurate measurement of rapidly evolving economic phenomena such as globalization, new computer technologies, and the informal sector.
- Comprehensive explorations of administrative and new data sources and technologies, covering big (organic) data sources and methods for data integration, linking, machine learning and visualization.
- Detailed compilations of statistical programs’ responses to wide-ranging data collection and production challenges, among others those caused by the Covid-19 pandemic.
- In-depth examinations of business survey questionnaire design, computerization, pretesting methods, experimentation, and paradata.
- Methodical presentations of conventional and emerging procedures in survey statistics techniques for establishment statistics, encompassing probability sampling designs and sample coordination, non-probability sampling, missing data treatments, small area estimation and Bayesian methods.

Providing a broad overview of the most up-to-date science, this book challenges the status quo and prepares researchers for current and future challenges in establishment statistics and methods.
Perfect for survey researchers, government statisticians, National Bank employees, economists, and undergraduate and graduate students in survey research and economics, Advances in Business Statistics, Methods and Data Collection will also earn a place in the toolkit of researchers working with data in industries across a variety of fields.

Building Solutions with the Microsoft Power Platform

With the accelerating speed of business and the increasing dependence on technology, companies today are significantly changing the way they build in-house business solutions. Many now use low-code and no-code technologies to help them deal with specific issues, but that's just the beginning. With this practical guide, power users and developers will discover ways to resolve everyday challenges by building end-to-end solutions with the Microsoft Power Platform. Author Jason Rivera, who specializes in SharePoint and Microsoft 365 solution architecture, provides a comprehensive overview of how to use the Power Platform to build end-to-end solutions that address tactical business needs. By learning key components of the platform, including Power Apps, Power Automate, and Power BI, you'll be able to build low-code and no-code applications, automate repeatable business processes, and create interactive reports from available data. You will:

- Learn how the Power Platform apps work together
- Incorporate AI into the Power Platform without extensive ML or AI knowledge
- Create end-to-end solutions to solve tactical business needs, including data collection, process automation, and reporting
- Build AI-based solutions using Power Virtual Agents and AI Builder

With the increasing rate at which new data tools and platforms are being created, the modern data stack risks becoming just another buzzword data leaders use when talking about how they solve problems.

Alongside the arrival of new data tools comes the need for leaders to see beyond just the modern data stack and think deeply about how their data work can align with business outcomes; otherwise, they risk falling behind, trying to create value from innovative but irrelevant technology.

In this episode, Yali Sassoon joins the show to explore what the modern data stack really means, how to rethink it in terms of value creation, data collection versus data creation, the right way businesses should approach data ingestion, and much more.

Yali is the Co-Founder and Chief Strategy Officer at Snowplow Analytics, a behavioral data platform that empowers data teams to solve complex data challenges. Yali is a data expert with a background in both strategy and operations consulting, teaching companies how to use data properly to evolve their operations and improve their results.

Unlocking the Value of Real-Time Analytics

Storing data and making it accessible for real-time analysis is a huge challenge for organizations today. In 2020 alone, 64.2 billion GB of data was created or replicated, and the volume continues to grow. With this report, data engineers, architects, and software engineers will learn how to do deep analysis and automate business decisions while keeping their analytical capabilities timely. Author Christopher Gardner takes you through current practices for extracting data for analysis and uncovers the opportunities and benefits of making that extraction and analysis continuous. By the end of this report, you’ll know how to use new and innovative tools against your data to make real-time decisions, and you’ll understand how to examine the impact of real-time analytics on your business. You will:

- Learn the four requirements of real-time analytics: latency, freshness, throughput, and concurrency
- Determine where delays between data collection and actionable analytics occur
- Understand the reasons for real-time analytics and identify the tools you need to reach a faster, more dynamic level
- Examine changes in data storage and software while learning methodologies for overcoming delays in existing database architecture
- Explore case studies that show how companies use columnar data, sharding, and bitmap indexing to store and analyze data

Fast and fresh data can make the difference between a successful transaction and a missed opportunity. This report shows you how.
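Three of the four requirements the report names can be measured directly from event timestamps. The sketch below is illustrative (the field names `created_at` and `queryable_at` are assumptions, not the report's schema); concurrency is a property of the query engine and is not shown:

```python
def pipeline_metrics(events, now):
    """Summarize pipeline health from events carrying 'created_at' (when
    the event happened) and 'queryable_at' (when it became visible to
    analytics), both in epoch seconds."""
    latencies = sorted(e["queryable_at"] - e["created_at"] for e in events)
    first = min(e["queryable_at"] for e in events)
    last = max(e["queryable_at"] for e in events)
    return {
        "p50_latency_s": latencies[len(latencies) // 2],       # typical ingest delay
        "freshness_s": now - last,                             # age of newest queryable data
        "throughput_eps": len(events) / max(1, last - first),  # events landed per second
    }
```

Tracking these numbers continuously is what reveals where delays between data collection and actionable analytics occur.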

In this episode, Jason Foster talks to George McCrea, Chief of Staff at Royal Engineers Geographic. They explore the fascinating role of geospatial data soldiers in the military, how the military's experience with data collection and analysis has developed over centuries and how it's used to improve the lives of civilians in a variety of ways and to support government bodies to make strategic decisions.

Summary

In order to improve efficiency in any business you must first know what is contributing to wasted effort or missed opportunities. When your business operates across multiple locations it becomes even more challenging and important to gain insights into how work is being done. In this episode Tommy Yionoulis shares his experiences working in the service and hospitality industries and how they led him to found OpsAnalitica, a platform for collecting and analyzing metrics on multi-location businesses and their operational practices. He discusses the challenges of making data collection purposeful and efficient without distracting employees from their primary duties, and how business owners can use the provided analytics to support their staff in their duties.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high-throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!

RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.

Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.

You wake up to a Slack message from your CEO, who’s upset because the company’s revenue dashboard is broken. You’re told to fix it before this morning’s board meeting, which is just minutes away. Enter Metaplane, the industry’s only self-serve data observability tool. In just a few clicks, you identify the issue’s root cause, conduct an impact analysis, and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free-forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced

Protecting PII/PHI Data in Data Lake via Column Level Encryption

A data breach is a concern for any data collection company, including Northwestern Mutual. Every measure is taken to avoid identity theft and fraud for our customers; however, those measures are still not sufficient if the security around the data is not updated periodically. Multiple layers of encryption are the most common approach used to avoid breaches; however, unauthorized internal access to this sensitive data still poses a threat.

This presentation will walk you through the following steps:

- A design for building encryption at the column level
- How to protect PII data that is used as a key for joins
- Allowing authorized users to decrypt data at run time
- Rotating the encryption keys if needed

At Northwestern Mutual, a combination of Fernet and AES encryption libraries, user-defined functions (UDFs), and Databricks secrets was used to develop a process to encrypt PII information. Access was only provided to those with a business need to decrypt it, which helps mitigate the internal threat. This is also done without duplicating data or metadata (views/tables). Our goal is to help you understand how you can build a secure data lake for your organization that can eliminate threats of data breach both internally and externally. Associated blog: https://databricks.com/blog/2020/11/20/enforcing-column-level-encryption-and-avoiding-data-duplication-with-pii.html
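The linked blog post has the exact Spark UDFs. As a minimal standard-library sketch of one piece of the design, the "PII used as a key for joins" case, deterministic keyed hashing keeps equality joins working without exposing raw values; reversible encryption (e.g. Fernet) is reserved for columns that authorized users must actually decrypt. The key and values below are illustrative, not from the talk:

```python
import hmac
import hashlib

def tokenize(value, key):
    """Deterministically pseudonymize a PII value with HMAC-SHA256.
    Equal inputs always produce equal tokens, so the column can still
    serve as an equality-join key, but the raw value cannot be read
    back from the token. Rotating the key means re-tokenizing."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

# In practice the key would come from a secret store (e.g. Databricks
# secrets), never from code.
key = b"example-key-rotate-me"
token_a = tokenize("alice@example.com", key)
token_b = tokenize("alice@example.com", key)
```

Because `token_a == token_b`, two tables tokenized with the same key can still be joined on the pseudonymized column, while an unauthorized reader sees only opaque hex strings.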

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Unifying Data Science and Business: AI Augmentation/Integration in Production Business Applications

Why is it so hard to integrate machine learning into real business applications? In 2019 Gartner predicted that AI augmentation would solve this problem and would create $2.9 trillion of business value and 6.2 billion hours of worker productivity in 2021. A new realm of business science methods, encompassing AI-powered analytics that allow people with domain expertise to make smarter decisions faster and with more confidence, has also emerged as a solution to this problem. Dr. Harvey will demystify why integration challenges still account for $30.2 billion in annual global losses and discuss what it takes to integrate AI/ML code or algorithms into real business applications, along with the effort that goes into making each component (data collection, preparation, training, and serving) production-ready, enabling organizations to use the results of integrated models repeatedly with minimal user intervention. Finally, Dr. Harvey will discuss AISquared's integration with Databricks and MLflow to accelerate the integration of AI by unifying data science with business. By adding five lines of code to your model, users can now leverage AISquared's model integration API framework, which provides a quick and easy way to integrate models directly into live business applications.


Distributed Machine Learning at Lyft

Data collection, preprocessing, and feature engineering are the fundamental steps in any machine learning pipeline. After feature engineering, being able to parallelize training on multiple low-cost machines helps reduce both cost and time, and being able to train models in a distributed manner speeds up hyperparameter tuning. How can we unify these stages of the ML pipeline in one distributed training platform? And that too on Kubernetes?
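The fan-out pattern behind parallel hyperparameter tuning can be sketched with the standard library. This is only an illustration of the idea: on a platform like Lyft's, each job would run as a distributed Spark-on-Kubernetes task rather than a local thread, and `train_and_score` stands in for a real training run:

```python
from concurrent.futures import ThreadPoolExecutor

def train_and_score(lr):
    """Stand-in for one training job: 'fit' a model with this learning
    rate and return a validation score. The toy objective peaks at
    lr = 0.1, so the search should select that value."""
    return 1.0 - abs(lr - 0.1)

grid = [0.001, 0.01, 0.1, 1.0]
with ThreadPoolExecutor(max_workers=4) as pool:
    # evaluate every hyperparameter setting concurrently
    scores = list(pool.map(train_and_score, grid))
best_lr = grid[scores.index(max(scores))]
```

Swapping the executor for a cluster scheduler changes where the jobs run without changing the search logic, which is exactly the separation a unified training platform aims for.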

Our ML platform is based entirely on Kubernetes because of its scalability and rapid bootstrapping time for resources. In this talk we will demonstrate how Lyft uses Spark on Kubernetes and Fugue (our home-grown unifying compute abstraction layer) to design a holistic end-to-end ML pipeline system for distributed feature engineering, training, and prediction for our customers on our ML platform on top of Spark on K8s. We will also do a deep dive to show how we are abstracting and hiding infrastructure complexities so that our data scientists and research scientists can focus only on the business logic for their models through simple Pythonic APIs and SQL. We let users focus on "what to do" and the platform takes care of "how to do it". We will share our challenges, learnings, and the fun we had while implementing it. Using Spark on K8s has helped us achieve large-scale data processing at 90% lower cost, at times bringing processing time down from 2 hours to less than 20 minutes.


Mastering Microsoft Power BI - Second Edition

Dive deep into Microsoft Power BI with the second edition of Mastering Microsoft Power BI. This comprehensive book equips you with the skills to transform business data into actionable insights using Power BI's latest features and techniques. From efficient data retrieval and transformation processes to creating interactive dashboards that tell impactful data stories, you will gain actionable knowledge every step of the way.

What this Book will help me do

- Master data collection and modeling using the Power Query M language
- Gain expertise in designing DirectQuery, import, and composite data models
- Understand how to create advanced analytics reports using DAX and Power BI visuals
- Learn to manage the Power BI environment as an administrator with Premium capacity
- Develop insightful, scalable, and visually impactful dashboards and reports

Author(s)

Greg Deckler, a seasoned Power BI expert and solution architect, and Brett Powell, an experienced BI consultant and data visualization specialist, bring their extensive practical knowledge to this book. Together, they share their real-world expertise and proven techniques for applying Power BI's diverse capabilities.

Who is it for?

This book is ideal for business intelligence professionals and intermediate Power BI users. If you're looking to master data visualization, prepare insightful dashboards, and explore Power BI's full potential, this is for you. A basic understanding of BI concepts and familiarity with Power BI will ensure you get the most value.

ANALYTICS IN THE AGE OF THE MODERN DATA STACK

The pace of change in the analytics sector has increased dramatically since 2012, with tons of new tools paving the way to the birth of the Modern Data Stack. The rapid explosion of tools has been met with a rapid explosion of restrictions, challenging the status quo of data collection, processing, and storage. How does that reflect on analytics and its future?

SERVER-SIDE TAGGING: DATA QUALITY OR DATA QUANTITY?

Simo explores the latest and greatest paradigm in Google's marketing stack: server-side tagging in Google Tag Manager. The benefits of moving data collection server-side are obvious, or are they? The same tools and mechanisms that help with data governance and oversight can be abused, due to the opaqueness associated with moving data collection server-side. In this talk, Simo takes an honest look at just what problems server-side tagging seeks to address, and whether it actually manages to do what it set out to do.
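One concrete governance benefit of a server-side container is that it sits between the browser and third-party vendors, so it can enforce an allowlist before forwarding anything. A minimal sketch of that filtering step (the field names and allowlist are hypothetical, not part of GTM's API):

```python
# Hypothetical allowlist; a real one would come from your tracking plan.
ALLOWED_FIELDS = {"event_name", "page_path", "order_value"}

def redact_for_vendor(event):
    """Drop any field not explicitly allowlisted before forwarding an
    event to a third-party vendor. Stray PII collected client-side
    then never leaves your own infrastructure."""
    return {k: v for k, v in event.items() if k in ALLOWED_FIELDS}

incoming = {
    "event_name": "purchase",
    "page_path": "/checkout",
    "order_value": 99.0,
    "email": "user@example.com",  # collected client-side, never forwarded
}
outgoing = redact_for_vendor(incoming)
```

Simo's caution cuts the other way too: the same server-side hop could silently enrich or reroute data, which is the opaqueness the talk warns about.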

Managing and Visualizing Your BIM Data

Managing and Visualizing Your BIM Data is an essential guide for AEC professionals who want to harness the power of data to enhance their projects. Designed with a hands-on approach, this book delves into using Autodesk Dynamo for data collection and Microsoft Power BI for creating insightful dashboards. By the end, readers will be adept at connecting BIM models to interactive visualizations.

What this Book will help me do

- Gain a deep understanding of data collection workflows in Autodesk Dynamo.
- Learn to connect Building Information Modeling (BIM) data to Power BI dashboards.
- Master the basics and advanced features of Dynamo for BIM data management.
- Create dynamic and visually appealing Power BI dashboards for AEC projects.
- Explore real-world use cases with expert-guided hands-on examples.

Author(s)

The authors, Pellegrino, Bottiglieri, Crump, Pieper, and Touil, are experienced professionals in the AEC and software development industries. With extensive backgrounds in Building Information Modeling (BIM) and data visualization, they bring practical insights combined with a passion for teaching. Their approach ensures readers not only learn the tools but also understand the reasoning behind best practices.

Who is it for?

This book is ideal for BIM managers and coordinators, design technology managers, and other Architecture, Engineering, and Construction (AEC) professionals. Readers with a foundational knowledge of BIM will find it particularly beneficial for enhancing their data analysis and reporting capabilities. If you're aiming to elevate your skill set in managing BIM data and creating impactful visualizations, this guide is for you.

Summary

Biology has been gaining a lot of attention in recent years, even before the pandemic. As an outgrowth of that popularity, a new field has grown up that pairs statistics and computational analysis with scientific research, namely bioinformatics. This brings with it a unique set of challenges for data collection, data management, and analytical capabilities. In this episode Jillian Rowe shares her experience of working in the field and supporting teams of scientists and analysts with the data infrastructure that they need to get their work done. This is a fascinating exploration of the collaboration between data professionals and scientists.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3,000 on an annual subscription.

Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated data observability platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today!

Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit, a half-day virtual event featuring the first U.S. Chief Data Scientist, the founder of the Data Mesh, the creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!

Your host is Tobias Macey and today I’m interviewing Jillian Rowe about data engineering practices for bioinformatics projects.

Interview

Introduction

- How did you get involved in the area of data management?
- How did you get into the field of bioinformatics?
- Can you describe what is unique about data needs in bioinformatics?
- What are some of the problems that you have found yourself regularly solving for your clients?
- When building data engineering stacks for bioinformatics, what are the attributes that you are optimizing for? (e.g. speed, UX, scale, correctness, etc.)
- Can you describe a typical set of technologies that you implement when working on a new project?

What kinds of systems do you need to integrate with?

What are the data formats that ar

We talked about:

- Data-led academy
- Arpit’s background
- Growth marketing
- Being data-led
- Data-led vs data-driven
- Documenting your data: creating a tracking plan
- Understanding your data
- Tools for creating a tracking plan
- Data flow stages
- Tracking events — examples
- Collecting the data
- Storing and analyzing the data
- Data activation
- Tools for data collection
- Data warehouses
- Reverse ETL tools
- Customer data platforms
- Modern data stack for growth
- Buy vs build
- People we need in the data flow
- Data democratization
- Motivating people to document data
- Product-led vs data-led
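A tracking plan, as discussed in the episode, is just a documented contract for your events, and the simplest way to make it enforceable is to validate events against it in code. The plan and event names below are hypothetical, purely to illustrate the pattern:

```python
# Hypothetical tracking plan: event name -> required properties and types.
TRACKING_PLAN = {
    "signup_completed": {"user_id": str, "plan": str},
    "report_viewed": {"user_id": str, "report_id": str},
}

def validate_event(name, properties):
    """Check one analytics event against the documented tracking plan.
    Returns a list of problems; an empty list means the event conforms."""
    spec = TRACKING_PLAN.get(name)
    if spec is None:
        return ["unknown event: " + name]
    problems = []
    for prop, expected_type in spec.items():
        if prop not in properties:
            problems.append("missing property: " + prop)
        elif not isinstance(properties[prop], expected_type):
            problems.append("wrong type for property: " + prop)
    return problems
```

Running a check like this in CI, or at the collection stage, is one way to keep the documented plan and the events actually being collected from drifting apart.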

Links:

https://dataled.academy/

Join our Slack: https://datatalks.club/slack.html

Advances in Longitudinal Survey Methodology

Explore an up-to-date overview of best practices in the implementation of longitudinal surveys from leading experts in the field of survey methodology.

Advances in Longitudinal Survey Methodology delivers a thorough review of the most current knowledge in the implementation of longitudinal surveys. The book provides a comprehensive overview of the many advances that have been made in the field of longitudinal survey methodology over the past fifteen years, as well as extending the topic coverage of the earlier volume, “Methodology of Longitudinal Surveys”, published in 2009. This new edited volume covers subjects like dependent interviewing, interviewer effects, panel conditioning, rotation group bias, measurement of cognition, and weighting. New chapters discussing the recent shift to mixed-mode data collection and obtaining respondents’ consent to data linkage add to the book’s relevance to students and social scientists seeking to understand the modern challenges facing data collectors today. Readers will also benefit from the inclusion of:

- A thorough introduction to refreshment sampling for longitudinal surveys, including consideration of principles, sampling frame, sample design, questionnaire design, and frequency
- An exploration of the collection of biomarker data in longitudinal surveys, including detailed measurements of ill health, biological pathways, and genetics in longitudinal studies
- An examination of innovations in participant engagement and tracking in longitudinal surveys, including current practices and new evidence on internet and social media for participant engagement
An invaluable source for post-graduate students, professors, and researchers in the field of survey methodology, Advances in Longitudinal Survey Methodology will also earn a place in the libraries of anyone who regularly works with or conducts longitudinal surveys and requires a one-stop reference for the latest developments and findings in the field.