talk-data.com

Topic

Data Collection

146 tagged

Activity Trend

Peak: 17 activities per quarter (2020-Q1 to 2026-Q1)

Activities

146 activities · Newest first

IoT-Based Data Analytics for the Healthcare Industry

IoT Based Data Analytics for the Healthcare Industry: Techniques and Applications explores recent advances in the analysis of healthcare industry data through IoT data analytics. The book covers the analysis of ubiquitous data generated by the healthcare industry from a wide range of sources, including patients, doctors, hospitals, and health insurance companies. It provides AI solutions and support for healthcare industry end-users who need to analyze and manipulate this vast amount of data. These solutions feature deep learning and a wide range of intelligent methods, including simulated annealing, tabu search, genetic algorithms, ant colony optimization, and particle swarm optimization. The book also explores challenges, opportunities, and future research directions, and discusses the data collection and pre-processing stages, challenges and issues in data collection, data handling, and data collection set-up. Healthcare industry data, or streaming data generated by ubiquitous sensors cocooned into the IoT, requires advanced analytics to transform raw data into information. With advances in computing power, communications, and techniques for data acquisition, advanced data analytics is in high demand.

- Provides state-of-the-art methods and current trends in data analytics for the healthcare industry
- Addresses the top concerns in the healthcare industry using IoT, data analytics, and machine learning and deep learning techniques
- Discusses several potential AI techniques developed using IoT for the healthcare industry
- Explores challenges, opportunities, and future research directions, and discusses the data collection and pre-processing stages
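Of the metaheuristics the blurb lists, simulated annealing is the simplest to sketch. The following is a minimal, illustrative Python implementation on a toy one-dimensional cost function; it is not drawn from the book, and the cooling schedule and parameters are arbitrary choices.

```python
import math
import random

def simulated_annealing(cost, start, step, temp=10.0, cooling=0.95,
                        iters=500, seed=42):
    """Minimize `cost` starting from `start`, perturbing with `step`.

    A textbook simulated-annealing loop: always accept improvements,
    and accept worse moves with probability exp(-delta / temperature),
    which shrinks as the temperature cools.
    """
    rng = random.Random(seed)
    current, current_cost = start, cost(start)
    best, best_cost = current, current_cost
    for _ in range(iters):
        candidate = step(current, rng)
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        if delta < 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        temp *= cooling
    return best, best_cost

# Toy example: find the minimum of (x - 3)^2 over the reals.
best_x, best_cost = simulated_annealing(
    cost=lambda x: (x - 3.0) ** 2,
    start=0.0,
    step=lambda x, rng: x + rng.uniform(-1.0, 1.0),
)
```

The same loop works for combinatorial problems by swapping in a discrete `step` function (e.g. exchanging two elements of a schedule).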

Summary Event-based data is a rich source of information for analytics, but only if the event structures are consistent. The team at Iteratively are building a platform to manage the end-to-end flow of collaboration around what events are needed, how to structure the attributes, and how they are captured. In this episode founders Patrick Thompson and Ondrej Hrebicek discuss the problems that they have experienced as a result of inconsistent event schemas, how the Iteratively platform integrates the definition, development, and delivery of event data, and the benefits of elevating the visibility of event data for improving the effectiveness of the resulting analytics. If you are struggling with inconsistent implementations of event data collection, lack of clarity on what attributes are needed, or uncertainty about how the data is being used, then this is definitely a conversation worth following.
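Iteratively's actual implementation is not shown in the episode, but the core idea of enforcing a declared schema on event data before it is sent can be sketched in a few lines of Python. The `validate_event` helper and the `song_played` schema below are hypothetical illustrations, not the Iteratively API.

```python
def validate_event(event, schema):
    """Check an analytics event against a declared schema.

    Returns a list of problems; an empty list means the event conforms.
    A deliberately minimal sketch of schema enforcement.
    """
    problems = []
    # Every required field must be present with the declared type.
    for field, expected_type in schema.get("required", {}).items():
        if field not in event:
            problems.append(f"missing required field: {field}")
        elif not isinstance(event[field], expected_type):
            problems.append(f"{field}: expected {expected_type.__name__}, "
                            f"got {type(event[field]).__name__}")
    # Reject fields nobody declared, so the warehouse never sees surprises.
    allowed = set(schema.get("required", {})) | set(schema.get("optional", {}))
    for field in event:
        if field not in allowed:
            problems.append(f"unexpected field: {field}")
    return problems

# A schema for a hypothetical "song_played" event.
SONG_PLAYED = {
    "required": {"song_id": str, "duration_ms": int},
    "optional": {"playlist_id": str},
}

ok = validate_event({"song_id": "abc", "duration_ms": 52000}, SONG_PLAYED)
bad = validate_event({"song_id": 123, "genre": "jazz"}, SONG_PLAYED)
```

Running such a check in CI or at the SDK boundary is what turns event schemas from documentation into an enforced contract.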

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! If you’ve been exploring scalable, cost-effective and secure ways to collect and route data across your organization, RudderStack is the only solution that helps you turn your own warehouse into a state-of-the-art customer data platform. Their mission is to empower data engineers to fully own their customer data infrastructure and easily push value to other parts of the organization, like marketing and product management. With their open-source foundation, fixed pricing, and unlimited volume, they are enterprise ready, but accessible to everyone. Go to dataengineeringpodcast.com/rudder to request a demo and get one free month of access to the hosted platform along with a free t-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today! Your host is Tobias Macey and today I’m interviewing Patrick Thompson and Ondrej Hrebicek about Iteratively, a platform for enforcing consistent schemas for your event data

Interview

Introduction

How did you get involved in the area of data management?

Can you start by describing what you are building at Iteratively and your motivation for creating it?

What are some of the ways that you have seen inconsistent message structures cause problems?

What are some of the common anti-patterns that you have seen for managing the structure of event messages?

What are the benefits that Iteratively provides for the different roles in an organization?

Can you describe the workflow for a team using

Summary Wind energy is an important component of an ecologically friendly power system, but there are a number of variables that can affect the overall efficiency of the turbines. Michael Tegtmeier founded Turbit Systems to help operators of wind farms identify and correct problems that contribute to suboptimal power outputs. In this episode he shares the story of how he got started working with wind energy, the system that he has built to collect data from the individual turbines, and how he is using machine learning to provide valuable insights to produce higher energy outputs. This was a great conversation about using data to improve the way the world works.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack, which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt. You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms.
For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today! Your host is Tobias Macey and today I’m interviewing Michael Tegtmeier about Turbit, a machine learning powered platform for performance monitoring of wind farms

Interview

Introduction

How did you get involved in the area of data management?

Can you start by describing what you are building at Turbit and your motivation for creating the business?

What are the most problematic factors that contribute to low performance in power generation with wind turbines?

What is the current state of the art for accessing and analyzing data for wind farms?

What information are you able to gather from the SCADA systems in the turbine?

How uniform is the availability and formatting of data from different manufacturers?

How are you handling data collection for the individual turbines?

How much information are you processing at the point of collection vs. sending to a centralized data store?

Can you describe the system architecture of Turbit and the lifecycle of turbine data as it propag
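The episode outline above does not detail Turbit's models, but one common baseline for spotting suspect turbine readings is to compare each SCADA sample against a rolling statistical baseline. The following is a toy Python sketch with simulated power values, not Turbit's actual approach.

```python
from statistics import mean, stdev

def flag_anomalies(readings, window=5, threshold=3.0):
    """Flag indices where a reading deviates from the mean of the
    preceding `window` readings by more than `threshold` sample
    standard deviations. A toy baseline detector.
    """
    flagged = []
    for i in range(window, len(readings)):
        baseline = readings[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(readings[i] - mu) > threshold * sigma:
            flagged.append(i)
    return flagged

# Simulated power output (kW) with one sudden drop at index 8.
power = [1500, 1510, 1495, 1505, 1500, 1498, 1502, 1501, 900, 1499]
anomalies = flag_anomalies(power)
```

A check this cheap can run at the point of collection on the turbine itself, which is one answer to the edge-versus-central processing question raised in the interview.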

Intelligent Data Analysis
This book focuses on methods and tools for intelligent data analysis, aimed at narrowing the increasing gap between data gathering and data comprehension. Emphasis is also given to solving problems that result from automated data collection, such as analysis of computer-based patient records, data warehousing tools, intelligent alarming, and effective and efficient monitoring. The book aims to describe the different approaches of Intelligent Data Analysis from a practical point of view: solving common life problems with data analysis tools.

Summary We have machines that can listen to and process human speech in a variety of languages, but dealing with unstructured sounds in our environment is a much greater challenge. The team at Audio Analytic are working to impart a sense of hearing to our myriad devices with their sound recognition technology. In this episode Dr. Chris Mitchell and Dr. Thomas le Cornu describe the challenges that they are faced with in the collection and labelling of high quality data to make this possible, including the lack of a publicly available collection of audio samples to work from, the need for custom metadata throughout the processing pipeline, and the need for customized data processing tools for working with sound data. This was a great conversation about the complexities of working in a niche domain of data analysis and how to build a pipeline of high quality data from collection to analysis.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey and today I’m interviewing Dr. Chris Mitchell and Dr. Thomas le Cornu about Audio Analytic, a company that is building sound recognition technology that is giving machines a sense of hearing beyond speech and music.

Interview

Introduction

How did you get involved in the area of data management?

Can you start by describing what you are building at Audio Analytic?

What was your motivation for building an AI platform for sound recognition?

What are some of the ways that your platform is being used?

What are the unique challenges that you have faced in working with arbitrary sound data?

How do you handle the collection and labelling of the source data that you rely on for building your models?

Beyond just collection and storage, what is your process for defining a taxonomy of the audio data that you are working with?

How has the taxonomy had to evolve, and what assumptions have had to change, as you progressed in building the data set and the resulting models?

challenges of building an embeddable AI model

update cycle

difficulty of identifying relevant audio and dealing with literal noise in the input data

rights and ownership challenges in collection of source data

What was your design process for constructing a pipeline for the audio data that you need to process?

Can you describe how your overall data management system is
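The taxonomy discussion above can be made concrete with a small data structure. This is a hypothetical sketch of a hierarchical sound-label taxonomy with ancestor lookup, not Audio Analytic's actual ontology.

```python
class Taxonomy:
    """A minimal hierarchical label taxonomy for audio events.

    Storing labels with parent links lets a labelled clip contribute
    to every broader category it belongs to.
    """
    def __init__(self):
        self.parent = {}

    def add(self, label, parent=None):
        self.parent[label] = parent

    def ancestors(self, label):
        """Walk parent links from a label up to the root."""
        chain = []
        while self.parent.get(label) is not None:
            label = self.parent[label]
            chain.append(label)
        return chain

# Illustrative labels only.
tax = Taxonomy()
tax.add("sound")
tax.add("alarm", parent="sound")
tax.add("smoke_alarm", parent="alarm")
tax.add("glass_break", parent="sound")

lineage = tax.ancestors("smoke_alarm")
```

Keeping the hierarchy explicit also makes the evolution question tractable: adding a new intermediate node only rewires parent links, without relabelling the underlying clips.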

Summary The software applications that we build for our businesses are a rich source of data, but accessing and extracting that data is often a slow and error-prone process. Rookout has built a platform to separate the data collection process from the lifecycle of your code. In this episode, CTO Liran Haimovitch discusses the benefits of shortening the iteration cycle and bringing non-engineers into the process of identifying useful data. This was a great conversation about the importance of democratizing the work of data collection.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! Your host is Tobias Macey and today I’m interviewing Liran Haimovitch, CTO of Rookout, about the business value of operations metrics and other dark data in your organization

Interview

Introduction

How did you get involved in the area of data management?

Can you start by describing the types of data that we typically collect for the systems operations context?

What are some of the business questions that can be answered from these data sources?

What are some of the considerations that developers and operations engineers need to be aware of when they are defining the collection points for system metrics and log messages?

What are some effective strategies that you have found for including business stakeholders in the process of defining these collection points?

One of the difficulties in building useful analyses from any source of data is maintaining the appropriate context. What are some of the necessary metadata that should be maintained along with operational metrics?

What are some of the shortcomings in the systems we design and use for operational data stores in terms of making the collected data useful for other purposes?

How does the existing tooling need to be changed or augmented to simplify the collaboration between engineers and stakeholders for defining and collecting the needed information?

The types of systems that we use for collecting and analyzing operations metrics are often designed and optimized for different access patterns and data formats than those used for analytical and exploratory purposes. What are your thoughts on how to incorporate the collected metrics with behavioral data?

What are some of the other sources of dark data that we should keep an eye out for in our organizations?
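The metadata question raised in the interview can be illustrated with a short sketch: every metric sample carries the contextual labels (service, region, version) needed to answer business questions later. The field names are illustrative, not any particular vendor's API.

```python
import time

def record_metric(store, name, value, context):
    """Append a metric sample along with the contextual metadata
    (service, region, deploy version, ...) needed to interpret it later.
    """
    store.append({
        "name": name,
        "value": value,
        "timestamp": time.time(),
        "context": dict(context),  # copy so callers can reuse their dict
    })

samples = []
record_metric(samples, "checkout_latency_ms", 182,
              {"service": "checkout", "region": "eu-west", "version": "1.4.2"})
record_metric(samples, "checkout_latency_ms", 240,
              {"service": "checkout", "region": "us-east", "version": "1.4.2"})

# With context attached, business questions become filterable queries:
eu = [s["value"] for s in samples if s["context"]["region"] == "eu-west"]
```

Without that attached context, the same numbers are operational "dark data": present in the store but unanswerable for anyone outside the team that emitted them.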

Contact Info

LinkedIn @Liran_Last on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

Rookout Cybersecurity DevOps DataDog Graphite Elasticsearch Logz.io Kafka

The intro and o

Building an Anonymization Pipeline

How can you use data in a way that protects individual privacy but still provides useful and meaningful analytics? With this practical book, data architects and engineers will learn how to establish and integrate secure, repeatable anonymization processes into their data flows and analytics in a sustainable manner. Luk Arbuckle and Khaled El Emam from Privacy Analytics explore end-to-end solutions for anonymizing device and IoT data, based on collection models and use cases that address real business needs. These examples come from some of the most demanding data environments, such as healthcare, using approaches that have withstood the test of time.

- Create anonymization solutions diverse enough to cover a spectrum of use cases
- Match your solutions to the data you use, the people you share it with, and your analysis goals
- Build anonymization pipelines around various data collection models to cover different business needs
- Generate an anonymized version of original data or use an analytics platform to generate anonymized outputs
- Examine the ethical issues around the use of anonymized data
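One small ingredient of such a pipeline, keyed pseudonymization, can be sketched in Python. This is an illustrative example, not taken from the book; note that pseudonymization alone does not guarantee anonymity, which is precisely why full pipelines need the collection-model analysis the book describes.

```python
import hashlib
import hmac

def pseudonymize(record, fields, key):
    """Replace direct identifiers with keyed HMAC-SHA256 digests.

    Keyed hashing keeps values linkable across datasets that share the
    key while hiding the raw identifier. The remaining quasi-identifiers
    (age, diagnosis, ...) still need separate treatment.
    """
    out = dict(record)
    for field in fields:
        digest = hmac.new(key, str(record[field]).encode(), hashlib.sha256)
        out[field] = digest.hexdigest()[:16]  # truncated for readability
    return out

# Hypothetical healthcare record.
record = {"patient_id": "P-1042", "age": 57, "diagnosis": "J45"}
safe = pseudonymize(record, fields=["patient_id"], key=b"rotate-me-regularly")
```

Because the digest is keyed, the same patient maps to the same token in every dataset that shares the key, and rotating or destroying the key severs that linkability.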

Summary Misaligned priorities across business units can lead to tensions that drive members of the organization to build data and analytics projects without the guidance or support of engineering or IT staff. The availability of cloud platforms and managed services makes this a viable option, but can lead to downstream challenges. In this episode Sean Knapp and Charlie Crocker share their experiences of working in and with companies that have dealt with shadow IT projects and the importance of enabling and empowering the use and exploration of data and analytics. If you have ever been frustrated by seemingly draconian policies or struggled to align everyone on your supported platform, then this episode will help you gain some perspective and set you on a path to productive collaboration.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. Are you spending too much time maintaining your data pipeline? Snowplow empowers your business with a real-time event data pipeline running in your own cloud account without the hassle of maintenance. Snowplow takes care of everything from installing your pipeline in a couple of hours to upgrading and autoscaling so you can focus on your exciting data projects. Your team will get the most complete, accurate and ready-to-use behavioral web and mobile data, delivered into your data warehouse, data lake and real-time streams. Go to dataengineeringpodcast.com/snowplow today to find out why more than 600,000 websites run Snowplow. Set up a demo and mention you’re a listener for a special offer! You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council.
Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Sean Knapp, Charlie Crocker about shadow IT in data and analytics

Interview

Introduction

How did you get involved in the area of data management?

Can you start by sharing your definition of shadow IT?

What are some of the reasons that members of an organization might start building their own solutions outside of what is supported by the engineering teams?

What are some of the roles in an organization that you have seen involved in these shadow IT projects?

What kinds of tools or platforms are well suited for being provisioned and managed without involvement from the platform team?

What are some of the pitfalls that these solutions present as a result of their initial ease of use?

What are the benefits to the organization of individuals or teams building and managing their own solutions?

What are some of the risks associated with these implementations of data collection, storage, man

podcast_episode
by Val Kroll, Julie Hoyer, Tim Wilson (Analytics Power Hour - Columbus (OH)), Mike Robins (Poplin Data), Moe Kiss (Canva), Michael Helbling (Search Discovery)

Once upon a time, website behavioral data was extracted from the log files of web servers. That data was messy to work with and missing some information that analysts really wanted. This was the OG "server-side" data collection. Then, the JavaScript page tag arrived on the scene, and the data became richer and cleaner and easier to implement. That data was collected by tags firing in the user's browser (which was called "client-side" data collection). But then ad blockers and browser quirks and cross-device behavior turned out to introduce pockets of unreliability into THAT data. And now here we are. What was old is now somewhat new again, and there is a lot to be unpacked with the ins and outs and tradeoffs of client-side vs. server-side data collection. On this episode, Mike Robins from Poplin Data joined the gang to explore the topic from various angles. For complete show notes, including links to items mentioned in this episode and a transcript of the show, visit the show page.
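The server-side collection the episode describes can be sketched as a first-party collection handler: the server stamps the fields it can vouch for (receive time, source IP) rather than trusting a client that may be blocked or spoofed, and filters the payload against an allowed schema. The field names and schema below are hypothetical.

```python
import json

def handle_collect(raw_body, server_context):
    """Process one analytics event received by a first-party
    server-side collection endpoint.

    Trusted fields come from the server context; everything else is
    filtered against an explicit allow-list before storage.
    """
    event = json.loads(raw_body)
    # Server-stamped fields the client cannot forge or block.
    event["received_at"] = server_context["now"]
    event["ip"] = server_context["client_ip"]
    # Drop fields the schema does not allow through.
    allowed = {"event", "page", "received_at", "ip"}
    return {k: v for k, v in event.items() if k in allowed}

# Simulated request: the client payload carries a stray ad-click ID.
processed = handle_collect(
    '{"event": "page_view", "page": "/pricing", "fbclid": "abc123"}',
    {"now": 1700000000, "client_ip": "203.0.113.9"},
)
```

The trade-off discussed in the episode shows up even in this sketch: the server gains reliability and control, but loses browser-only signals unless the client explicitly sends them.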

The Decision Maker's Handbook to Data Science: A Guide for Non-Technical Executives, Managers, and Founders

Data science is expanding across industries at a rapid pace, and the companies first to adopt best practices will gain a significant advantage. To reap the benefits, decision makers need to have a confident understanding of data science and its application in their organization. It is easy for novices to the subject to feel paralyzed by intimidating buzzwords, but what many don’t realize is that data science is in fact quite multidisciplinary: useful in the hands of business analysts, communications strategists, designers, and more. With the second edition of The Decision Maker’s Handbook to Data Science, you will learn how to think like a veteran data scientist and approach solutions to business problems in an entirely new way. Author Stylianos Kampakis provides you with the expertise and tools required to develop a solid data strategy that is continuously effective. Ethics and legal issues surrounding data collection and algorithmic bias are some common pitfalls that Kampakis helps you avoid, while guiding you on the path to building a thriving data science culture at your organization. This updated and revised second edition includes plenty of case studies, tools for project assessment, and expanded content for hiring and managing data scientists. Data science is a language that everyone at a modern company should understand across departments. Friction in communication arises most often when management does not connect with what a data scientist is doing or how impactful data collection and storage can be for their organization. The Decision Maker’s Handbook to Data Science bridges this gap and readies you for both the present and future of your workplace in this engaging, comprehensive guide.

What You Will Learn:

- Understand how data science can be used within your business.
- Recognize the differences between AI, machine learning, and statistics.
- Become skilled at thinking like a data scientist, without being one.
- Discover how to hire and manage data scientists.
- Comprehend how to build the right environment in order to make your organization data-driven.

Who This Book Is For: Startup founders, product managers, higher-level managers, and any other non-technical decision makers who are thinking of implementing data science in their organization and hiring data scientists. A secondary audience includes people looking for a soft introduction to the subject of data science.

Data Privacy and GDPR Handbook

The definitive guide for ensuring data privacy and GDPR compliance. Privacy regulation is increasingly rigorous around the world and has become a serious concern for senior management of companies regardless of industry, size, scope, and geographic area. The General Data Protection Regulation (GDPR) imposes complex, elaborate, and stringent requirements for any organization or individuals conducting business in the European Union (EU) and the European Economic Area (EEA), while also addressing the export of personal data outside of the EU and EEA. This recently enacted law allows the imposition of fines of up to 4% of global annual revenue for privacy and data protection violations. Despite the massive potential for steep fines and regulatory penalties, there is a distressing lack of awareness of the GDPR within the business community. A recent survey conducted in the UK suggests that only 40% of firms are even aware of the new law and their responsibilities to maintain compliance. The Data Privacy and GDPR Handbook helps organizations strictly adhere to data privacy laws in the EU, the USA, and governments around the world. This authoritative and comprehensive guide includes the history and foundation of data privacy, the framework for ensuring data privacy across major global jurisdictions, a detailed framework for complying with the GDPR, and perspectives on the future of data collection and privacy practices.

- Comply with the latest data privacy regulations in the EU, EEA, US, and others
- Avoid hefty fines, damage to your reputation, and losing your customers
- Keep pace with the latest privacy policies, guidelines, and legislation
- Understand the framework necessary to ensure data privacy today and gain insights on future privacy practices

The Data Privacy and GDPR Handbook is an indispensable resource for Chief Data Officers, Chief Technology Officers, legal counsel, C-Level Executives, regulators and legislators, data privacy consultants, compliance officers, and audit managers.

Practical Time Series Analysis

Time series data analysis is increasingly important due to the massive production of such data through the internet of things, the digitalization of healthcare, and the rise of smart cities. As continuous monitoring and data collection become more common, the need for competent time series analysis with both statistical and machine learning techniques will increase. Covering innovations in time series data analysis and use cases from the real world, this practical guide will help you solve the most common data engineering and analysis challenges in time series, using both traditional statistical and modern machine learning techniques. Author Aileen Nielsen offers an accessible, well-rounded introduction to time series in both R and Python that will have data scientists, software engineers, and researchers up and running quickly. You’ll get the guidance you need to confidently:

- Find and wrangle time series data
- Undertake exploratory time series data analysis
- Store temporal data
- Simulate time series data
- Generate and select features for a time series
- Measure error
- Forecast and classify time series with machine or deep learning
- Evaluate accuracy and performance
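Two of the feature-generation steps the book covers, lag features and rolling means, can be sketched in plain Python. This illustration is independent of the book's own code and uses made-up demand figures.

```python
def rolling_mean(series, window):
    """Trailing moving average; the first window-1 positions are None
    because there is not yet a full window of history."""
    out = [None] * (window - 1)
    for i in range(window - 1, len(series)):
        out.append(sum(series[i - window + 1:i + 1]) / window)
    return out

def lag(series, k):
    """Shift a series forward by k steps, the standard way to turn a
    time series into supervised-learning features without leaking
    future values."""
    return [None] * k + list(series[:-k])

# Toy daily demand series.
demand = [10, 12, 13, 12, 15, 16, 18, 17]
features = {
    "lag_1": lag(demand, 1),
    "ma_3": rolling_mean(demand, 3),
}
```

Both helpers only ever look backwards from each position, which is the property that keeps features honest when evaluating a forecaster.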

Hands-On Exploratory Data Analysis with R

Immerse yourself in 'Hands-On Exploratory Data Analysis with R,' a comprehensive guide designed to hone your skills in data analysis using the powerful R programming language. This book walks you through all essential aspects of exploratory data analysis, from data collection and cleaning to generating insights with statistical and graphical methods, setting you up for success with any dataset.

What this book will help me do:

- Utilize powerful R packages to accelerate your data analysis workflow.
- Effectively import, clean, and prepare diverse datasets for analysis.
- Create informative and visually appealing data visualizations using ggplot2.
- Generate comprehensive and sharable reports with R Markdown and knitr.
- Handle multi-factor, optimization, and regression data challenges.

Author(s): Radhika Datar and Harish Garg are experienced data analysts and educators specializing in using R for practical data analysis. They have developed this book to share their depth of expertise, offering a detailed yet approachable learning experience. Their combined experience in teaching and applying data analysis in real-world scenarios makes this book an invaluable resource for practitioners.

Who is it for? This book is perfect for data enthusiasts looking to strengthen their foundational knowledge in exploratory data analysis. Data analysts, engineers, software developers, and product managers seeking to broaden their skillset in data interpretation and visualization will find this guide extremely beneficial. Whether you're a beginner or already possess a basic understanding of data analysis, this book will provide actionable insights to improve your workflow.

Machine Learning with R Quick Start Guide

Machine Learning with R Quick Start Guide takes you through the foundations of machine learning using the R programming language. Starting with the basics, this book introduces key algorithms and methodologies, offering hands-on examples and applicable machine learning solutions that allow you to extract insights and create predictive models. What this Book will help me do Understand the basics of machine learning and apply them using R 3.5. Learn to clean, prepare, and visualize data with R to ensure robust data analysis. Develop and work with predictive models using various machine learning techniques. Discover advanced topics like Natural Language Processing and neural network training. Implement end-to-end pipeline solutions, from data collection to predictive analytics, in R. Author(s) Sanz, the author of Machine Learning with R Quick Start Guide, is an expert in data science with years of experience in the field of machine learning and R programming. Known for their accessible and detailed teaching style, the author focuses on providing practical knowledge to empower readers in the real world. Who is it for? This book is ideal for graduate students and professionals, including aspiring data scientists and data analysts, looking to start their journey in machine learning. Readers are expected to have some familiarity with the R programming language but no prior machine learning experience is necessary. With this book, the audience will gain the ability to confidently navigate machine learning concepts and practices.

Intelligent Data Analysis for Biomedical Applications

Intelligent Data Analysis for Biomedical Applications: Challenges and Solutions presents specialized statistical, pattern recognition, machine learning, data abstraction and visualization tools for the analysis of data and discovery of mechanisms that create data. It provides computational methods and tools for intelligent data analysis, with an emphasis on problem-solving relating to automated data collection, such as computer-based patient records, data warehousing tools, intelligent alarming, effective and efficient monitoring, and more. This book provides useful references for educational institutions, industry professionals, researchers, scientists, engineers and practitioners interested in intelligent data analysis, knowledge discovery, and decision support in databases. Provides the methods and tools necessary for intelligent data analysis and gives solutions to problems resulting from automated data collection Contains an analysis of medical databases to provide diagnostic expert systems Addresses the integration of intelligent data analysis techniques within biomedical information systems

Jason Tatge, CEO, president and cofounder of Farmobile, joins the show to discuss data in the agriculture industry. The conversation touches on Jason's experience launching a startup, tips for finding success, and the value of big data from a farmer's perspective. This episode gives insight into data science for one of the oldest and most important sectors in our society.

Show Notes

00:00 - Check us out on YouTube and SoundCloud. 00:10 - Connect with producer Liam Seston on LinkedIn and Twitter. 00:15 - Connect with producer Steve Moore on LinkedIn and Twitter. 00:24 - Connect with host Al Martin on LinkedIn and Twitter. 01:20 - Connect with guest Jason Tatge on LinkedIn and Twitter. 04:24 - Get some insights into commodity trading. 10:09 - Check out Farmobile.com. 14:21 - Here are some more reasons why data collection in farming is so important. 22:21 - How data collection in farming is driving greater efficiency. 27:33 - Learn about pipeline entrepreneurs here. Follow @IBMAnalytics Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

podcast_episode
by Val Kroll, Julie Hoyer, Tim Wilson (Analytics Power Hour - Columbus (OH)), Adam Greco (Hightouch), Moe Kiss (Canva), Michael Helbling (Search Discovery)

Have you ever had stakeholders complain that they're not getting the glorious insights they expect from your analytics program? Have you ever had to deliver the news that the specific data they're looking for isn't actually available with the current platforms you have implemented? Have you ever wondered if things might just be a whole lot easier if you threw your current platform out the window and started over with a new one? If you answered "yes" to any of these questions, then this might be just the episode for you. Adam "Omniman" Greco -- a co-worker at Analytics Demystified of the Kiss sister who is not a co-host of this podcast -- joined the gang to chat about the perils of unmaintained analytics tools, the unpleasant taste of stale business requirements, and the human-based factors that can contribute to keeping a tool that should be jettisoned or jettisoning a tool that, objectively, should really be kept! For complete show notes, including links to items mentioned in this episode and a transcript of the show, visit the show page.

R Web Scraping Quick Start Guide

Discover the essentials of web scraping with R through this comprehensive guide. In this book, you will learn powerful techniques to extract valuable data from websites using the R programming language and tools like rvest and RSelenium. By understanding how to write efficient scripts, you will gain the ability to automate data collection and analysis for your projects. What this Book will help me do Understand the fundamentals of web scraping and its applications. Master the use of rvest for extracting data from static websites. Learn advanced techniques for dynamic websites using RSelenium. Write effective RegEx and XPath rules to enhance data extraction. Store, manage, and visualize the scraped data efficiently. Author(s) Aydin is an experienced data analyst and R programmer with a deep passion for data manipulation and analysis. With years of firsthand expertise in utilizing R for various data-related tasks, Aydin brings a practical and methodological approach to teaching complex concepts. His clear instruction style ensures that readers quickly grasp and apply the techniques taught in this book. Who is it for? This book is ideal for R programmers seeking to expand their skills by delving into web scraping techniques. Whether you are a beginner with a basic knowledge of R or a data analyst exploring new ways to extract and utilize data, this guide is tailored for you. It suits readers who aspire to automate data collection and expand their analytical capabilities.
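The core rvest workflow the blurb describes (parse static HTML, then select and extract elements) is language-agnostic. As an illustration only, here is a minimal Python standard-library sketch of the same idea; the sample HTML and class name are invented for the demo, and the book itself works in R.

```python
from html.parser import HTMLParser

# Invented sample page standing in for a real static website.
SAMPLE_HTML = """
<html><body>
  <h2 class="title">Reading List</h2>
  <a href="https://example.com/chapter1">Chapter 1</a>
  <a href="https://example.com/chapter2">Chapter 2</a>
</body></html>
"""

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

parser = LinkExtractor()
parser.feed(SAMPLE_HTML)
print(parser.links)
```

In rvest the equivalent selection would be done declaratively with CSS selectors or XPath rather than an event-driven parser, which is part of what makes it convenient for this kind of work.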

Summary

Every business with a website needs some way to keep track of how much traffic they are getting, where it is coming from, and which actions are being taken. The default in most cases is Google Analytics, but this can be limiting when you wish to perform detailed analysis of the captured data. To address this problem, Alex Dean co-founded Snowplow Analytics to build an open source platform that gives you total control of your website traffic data. In this episode he explains how the project and company got started, how the platform is architected, and how you can start using it today to get a clearer view of how your customers are interacting with your web and mobile applications.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
This is your host Tobias Macey and today I’m interviewing Alexander Dean about Snowplow Analytics

Interview

Introductions
How did you get involved in the area of data engineering and data management?
What is Snowplow Analytics and what problem were you trying to solve when you started the company?
What is unique about customer event data from an ingestion and processing perspective?
Challenges with properly matching up data between sources
Data collection is one of the more difficult aspects of an analytics pipeline because of the potential for inconsistency or incorrect information. How is the collection portion of the Snowplow stack designed and how do you validate the correctness of the data?
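The validation question above is central to Snowplow's design: events are checked against JSON-Schema definitions before they enter the pipeline. The sketch below illustrates that idea in Python with a deliberately simplified, hand-rolled schema; the field names and schema format are assumptions for the demo, not Snowplow's actual event format.

```python
# Minimal sketch of schema-checked event collection, in the spirit of
# Snowplow's JSON-Schema validation. Field names and the schema shape
# are simplified assumptions, not Snowplow's real wire format.
EVENT_SCHEMA = {
    "event_name": str,
    "user_id": str,
    "timestamp": int,
}

def validate_event(event: dict) -> list:
    """Return a list of validation errors; an empty list means the event passes."""
    errors = []
    for field, expected_type in EVENT_SCHEMA.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            errors.append(f"bad type for {field}: {type(event[field]).__name__}")
    return errors

good = {"event_name": "page_view", "user_id": "u1", "timestamp": 1700000000}
bad = {"event_name": "page_view", "timestamp": "not-an-int"}
print(validate_event(good))  # []
print(validate_event(bad))
```

Routing events that fail validation to a separate "bad rows" stream, rather than dropping them, is the pattern that lets a pipeline like this surface collection bugs instead of silently corrupting downstream analytics.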

Cleanliness/accuracy

What kinds of metrics should be tracked in an ingestion pipeline and how do you monitor them to ensure that everything is operating properly?
Can you describe the overall architecture of the ingest pipeline that Snowplow provides?

How has that architecture evolved from when you first started?
What would you do differently if you were to start over today?

Ensuring appropriate use of enrichment sources
What have been some of the biggest challenges encountered while building and evolving Snowplow?
What are some of the most interesting uses of your platform that you are aware of?

Keep In Touch

Alex

@alexcrdean on Twitter
LinkedIn

Snowplow

@snowplowdata on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Snowplow

GitHub

Deloitte Consulting
OpenX
Hadoop
AWS EMR (Elastic Map-Reduce)
Business Intelligence
Data Warehousing
Google Analytics
CRM (Customer Relationship Management)
S3
GDPR (General Data Protection Regulation)
Kinesis
Kafka
Google Cloud Pub-Sub
JSON-Schema
Iglu
IAB Bots And Spiders List
Heap Analytics

Podcast Interview

Redshift
SnowflakeDB
Snowplow Insights
Googl

Summary

With the attention being paid to the systems that power large volumes of high velocity data it is easy to forget about the value of data collection at human scales. Ona is a company that is building technologies to support mobile data collection, analysis of the aggregated information, and user-friendly presentations. In this episode CTO Peter Lubell-Doughtie describes the architecture of the platform, the types of environments and use cases where it is being employed, and the value of small data.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Peter Lubell-Doughtie about using Ona for collecting data and processing it with Canopy

Interview

Introduction
How did you get involved in the area of data management?
What is Ona and how did the company get started?

What are some examples of the types of customers that you work with?

What types of data do you support in your collection platform?
What are some of the mechanisms that you use to ensure the accuracy of the data that is being collected by users?
Does your mobile collection platform allow for anyone to submit data without having to be associated with a given account or organization?
What are some of the integration challenges that are unique to the types of data that get collected by mobile field workers?
Can you describe the flow of the data from collection through to analysis?
To help improve the utility of the data being collected you have started building Canopy. What was the tipping point where it became worth the time and effort to start that project?

What are the architectural considerations that you factored in when designing it?
What have you found to be the most challenging or unexpected aspects of building an enterprise data warehouse for general users?

What are your plans for the future of Ona and Canopy?

Contact Info

Email
pld on GitHub
Website

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

OpenSRP
Ona
Canopy
Open Data Kit
Earth Institute at Columbia University
Sustainable Engineering Lab
WHO
Bill and Melinda Gates Foundation
XLSForms
PostGIS
Kafka
Druid
Superset
Postgres
Ansible
Docker
Terraform

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast