talk-data.com

Topic: GDPR/CCPA
Tags: data_privacy, compliance, regulations
95 tagged activities

Activity Trend: peak of 9 activities per quarter (2020-Q1 to 2026-Q1)

Activities

95 activities · Newest first

Information Privacy Engineering and Privacy by Design: Understanding Privacy Threats, Technology, and Regulations Based on Standards and Best Practices

The Comprehensive Guide to Engineering and Implementing Privacy Best Practices

As systems grow more complex and cybersecurity attacks more relentless, safeguarding privacy is ever more challenging. Organizations are increasingly responding in two ways, and both are mandated by key standards such as GDPR and ISO/IEC 27701:2019. The first approach, privacy by design, aims to embed privacy throughout the design and architecture of IT systems and business practices. The second, privacy engineering, encompasses the technical capabilities and management processes needed to implement, deploy, and operate privacy features and controls in working systems. In Information Privacy Engineering and Privacy by Design, internationally renowned IT consultant and author William Stallings brings together the comprehensive knowledge privacy executives and engineers need to apply both approaches. Using the techniques he presents, IT leaders and technical professionals can systematically anticipate and respond to a wide spectrum of privacy requirements, threats, and vulnerabilities – addressing regulations, contractual commitments, organizational policies, and the expectations of their key stakeholders.

• Review privacy-related essentials of information security and cryptography
• Understand the concepts of privacy by design and privacy engineering
• Use modern system access controls and security countermeasures to partially satisfy privacy requirements
• Enforce database privacy via anonymization and de-identification
• Prevent data losses and breaches
• Address privacy issues related to cloud computing and IoT
• Establish effective information privacy management, from governance and culture to audits and impact assessment
• Respond to key privacy rules including GDPR, U.S. federal law, and the California Consumer Privacy Act

This guide will be an indispensable resource for anyone with privacy responsibilities in any organization, and for all students studying the privacy aspects of cybersecurity.
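Of the bullets above, anonymization and de-identification is the easiest to make concrete in a few lines. The sketch below shows keyed pseudonymization in Python; it illustrates the general technique rather than anything specific from Stallings' book, and the key handling and field names are assumptions.

```python
import hmac
import hashlib

# Hypothetical secret key; in practice this belongs in a key-management
# system, since anyone who holds it can re-identify the pseudonyms.
PSEUDONYM_KEY = b"replace-with-a-managed-secret"

def pseudonymize(value: str) -> str:
    """Replace a direct identifier with a stable, keyed pseudonym.

    Using HMAC rather than a bare hash blocks dictionary attacks on
    low-entropy identifiers such as phone numbers and email addresses.
    """
    return hmac.new(PSEUDONYM_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

record = {"email": "jane@example.com", "plan": "pro"}  # illustrative record
safe_record = {**record, "email": pseudonymize(record["email"])}
print(safe_record)
```

Note that pseudonymized data is still personal data under the GDPR, because the key holder can reverse the mapping; full anonymization requires severing that link entirely.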

Summary

The practice of data management is one that requires technical acumen, but there are also many policy and regulatory issues that inform and influence the design of our systems. With the introduction of legal frameworks such as the EU GDPR and California’s CCPA it is necessary to consider how to implement data protection and data privacy principles in the technical and policy controls that govern our data platforms. In this episode Karen Heaton and Mark Sherwood-Edwards share their experience and expertise in helping organizations achieve compliance. Even if you aren’t subject to specific rules regarding data protection, it is definitely worth listening to get an overview of what you should be thinking about while building and running data pipelines.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!

This week’s episode is also sponsored by Datacoral, an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more.

Having all of your logs and event data in one place makes your life easier when something breaks, unless that something is your Elasticsearch cluster because it’s storing too much data. CHAOSSEARCH frees you from having to worry about data retention, unexpected failures, and expanding operating costs. They give you a fully managed service to search and analyze all of your logs in S3, entirely under your control, all for half the cost of running your own Elasticsearch cluster or using a hosted platform. Try it out for yourself at dataengineeringpodcast.com/chaossearch and don’t forget to thank them for supporting the show!

You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.

Your host is Tobias Macey and today I’m interviewing Karen Heaton and Mark Sherwood-Edwards about the idea of data protection, why you might need it, and how to include the principles in your data pipelines.

Interview

Introduction
How did you get involved in the area of data management?

Data Privacy and GDPR Handbook

The definitive guide for ensuring data privacy and GDPR compliance

Privacy regulation is increasingly rigorous around the world and has become a serious concern for senior management of companies regardless of industry, size, scope, and geographic area. The General Data Protection Regulation (GDPR) imposes complex, elaborate, and stringent requirements for any organization or individuals conducting business in the European Union (EU) and the European Economic Area (EEA)—while also addressing the export of personal data outside of the EU and EEA. This recently enacted law allows the imposition of fines of up to 4% of global annual revenue for privacy and data protection violations. Despite the massive potential for steep fines and regulatory penalties, there is a distressing lack of awareness of the GDPR within the business community. A recent survey conducted in the UK suggests that only 40% of firms are even aware of the new law and their responsibilities to maintain compliance. The Data Privacy and GDPR Handbook helps organizations strictly adhere to data privacy laws in the EU, the USA, and jurisdictions around the world. This authoritative and comprehensive guide includes the history and foundation of data privacy, the framework for ensuring data privacy across major global jurisdictions, a detailed framework for complying with the GDPR, and perspectives on the future of data collection and privacy practices.

Comply with the latest data privacy regulations in the EU, EEA, US, and others
Avoid hefty fines, damage to your reputation, and losing your customers
Keep pace with the latest privacy policies, guidelines, and legislation
Understand the framework necessary to ensure data privacy today and gain insights on future privacy practices

The Data Privacy and GDPR Handbook is an indispensable resource for Chief Data Officers, Chief Technology Officers, legal counsel, C-Level Executives, regulators and legislators, data privacy consultants, compliance officers, and audit managers.

EU General Data Protection Regulation (GDPR), third edition - An Implementation and Compliance Guide

EU GDPR – An Implementation and Compliance Guide is a perfect companion for anyone managing a GDPR compliance project. It explains the changes you need to make to your data protection and information security regimes and tells you exactly what you need to do to avoid severe financial penalties.

EU GDPR & EU-U.S. Privacy Shield: A pocket guide, second edition

This concise guide is essential reading for US organizations wanting an easy to follow overview of the GDPR and the compliance obligations for handling data of EU citizens, including guidance on the EU-U.S. Privacy Shield.

Since the dawn of analytics we have strived to improve both the quality and volume of our data, with no other ambition than to ensure the largest possible dataset – not because we need it, but because we might need it. GDPR has temporarily put a wrench in our original approach, but it takes more than the law to keep a good analyst away from his data, and with machine learning as an active part of the toolbox the value of data has grown exponentially. This session is a reflection on how we have done things so far and where we might end up if we don’t stop doing business as usual and instead calibrate our efforts in a more strategic and ethical direction.

IBM Spectrum Scale and IBM StoredIQ: Identifying and securing your business data to support regulatory requirements

Having the appropriate storage for hosting business critical data and the proper analytic software for deep inspection of that data is becoming necessary to get deeper insights into the data so that users can categorize which data qualifies for compliance. This IBM® Redpaper™ publication explains why the storage features of IBM Spectrum™ Scale, when combined with the data analysis and categorization features of IBM StoredIQ®, provide an excellent platform for hosting unstructured business data that is subject to regulatory compliance guidelines, such as the General Data Protection Regulation (GDPR). In this paper, we describe how IBM StoredIQ can be used to identify files that are stored in an IBM Spectrum Scale™ file system that include personal information, such as phone numbers. These files can be secured in another file system partition by encrypting them using IBM Spectrum Scale functions. Encrypting files prevents unauthorized access because only users who can access the encryption key can decrypt them. This paper is intended for chief technology officers, solution and security architects, and systems administrators.
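As a rough illustration of the identification step described above, a few lines of Python can flag files whose contents match a phone-number pattern. The pattern, paths, and file selection here are assumptions for the sketch; products like StoredIQ use far richer classifiers.

```python
import re
from pathlib import Path

# Naive North American phone-number pattern -- an illustrative assumption;
# real classifiers use much richer rules and context.
PHONE_RE = re.compile(r"\b(?:\+?1[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b")

def files_with_phone_numbers(root: str):
    """Yield paths under root whose text content matches the phone pattern."""
    for path in Path(root).rglob("*.txt"):
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue  # unreadable file; skip rather than abort the scan
        if PHONE_RE.search(text):
            yield path

# Files flagged here would then be moved to an encrypted partition,
# mirroring the Spectrum Scale encryption step the paper describes.
for hit in files_with_phone_numbers("/data/unstructured"):
    print(hit)
```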

podcast_episode
by Val Kroll, Julie Hoyer, Tim Wilson (Analytics Power Hour - Columbus (OH)), Moe Kiss (Canva), Michael Helbling (Search Discovery)

Happy New Year! Sure. The ball has dropped in Times Square, and a new year means an opportunity to look forward. But, we wanted to take a quick look back first -- on the industry, on the podcast, and on the world in general. From GDPR to Bayesian statistics to machine learning and AI to... podcast (and #mattgershoffed) stickers, 2018 was, clearly, the Year of the Analyst. So keep analyzing! For complete show notes, including links to items mentioned in this episode and a transcript of the show, visit the show page.

Send us a text Happy holidays from the Making Data Simple team! Enjoy a rebroadcast of a conversation with Seth Dobrin, Vice President and Chief Data Officer for IBM Analytics, as he and Al explore the strategies and people your company needs to disrupt and succeed in the year ahead. Do you or your team members need new credentials to work in data? Seth also discusses what you need in your toolkit to be a data scientist at IBM.

Show Notes 00.30 Connect with Al Martin on Twitter and LinkedIn. 01.00 Connect with Seth Dobrin on Twitter and LinkedIn. 01.40 Read "What IBM looks for in a Data Scientist" by Seth Dobrin and Jean-Francois Puget. 06.00 Learn more about GDPR.  13.00 Learn more about master data management. 13.05 Learn more about unified governance and integration.  13.25 Learn more about machine learning.  14.00 Connect and learn more about Ginni Rometty.  14.40 Learn more about cognitive computing. 19.35 Connect with Rob Thomas on Twitter and LinkedIn. 21.00 Connect with Jean-Francois Puget on Twitter and LinkedIn. Follow @IBMAnalytics Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

Securing SQL Server: DBAs Defending the Database

Protect your data from attack by using SQL Server technologies to implement a defense-in-depth strategy for your database enterprise. This new edition covers threat analysis, common attacks and countermeasures, and provides an introduction to compliance that is useful for meeting regulatory requirements such as the GDPR. The multi-layered approach in this book helps ensure that a single breach does not lead to loss or compromise of confidential or business-sensitive data. Database professionals in today’s world deal increasingly with repeated data attacks against high-profile organizations and sensitive data. It is more important than ever to keep your company’s data secure. Securing SQL Server demonstrates how developers, administrators and architects can all play their part in the protection of their company’s SQL Server enterprise. This book not only provides a comprehensive guide to implementing the security model in SQL Server, including coverage of technologies such as Always Encrypted, Dynamic Data Masking, and Row Level Security, but also looks at common forms of attack against databases, such as SQL injection and backup theft, with clear, concise examples of how to implement countermeasures against these specific scenarios. Most importantly, this book gives practical advice and engaging examples of how to defend your data, and ultimately your job, against attack and compromise.

What You'll Learn
Perform threat analysis
Implement access level control and data encryption
Ensure non-repudiation by implementing comprehensive auditing
Use security metadata to ensure your security policies are enforced
Mitigate the risk of credentials being stolen
Put countermeasures in place against common forms of attack

Who This Book Is For
Database administrators who need to understand and counteract the threat of attacks against their company’s data; also useful for SQL developers and architects.
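The injection countermeasure the blurb mentions boils down to never interpolating user input into SQL text. A minimal Python sketch, assuming the third-party pyodbc package, a SQL Server ODBC driver, and a hypothetical Customers table:

```python
import pyodbc  # assumes pyodbc and an ODBC driver for SQL Server are installed

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=localhost;"
    "DATABASE=AppDb;Trusted_Connection=yes;"  # hypothetical connection details
)
cursor = conn.cursor()

user_supplied = "O'Brien'; DROP TABLE Customers;--"

# VULNERABLE: interpolating input lets it rewrite the query:
#   cursor.execute(f"SELECT * FROM Customers WHERE name = '{user_supplied}'")

# SAFE: the ? placeholder ships the value as a parameter, separate from
# the SQL text, so it can never be parsed as SQL.
cursor.execute("SELECT * FROM Customers WHERE name = ?", user_supplied)
for row in cursor.fetchall():
    print(row)
```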

EU GDPR - A Pocket Guide (European) second edition

This concise guide is essential reading for EU organisations wanting an easy-to-follow overview of the new regulation and the compliance obligations for handling data of EU citizens. The EU General Data Protection Regulation (GDPR) will unify data protection and simplify the use of personal data across the EU, and automatically supersedes member states’ domestic data protection laws. It will also apply to every organisation in the world that processes personal information of EU residents. The Regulation introduces a number of key changes for all organisations that process EU residents’ personal data. EU GDPR: A Pocket Guide provides an essential introduction to this new data protection law, explaining the Regulation and setting out the compliance obligations for EU organisations. This second edition has been updated with improved guidance around related laws such as the NIS Directive and the future ePrivacy Regulation.

EU GDPR – A Pocket Guide sets out:
A brief history of data protection and national data protection laws in the EU (such as the German BDSG, French LIL and UK DPA).
The terms and definitions used in the GDPR, including explanations.
The key requirements of the GDPR, including: which fines apply to which Articles; the six principles that should be applied to any collection and processing of personal data; the Regulation’s applicability; data subjects’ rights; data protection impact assessments (DPIAs); the role of the data protection officer (DPO) and whether you need one; data breaches, and the notification of supervisory authorities and data subjects; and obligations for international data transfers.
How to comply with the Regulation, including: understanding your data, and where and how it is used (e.g. Cloud suppliers, physical records); the documentation you need to maintain (such as statements of the information you collect and process, records of data subject consent, processes for protecting personal data); and the “appropriate technical and organisational measures” you need to take to ensure your compliance with the Regulation.
A full index of the Regulation, enabling you to find relevant Articles quickly and easily.

EU GDPR: A Pocket Guide, second edition

EU GDPR – A Pocket Guide, second edition provides an accessible overview of the changes you need to make in your organisation to comply with the new law. The EU General Data Protection Regulation unifies data protection across the EU. It applies to every organisation in the world that does business with EU residents. The Regulation introduces a number of key changes for organisations – and the change from DPA compliance to GDPR compliance is a complex one.

New for the second edition:
Updated to take into account the latest guidance from WP29 and ICO.
Improved guidance around related laws such as the NIS Directive and the future ePrivacy Regulation.

This pocket guide also sets out:
A brief history of data protection and national data protection laws in the EU (such as the UK DPA, German BDSG and French LIL).
The terms and definitions used in the GDPR, including explanations.
The key requirements of the GDPR.
How to comply with the Regulation.
A full index of the Regulation, enabling you to find relevant Articles quickly and easily.

This guide is the ideal resource for anyone wanting a clear, concise primer on the EU GDPR.

Summary

Every business with a website needs some way to keep track of how much traffic they are getting, where it is coming from, and which actions are being taken. The default in most cases is Google Analytics, but this can be limiting when you wish to perform detailed analysis of the captured data. To address this problem, Alex Dean co-founded Snowplow Analytics to build an open source platform that gives you total control of your website traffic data. In this episode he explains how the project and company got started, how the platform is architected, and how you can start using it today to get a clearer view of how your customers are interacting with your web and mobile applications.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.

You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.

Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

This is your host Tobias Macey and today I’m interviewing Alexander Dean about Snowplow Analytics

Interview

Introductions
How did you get involved in the area of data engineering and data management?
What is Snowplow Analytics and what problem were you trying to solve when you started the company?
What is unique about customer event data from an ingestion and processing perspective?
Challenges with properly matching up data between sources
Data collection is one of the more difficult aspects of an analytics pipeline because of the potential for inconsistency or incorrect information. How is the collection portion of the Snowplow stack designed and how do you validate the correctness of the data? (see the schema-validation sketch after this question list)

Cleanliness/accuracy

What kinds of metrics should be tracked in an ingestion pipeline and how do you monitor them to ensure that everything is operating properly?
Can you describe the overall architecture of the ingest pipeline that Snowplow provides?

How has that architecture evolved from when you first started?
What would you do differently if you were to start over today?

Ensuring appropriate use of enrichment sources
What have been some of the biggest challenges encountered while building and evolving Snowplow?
What are some of the most interesting uses of your platform that you are aware of?
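The links for this episode mention JSON-Schema and Iglu, which is how Snowplow approaches the validation question above: events that fail schema validation are shunted aside rather than passed downstream. A hedged sketch of that idea using the third-party jsonschema package; the schema and events here are invented for illustration, not taken from Snowplow's registries.

```python
from jsonschema import Draft7Validator  # third-party `jsonschema` package

# A hypothetical link-click schema, loosely modeled on Snowplow's
# self-describing-event approach.
LINK_CLICK_SCHEMA = {
    "type": "object",
    "properties": {
        "targetUrl": {"type": "string"},
        "elementId": {"type": "string"},
    },
    "required": ["targetUrl"],
    "additionalProperties": False,
}

validator = Draft7Validator(LINK_CLICK_SCHEMA)

def validate_event(event: dict) -> list:
    """Return validation error messages; an empty list means the event is
    safe to pass downstream, anything else goes to a bad-rows stream."""
    return [e.message for e in validator.iter_errors(event)]

print(validate_event({"targetUrl": "https://example.com"}))  # [] -- valid
print(validate_event({"elementId": "cta"}))                  # missing targetUrl
```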

Keep In Touch

Alex

@alexcrdean on Twitter LinkedIn

Snowplow

@snowplowdata on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Snowplow

GitHub

Deloitte Consulting · OpenX · Hadoop · AWS EMR (Elastic Map-Reduce) · Business Intelligence · Data Warehousing · Google Analytics · CRM (Customer Relationship Management) · S3 · GDPR (General Data Protection Regulation) · Kinesis · Kafka · Google Cloud Pub-Sub · JSON-Schema · Iglu · IAB Bots And Spiders List · Heap Analytics

Podcast Interview

Redshift · SnowflakeDB · Snowplow Insights · Googl

EU GDPR: A Pocket Guide, Schools' edition

The EU General Data Protection Regulation (GDPR) unifies data protection across the EU. It applies to every organisation in the world that handles EU residents’ personal data – which includes schools. The Regulation introduces a number of key changes for schools – and the change from compliance with the Data Protection Act 1998 (DPA) to GDPR compliance is a complex one. We have revised our popular EU GDPR – A Pocket Guide to include specific expectations of and requirements for schools, and provide an accessible overview of the changes you need to make to comply with the Regulation.

GDPR – A Pocket Guide, Schools’ Edition sets out:
A brief history of data protection and national data protection laws in the EU (such as the UK’s DPA);
Explanations of the terms and definitions used in the GDPR;
The key requirements of the GDPR;
The need to appoint a data protection officer (DPO);
The lawful basis of processing data and when consent is needed;
How to comply with the Regulation; and
A full index of the Regulation, enabling you to find relevant articles quickly and easily.

This pocket guide is the ideal resource for anyone wanting a clear, concise primer on the GDPR.

Summary

There are myriad reasons why data should be protected, and just as many ways to enforce protection in transit or at rest. Unfortunately, there is still a weak point where attackers can gain access to your unencrypted information. In this episode Ellison Anne Williams, CEO of Enveil, describes how her company uses homomorphic encryption to ensure that your analytical queries can be executed without ever having to decrypt your data.
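Enveil's own technology is proprietary, but the core idea of computing on ciphertext can be demonstrated with the Paillier cryptosystem, which is additively homomorphic. A toy sketch assuming the third-party python-paillier (phe) package; it shows an aggregate computed without decrypting any individual value, which is the property the episode explores.

```python
from phe import paillier  # third-party python-paillier package

public_key, private_key = paillier.generate_paillier_keypair()

# The analyst's side holds only the public key and encrypted values.
salaries = [52_000, 61_500, 58_250]            # illustrative plaintext data
encrypted = [public_key.encrypt(s) for s in salaries]

# Paillier is additively homomorphic: adding ciphertexts adds the
# underlying plaintexts, so the sum is computed entirely on encrypted data.
encrypted_total = sum(encrypted[1:], encrypted[0])

# Only the data owner, who holds the private key, can read the result.
print(private_key.decrypt(encrypted_total))    # 171750
```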

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.

Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Your host is Tobias Macey and today I’m interviewing Ellison Anne Williams about Enveil, a pioneering data security company protecting Data in Use

Interview

Introduction
How did you get involved in the area of data security?
Can you start by explaining what your mission is with Enveil and how the company got started?
One of the core aspects of your platform is the principle of homomorphic encryption. Can you explain what that is and how you are using it?

What are some of the challenges associated with scaling homomorphic encryption?
What are some difficulties associated with working on encrypted data sets?

Can you describe the underlying architecture for your data platform?

How has that architecture evolved from when you first began building it?

What are some use cases that are unlocked by having a fully encrypted data platform?
For someone using the Enveil platform, what does their workflow look like?
A major reason for never decrypting data is to protect it from attackers and unauthorized access. What are some of the remaining attack vectors?
What are some aspects of the data being protected that still require additional consideration to prevent leaking information? (e.g. identifying individuals based on geographic data, or purchase patterns)
What do you have planned for the future of Enveil?

Contact Info

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data security today?

Links

Enveil · NSA · GDPR · Intellectual Property · Zero Trust · Homomorphic Encryption · Ciphertext · Hadoop · PII (Personally Identifiable Information) · TLS (Transport Layer Security) · Spark · Elasticsearch · Side-channel attacks · Spectre and Meltdown

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

podcast_episode
by Val Kroll, Julie Hoyer, Tim Wilson (Analytics Power Hour - Columbus (OH)), Jodi Daniels (Red Clover Advisors), Moe Kiss (Canva), Michael Helbling (Search Discovery)

For this mini-episode, Tim sits down (virtually) with Episode #077 guest Jodi Daniels from Red Clover Advisors to chat about the world of privacy post-May 25, 2018 (GDPR), as well as the upcoming California Consumer Privacy Act of 2018 (CCPA). Has the free-wheeling world of free-flowing consumer data ended, or are companies simply learning how to behave with more care? Give it a listen to find out!

In this podcast, @RobertoMaranca shared his thoughts on running a large data-driven organization and on the future of data organizations through compliance and privacy. He shared how businesses can survive policies like GDPR and prepare themselves for better data transparency and visibility. This podcast is great for leaders running a transnational corporation.

TIMELINE: 0:28 Roberto's journey. 8:18 Best practices as a data steward. 16:58 Data leadership and GDPR. 22:18 Impact of GDPR. 25:34 GDPR creating better knowledge archive. 29:27 GDPR and IOT infrastructure. 35:08 Shadow IT phenomenon and consumer privacy. 44:54 Suggestions for enterprises to deal with privacy disruption. 50:52 Data debt. 53:10 Opportunities in new privacy frameworks. 57:52 Roberto's success mantra. 1:02:38 Roberto's favorite reads.

Roberto's Recommended Read: Team of Teams: New Rules of Engagement for a Complex World by General Stanley McChrystal and Tantum Collins https://amzn.to/2kUxW1K Do Androids Dream of Electric Sheep?: The inspiration for the films Blade Runner and Blade Runner 2049 by Philip K. Dick https://amzn.to/2xOOpxZ A Scanner Darkly by Philip K. Dick https://amzn.to/2sAsUMs Other Philip K. Dick Books @ https://amzn.to/2JBwwY0

Podcast Link: https://futureofdata.org/data-leadership-through-privacy-gdpr-by-robertomaranca/

Roberto’s BIO: With almost 25 years of experience in the world of IT and Data, Roberto has spent most of his working life with General Electric in their Capital Division, where, since 2014, as Chief Data Officer for their International Unit, he has been overseeing the implementation of the Data Governance and Quality frameworks, spanning from supporting risk model validation to enabling divestitures and leading their more recent Basel III data initiatives. For the last year, he has held the role of Chief Data Officer at Lloyds Banking Group, shaping and implementing a new Data Strategy and dividing his time between BCBS 239 and GDPR programs.

Roberto holds a Master’s Degree in Aeronautical Engineering from “Federico II” Naples University.

About #Podcast:

FutureOfData podcast is a conversation starter to bring leaders, influencers, and lead practitioners to discuss their journey to create the data-driven future.

Want to sponsor? Email us @ [email protected]

Keywords:

#FutureOfData #DataAnalytics #Leadership #Podcast #BigData #Strategy

Summary

Data integration and routing is a constantly evolving problem and one that is fraught with edge cases and complicated requirements. The Apache NiFi project models this problem as a collection of data flows that are created through a self-service graphical interface. This framework provides a flexible platform for building a wide variety of integrations that can be managed and scaled easily to fit your particular needs. In this episode project members Kevin Doran and Andy LoPresto discuss the ways that NiFi can be used, how to start using it in your environment, and plans for future development. They also explain how it fits in the broad landscape of data tools, the interesting and challenging aspects of the project, and how to build new extensions.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.

Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end-to-end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that, learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen

Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.

Your host is Tobias Macey and today I’m interviewing Kevin Doran and Andy LoPresto about Apache NiFi

Interview

Introduction
How did you get involved in the area of data management?
Can you start by explaining what NiFi is?
What is the motivation for building a GUI as the primary interface for the tool when the current trend is to represent everything as code?
How did you get involved with the project?

Where does it sit in the broader landscape of data tools?

Does the data that is processed by NiFi flow through the servers that it is running on (à la Spark/Flink/Kafka), or does it orchestrate actions on other systems (à la Airflow/Oozie)?

How do you manage versioning and backup of data flows, as well as promoting them between environments?

One of the advertised features is tracking provenance for data flows that are managed by NiFi. How is that data collected and managed?

What types of reporting are available across this information?

What are some of the use cases or requirements that lend themselves well to being solved by NiFi?

When is NiFi the wrong choice?

What is involved in deploying and scaling a NiFi installation?

What are some of the system/network parameters that should be considered? What are the scaling limitations?

What have you found to be some of the most interesting, unexpected, and/or challenging aspects of building and maintaining the NiFi project and community?
What do you have planned for the future of NiFi?

Contact Info

Kevin Doran

@kevdoran on Twitter Email

Andy LoPresto

@yolopey on Twitter Email

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

NiFi · HortonWorks DataFlow · HortonWorks · Apache Software Foundation · Apple · CSV · XML · JSON · Perl · Python · Internet Scale · Asset Management · Documentum · DataFlow · NSA (National Security Agency) · 24 (TV Show) · Technology Transfer Program · Agile Software Development · Waterfall · Spark · Flink · Kafka · Oozie · Luigi · Airflow · FluentD · ETL (Extract, Transform, and Load) · ESB (Enterprise Service Bus) · MiNiFi · Java · C++ · Provenance · Kubernetes · Apache Atlas · Data Governance · Kibana · K-Nearest Neighbors · DevOps · DSL (Domain Specific Language) · NiFi Registry · Artifact Repository · Nexus · NiFi CLI · Maven Archetype · IoT · Docker · Backpressure · NiFi Wiki · TLS (Transport Layer Security) · Mozilla TLS Observatory · NiFi Flow Design System · Data Lineage · GDPR (General Data Protection Regulation)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Summary

Web and mobile analytics are an important part of any business, and difficult to get right. The most frustrating part is when you realize that you haven’t been tracking a key interaction, having to write custom logic to add that event, and then waiting to collect data. Heap is a platform that automatically tracks every event so that you can retroactively decide which actions are important to your business and easily build reports with or without SQL. In this episode Dan Robinson, CTO of Heap, describes how they have architected their data infrastructure, how they build their tracking agents, and the data virtualization layer that enables users to define their own labels.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.

For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14-day trial and get a sweet new T-Shirt.

Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.

Your host is Tobias Macey and today I’m interviewing Dan Robinson about Heap and their approach to collecting, storing, and analyzing large volumes of data

Interview

Introduction
How did you get involved in the area of data management?
Can you start by giving a brief overview of Heap?
One of your differentiating features is the fact that you capture every interaction on web and mobile platforms for your customers. How do you prevent the user experience from suffering as a result of network congestion, while ensuring the reliable delivery of that data? (see the batching sketch after this question list)
Can you walk through the lifecycle of a single event from source to destination and the infrastructure components that it traverses to get there?
Data collected in a user’s browser can often be messy due to various browser plugins, variations in runtime capabilities, etc. How do you ensure the integrity and accuracy of that information?

What are some of the difficulties that you have faced in establishing a representation of events that allows for uniform processing and storage?

What is your approach for merging and enriching event data with the information that you retrieve from your supported integrations?

What challenges does that pose in your processing architecture?

What are some of the problems that you have had to deal with to allow for processing and storing such large volumes of data?

How has that architecture changed or evolved over the life of the company? What are some changes that you are anticipating in the near future?

Can you describe your approach for synchronizing customer data with their individual Redshift instances and the difficulties that entails?
What are some of the most interesting challenges that you have faced while building the technical and business aspects of Heap?
What changes have been necessary as a result of GDPR?
What are your plans for the future of Heap?
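The reliable-delivery question above is generic enough to sketch: tracking agents typically buffer events locally, flush them in batches, and back off on failure so the user experience is unaffected. This is a sketch of that pattern under stated assumptions (a hypothetical collector URL and payload shape), not a description of Heap's actual agent.

```python
import json
import time
import urllib.request

class EventBuffer:
    """Batch-and-retry event sender: queue locally, flush in bulk,
    back off on failure so events are not lost to transient errors."""

    def __init__(self, endpoint: str, batch_size: int = 50):
        self.endpoint = endpoint            # hypothetical collector URL
        self.batch_size = batch_size
        self.queue = []

    def track(self, event: dict) -> None:
        self.queue.append(event)
        if len(self.queue) >= self.batch_size:
            self.flush()

    def flush(self, retries: int = 3) -> None:
        if not self.queue:
            return
        body = json.dumps(self.queue).encode("utf-8")
        request = urllib.request.Request(
            self.endpoint, data=body,
            headers={"Content-Type": "application/json"})
        for attempt in range(retries):
            try:
                urllib.request.urlopen(request, timeout=5)
                self.queue.clear()          # delivered; drop the batch
                return
            except OSError:
                time.sleep(2 ** attempt)    # back off; events stay queued

buffer = EventBuffer("https://collector.example.com/events")
buffer.track({"type": "click", "target": "#signup"})
buffer.flush()
```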

Contact Info

@danlovesproofs on twitter
[email protected]
@drob on github
heapanalytics.com / @heap on twitter
https://heapanalytics.com/blog/category/engineering?utm_source=rss&utm_medium=rss

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?