Customer Analytics At Scale With Segment

2019-03-04 · Data Engineering Podcast

podcast_episode

by Calvin French-Owen (Segment) , Tobias Macey

AI/ML Analytics Big Data Data Engineering Data Management Data Science Data Streaming

Summary Customer analytics is a problem domain that has given rise to its own industry. In order to gain a full understanding of what your users are doing and how best to serve them, you may need to send data to multiple services, each with its own tracking code or APIs. To simplify this process and allow your non-engineering employees to gain access to the information they need to do their jobs, Segment provides a single interface for capturing data and routing it to all of the places that you need it. In this interview Segment CTO and co-founder Calvin French-Owen explains how the company got started, how it manages to multiplex data streams from multiple sources to multiple destinations, and how it can simplify your work of gaining visibility into how your customers are engaging with your business.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need, then it’s time to talk to our friends at strongDM. They have built an easy-to-use platform that lets you leverage your company’s single sign-on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show, please leave a review on iTunes and tell your friends and co-workers. You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don’t want to miss out on this year’s conference season.
We have partnered with O’Reilly Media for the Strata conference in San Francisco on March 25th and the Artificial Intelligence conference in NYC on April 15th. Here in Boston, starting on May 17th, you still have time to grab a ticket to the Enterprise Data World, and from April 30th to May 3rd is the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register. Your host is Tobias Macey and today I’m interviewing Calvin French-Owen about the data platform that Segment has built to handle multiplexing continuous streams of data from multiple sources to multiple destinations.

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what Segment is and how the business got started?

What are some of the primary ways that your customers are using the Segment platform? How have the capabilities and use cases of the Segment platform changed since it was first launched?

Layered on top of the data integration platform you have added the concepts of Protocols and Personas. Can you explain how each of those products fit into the over

Walmart and the CICS Asynchronous API: An Adoption Experience

2019-03-01 · O'Reilly Data Engineering Books

book

by Frank De Gilio, Pradeep Gohil, Nick Garrod, Randy Frerking, Rich Jackson, Kellie Mathis

IBM data data-engineering

Abstract This IBM® Redbooks® publication discusses practical uses of the IBM CICS asynchronous API capability. It describes the methodology, design and thought process used by a large client, Walmart, and the considerations of the choices made. The Redbooks publication provides real life examples and application patterns that benefit from the performance and scalability offered by the new API. The book discusses the homegrown methodology used by Walmart before the API was available and compares it with the design using the new API. A discussion of the process used to migrate older applications to begin using the new API is included so the reader will understand the ease of implementing the new API. A description of real world usage patterns describes the current production application Walmart has deployed as well as other patterns to give the reader a sense of what's possible applying creative thinking with technology improvements. Finally, a section is included on the areas to be considered as you begin to plan and implement asynchronous API capabilities. This book should be read by: Enterprise Architects searching for faster ways to service strategic applications across the enterprise. Solution Architects who want to better understand implementation possibilities for improved response times and better performance for CICS applications. CICS programmers looking to modernize and provide improved response times.

Speed Up Your Analytics With The Alluxio Distributed Storage System

2019-02-19 · Data Engineering Podcast

podcast_episode

by Bin Fan (Alluxio) , Tobias Macey

Analytics Big Data Cloud Computing Data Engineering Data Management

Summary Distributed storage systems are the foundational layer of any big data stack. There are a variety of implementations which support different specialized use cases and come with associated tradeoffs. Alluxio is a distributed virtual filesystem which integrates with multiple persistent storage systems to provide a scalable, in-memory storage layer for scaling computational workloads independent of the size of your data. In this episode Bin Fan explains how he got involved with the project, how it is implemented, and the use cases that it is particularly well suited for. If your storage and compute layers are too tightly coupled and you want to scale them independently then Alluxio is the tool for the job.

Introduction

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show, please leave a review on iTunes or Google Play Music, tell your friends and co-workers, and share it on social media. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. Your host is Tobias Macey and today I’m interviewing Bin Fan about Alluxio, a distributed virtual filesystem for unified access to disparate data sources.

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what Alluxio is and the history of the project?

What are some of the use cases that Alluxio enables?

How is Alluxio implemented and how has its architecture evolved over time?

What are some of the techniques that you use to mitigate the impact of latency, particularly when interfacing with storage systems across cloud providers and private data centers?

When dealing with large volumes of data over time it is often necessary to age out older records to cheaper storage. What capabilities does Alluxio provide for that lifecycle management? What are some of the most complex or challenging aspects of providing a unified abstraction across disparate storage platforms?

What are the tradeoffs that are made to provide a single API across systems with varying capabilities?

Testing and verification of distributed systems is a complex undertaking. Can you describe the approach that you use to ensure proper functionality of Alluxio as part of the development and release process?

In order to allow for this large scale testing with any regularity it must be straightforward to deploy and configure Alluxio. What are some of the mechanisms that you have built into the platform to simplify the operational aspects?

Can you describe a typical system topology that incorporates Alluxio? For someone planning a deployment of Alluxio, what should they be considering in terms of system requirements and deployment topologies?

What are some edge cases or operational complexities that they should be aware of?

What are some cases where Alluxio is the wrong choice?

What are some projects or products that provide a similar capability to Alluxio?

What do you have planned for the future of the Alluxio project and company?

Contact Info

LinkedIn @binfan on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Alluxio

Project Company

Carnegie Me

THE POTENTIAL OF A WELL STRUCTURED BUSINESS DATA FEED

2019-02-01 · Superweek 2019

talk

by Zoran Arsovski, Ivaylo Shipochky (VertoDigital)

Looking closer at a real case study, we will put the data feed in the spotlight and look at how we leverage Google Ads API, Bing Ads, Google Campaign Manager and Google Search Ads 360 to fully automate advertising campaign setup and optimization.

Apache Spark Quick Start Guide

2019-01-31 · O'Reilly Data Engineering Books

book

by Akash Grade , Shrey Mehrotra

AI/ML Big Data Java Python Scala Spark SQL Data Streaming apache-spark data data-engineering

Dive into the world of scalable data processing with the "Apache Spark Quick Start Guide." This book offers a foundational introduction to Spark, empowering readers to harness its capabilities for big data processing. With clear explanations and hands-on examples, you'll learn to implement Spark applications that handle complex data tasks efficiently. What this Book will help me do Understand and implement Spark's RDDs and DataFrame APIs to process large datasets effectively. Set up a local development environment for Spark-based projects. Develop skills to debug and optimize slow-performing Spark applications. Harness built-in modules of Spark for SQL, streaming, and machine learning applications. Adopt best practices and optimization techniques for high-performance Spark applications. Author(s) Shrey Mehrotra is a seasoned software developer with expertise in big data technologies, particularly Apache Spark. With years of hands-on industry experience, Shrey focuses on making complex technical concepts accessible to all. Through his writing, he aims to share clear, practical guidance for developers of all levels. Who is it for? This guide is perfect for big data enthusiasts and professionals looking to learn Apache Spark's capabilities from scratch. It's aimed at data engineers interested in optimizing application performance and data scientists wanting to integrate machine learning with Spark. A basic familiarity with either Scala, Python, or Java is recommended.

Java XML and JSON: Document Processing for Java SE

2019-01-10 · O'Reilly Data Engineering Books

book

by Jeff Friesen

Java JSON Oracle XML data data-engineering storage-formats

Use this guide to master the XML metalanguage and JSON data format along with significant Java APIs for parsing and creating XML and JSON documents from the Java language. New in this edition is coverage of Jackson (a JSON processor for Java) and Oracle’s own Java API for JSON processing (JSON-P), which is a JSON processing API for Java EE that also can be used with Java SE. This new edition of Java XML and JSON also expands coverage of DOM and XSLT to include additional API content and useful examples. All examples in this book have been tested under Java 11. In some cases, source code has been simplified to use Java 11’s var language feature. The first six chapters focus on XML along with the SAX, DOM, StAX, XPath, and XSLT APIs. The remaining six chapters focus on JSON along with the mJson, GSON, JsonPath, Jackson, and JSON-P APIs. Each chapter ends with select exercises designed to challenge your grasp of the chapter's content. An appendix provides the answers to these exercises. What You'll Learn Master the XML language Create, validate, parse, and transform XML documents Apply Java’s SAX, DOM, StAX, XPath, and XSLT APIs Master the JSON format for serializing and transmitting data Code against third-party APIs such as Jackson, mJson, Gson, JsonPath Master Oracle’s JSON-P API in a Java SE context Who This Book Is For Intermediate and advanced Java programmers who are developing applications that must access data stored in XML or JSON documents. The book also targets developers wanting to understand the XML language and JSON data format.

Practical Apache Spark: Using the Scala API

2018-12-12 · O'Reilly Data Engineering Books

book

by Dharanitharan Ganesan , Subhashini Chellappan

AI/ML Hive Kafka Scala Spark SQL Data Streaming apache-spark data data-engineering

Work with Apache Spark using Scala to deploy and set up single-node, multi-node, and high-availability clusters. This book discusses various components of Spark such as Spark Core, DataFrames, Datasets and SQL, Spark Streaming, Spark MLlib, and R on Spark with the help of practical code snippets for each topic. Practical Apache Spark also covers the integration of Apache Spark with Kafka with examples. You’ll follow a learn-to-do-by-yourself approach to learning – learn the concepts, practice the code snippets in Scala, and complete the assignments given to get an overall exposure. On completion, you’ll have knowledge of the functional programming aspects of Scala, and hands-on expertise in various Spark components. You’ll also become familiar with machine learning algorithms with real-time usage. What You Will Learn Discover the functional programming features of Scala Understand the complete architecture of Spark and its components Integrate Apache Spark with Hive and Kafka Use Spark SQL, DataFrames, and Datasets to process data using traditional SQL queries Work with different machine learning concepts and libraries using Spark's MLlib packages Who This Book Is For Developers and professionals who deal with batch and stream data processing.

Programming Skills for Data Science: Start Writing Code to Wrangle, Analyze, and Visualize Data with R, First Edition

2018-11-19 · O'Reilly Data Science Books

book

by Michael Freeman , Joel Ross

Data Science Git GitHub R data data-science

The Foundational Hands-On Skills You Need to Dive into Data Science “Freeman and Ross have created the definitive resource for new and aspiring data scientists to learn foundational programming skills.” –From the foreword by Jared Lander, series editor Using data science techniques, you can transform raw data into actionable insights for domains ranging from urban planning to precision medicine. Programming Skills for Data Science brings together all the foundational skills you need to get started, even if you have no programming or data science experience. Leading instructors Michael Freeman and Joel Ross guide you through installing and configuring the tools you need to solve professional-level data science problems, including the widely used R language and Git version-control system. They explain how to wrangle your data into a form where it can be easily used, analyzed, and visualized so others can see the patterns you've uncovered. Step by step, you'll master powerful R programming techniques and troubleshooting skills for probing data in new ways, and at larger scales. Freeman and Ross teach through practical examples and exercises that can be combined into complete data science projects. Everything's focused on real-world application, so you can quickly start analyzing your own data and getting answers you can act upon.
Learn to: install your complete data science environment, including R and RStudio; manage projects efficiently, from version tracking to documentation; host, manage, and collaborate on data science projects with GitHub; master R language fundamentals: syntax, programming concepts, and data structures; load, format, explore, and restructure data for successful analysis; interact with databases and web APIs; master key principles for visualizing data accurately and intuitively; produce engaging, interactive visualizations with ggplot and other R packages; transform analyses into sharable documents and sites with R Markdown; create interactive web data science applications with Shiny; collaborate smoothly as part of a data science team. Register your book for convenient access to downloads, updates, and/or corrections as they become available. See inside book for details.

How Upsolver Is Building A Data Lake Platform In The Cloud with Yoni Iny - Episode 56

2018-11-11 · Data Engineering Podcast

podcast_episode

by Tobias Macey , Yoni Iny (Upsolver)

Avro CloudWatch Kinesis Cassandra Cloud Computing CSV Data Engineering Data Lake Data Management Datadog DevOps DWH

Summary

A data lake can be a highly valuable resource, as long as it is well built and well managed. Unfortunately, that can be a complex and time-consuming effort, requiring specialized knowledge and diverting resources from your primary business. In this episode Yoni Iny, CTO of Upsolver, discusses the various components that are necessary for a successful data lake project, how the Upsolver platform is architected, and how modern data lakes can benefit your organization.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. Your host is Tobias Macey and today I’m interviewing Yoni Iny about Upsolver, a data lake platform that lets developers integrate and analyze streaming data with ease.

Interview

Introduction How did you get involved in the area of data management? Can you start by describing what Upsolver is and how it got started?

What are your goals for the platform?

There are a lot of opinions on both sides of the data lake argument. When is it the right choice for a data platform?

What are the shortcomings of a data lake architecture?

How is Upsolver architected?

How has that architecture changed over time? How do you manage schema validation for incoming data? What would you do differently if you were to start over today?

What are the biggest challenges at each of the major stages of the data lake? What is the workflow for a user of Upsolver and how does it compare to a self-managed data lake? When is Upsolver the wrong choice for an organization considering implementation of a data platform? Is there a particular scale or level of data maturity for an organization at which they would be better served by moving management of their data lake in house? What features or improvements do you have planned for the future of Upsolver?

Contact Info

Yoni

yoniiny on GitHub LinkedIn

Upsolver

Website @upsolver on Twitter LinkedIn Facebook

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Upsolver Data Lake Israeli Army Data Warehouse Data Engineering Podcast Episode About Data Curation Three Vs Kafka Spark Presto Drill Spot Instances Object Storage Cassandra Redis Latency Avro Parquet ORC Data Engineering Podcast Episode About Data Serialization Formats SSTables Run Length Encoding CSV (Comma Separated Values) Protocol Buffers Kinesis ETL DevOps Prometheus Cloudwatch DataDog InfluxDB SQL Pandas Confluent KSQL

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Self Service Business Intelligence And Data Sharing Using Looker with Daniel Mintz - Episode 55

2018-11-05 · Data Engineering Podcast

podcast_episode

by Daniel Mintz (Looker) , Tobias Macey

AI/ML Airflow Athena BI BigQuery Data Engineering Data Management DevOps DWH ETL/ELT Hadoop Hive

Summary

Business intelligence is a necessity for any organization that wants to be able to make informed decisions based on the data that they collect. Unfortunately, it is common for different portions of the business to build their reports with different assumptions, leading to conflicting views and poor choices. Looker is a modern tool for building and sharing reports that makes it easy to get everyone on the same page. In this episode Daniel Mintz explains how the product is architected, the features that make it easy for any business user to access and explore their reports, and how you can use it for your organization today.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. Your host is Tobias Macey and today I’m interviewing Daniel Mintz about Looker, a modern data platform that can serve the data needs of an entire company.

Interview

Introduction How did you get involved in the area of data management? Can you start by describing what Looker is and the problem that it is aiming to solve?

How do you define business intelligence?

How is Looker unique from other approaches to business intelligence in the enterprise?

How does it compare to open source platforms for BI?

Can you describe the technical infrastructure that supports Looker? Given that you are connecting to the customer’s data store, how do you ensure sufficient security? For someone who is using Looker, what does their workflow look like?

How does that change for different user roles (e.g. data engineer vs. sales management)?

What are the scaling factors for Looker, both in terms of volume of data for reporting from, and for user concurrency? What are the most challenging aspects of building a business intelligence tool and company in the modern data ecosystem?

What are the portions of the Looker architecture that you would do differently if you were to start over today?

What are some of the most interesting or unusual uses of Looker that you have seen? What is in store for the future of Looker?

Contact Info

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Looker Upworthy MoveOn.org LookML SQL Business Intelligence Data Warehouse Linux Hadoop BigQuery Snowflake Redshift DB2 PostGres ETL (Extract, Transform, Load) ELT (Extract, Load, Transform) Airflow Luigi NiFi Data Curation Episode Presto Hive Athena DRY (Don’t Repeat Yourself) Looker Action Hub Salesforce Marketo Twilio Netscape Navigator Dynamic Pricing Survival Analysis DevOps BigQuery ML Snowflake Data Sharehouse


Using Notebooks As The Unifying Layer For Data Roles At Netflix with Matthew Seal - Episode 54

2018-10-29 · Data Engineering Podcast

podcast_episode

by Matthew Seal (Netflix) , Tobias Macey

Data Engineering Data Management GitHub Python Scala

Summary

Jupyter notebooks have gained popularity among data scientists as an easy way to do exploratory analysis and build interactive reports. However, this can cause difficulties when trying to move the work of the data scientist into a more standard production environment, due to the translation efforts that are necessary. At Netflix they had the crazy idea that perhaps that last step isn’t necessary, and the production workflows can just run the notebooks directly. Matthew Seal is one of the primary engineers who has been tasked with building the tools and practices that allow the various data oriented roles to unify their work around notebooks. In this episode he explains the rationale for the effort, the challenges that it has posed, the development that has been done to make it work, and the benefits that it provides to the Netflix data platform teams.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. Your host is Tobias Macey and today I’m interviewing Matthew Seal about the ways that Netflix is using Jupyter notebooks to bridge the gap between data roles.

Interview

Introduction How did you get involved in the area of data management? Can you start by outlining the motivation for choosing Jupyter notebooks as the core interface for your data teams?

Where are you using notebooks and where are you not?

What is the technical infrastructure that you have built to support that design choice? Which team was driving the effort?

Was it difficult to get buy in across teams?

How much shared code have you been able to consolidate or reuse across teams/roles? Have you investigated the use of any of the other notebook platforms for similar workflows? What are some of the notebook anti-patterns that you have encountered and what conventions or tooling have you established to discourage them? What are some of the limitations of the notebook environment for the work that you are doing? What have been some of the most challenging aspects of building production workflows on top of Jupyter notebooks? What are some of the projects that are ongoing or planned for the future that you are most excited by?

Contact Info

Matthew Seal

Email LinkedIn @codeseal on Twitter MSeal on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Netflix Notebook Blog Posts Nteract Tooling OpenGov Project Jupyter Zeppelin Notebooks Papermill Titus Commuter Scala Python R Emacs NBDime


Matplotlib 3.0 Cookbook

2018-10-23 · O'Reilly Data Science Books

book

by Srinivasa Rao Poladi , Nikhil Borkar

Data Science Matplotlib Python data data-science data-science-tasks data-visualization python-viz-tools

Matplotlib 3.0 Cookbook is your go-to guide for mastering the Matplotlib library in Python for creating a wide range of data visualizations. Through 150+ practical recipes, you will learn how to design intuitive and detailed charts, graphs, and dashboards, navigating from simple plots to advanced interactive and 3D visualizations. What this Book will help me do Develop professional-quality data visualizations using Matplotlib. Leverage Matplotlib's API for both quick plotting and advanced customization. Create interactive and animated plots for engaging data representation. Extend Matplotlib functionalities with toolkits like cartopy and axisartist. Integrate Matplotlib figures into GUI applications for broader usage. Author(s) Srinivasa Rao Poladi and Nikhil Borkar are experienced Python developers and enthusiasts who have collaborated in creating a resourceful guide to Matplotlib. They bring extensive experience in data science visualization and Python programming. Their collaborative effort ensures clarity and an approachable learning curve for anyone delving into graphical data representation using Matplotlib. Who is it for? This book is ideal for data scientists, Python developers, and visualization enthusiasts eager to enhance their technical plotting skills. The content covers both fundamentals and advanced topics, suitable for users ranging from beginners curious about Python visualization to experts seeking streamlined workflows and advanced techniques.

Of Checklists, Ethics, and Data with Emily Miller and Peter Bull (Cross Post from Podcast.init) - Episode 53

2018-10-22 · Data Engineering Podcast

podcast_episode

by Emily Miller (Driven Data) , Peter Bull (Driven Data) , Tobias Macey

Data Engineering Data Management Data Science GitHub Pandas Python

Summary

As data science becomes more widespread and has a bigger impact on the lives of people, it is important that those projects and products are built with a conscious consideration of ethics. Keeping ethical principles in mind throughout the lifecycle of a data project helps to reduce the overall effort of preventing negative outcomes from the use of the final product. Emily Miller and Peter Bull of Driven Data have created Deon to improve the communication and conversation around ethics among and between data teams. It is a Python project that generates a checklist of common concerns for data oriented projects at the various stages of the lifecycle where they should be considered. In this episode they discuss their motivation for creating the project, the challenges and benefits of maintaining such a checklist, and how you can start using it today.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. This is your host Tobias Macey and this week I am sharing an episode from my other show, Podcast.init, about a project from Driven Data called Deon. It is a simple tool that generates a checklist of ethical considerations for the various stages of the lifecycle for data oriented projects. This is an important topic for all of the teams involved in the management and creation of projects that leverage data. So give it a listen and if you like what you hear, be sure to check out the other episodes at pythonpodcast.com.

Interview

Introductions How did you get introduced to Python? Can you start by describing what Deon is and your motivation for creating it? Why a checklist, specifically? What’s the advantage of this over an oath, for example? What is unique to data science in terms of the ethical concerns, as compared to traditional software engineering? What is the typical workflow for a team that is using Deon in their projects? Deon ships with a default checklist but allows for customization. What are some common addendums that you have seen?

Have you received pushback on any of the default items?

How does Deon simplify communication around ethics across team boundaries? What are some of the most often overlooked items? What are some of the most difficult ethical concerns to comply with for a typical data science project? How has Deon helped you at Driven Data? What are the customer facing impacts of embedding a discussion of ethics in the product development process? Some of the items on the default checklist coincide with regulatory requirements. Are there any cases where regulation is in conflict with an ethical concern that you would like to see practiced? What are your hopes for the future of the Deon project?

Keep In Touch

Emily

LinkedIn ejm714 on GitHub

Peter

LinkedIn @pjbull on Twitter pjbull on GitHub

Driven Data

@drivendataorg on Twitter drivendataorg on GitHub Website

Picks

Tobias

Richard Bond Glass Art

Emily

Tandem Coffee in Portland, Maine

Peter

The Model Bakery in Saint Helena and Napa, California

Links

Deon Driven Data International Development Brookings Institution Stata Econometrics Metis Bootcamp Pandas

Podcast Episode

C# .NET Podcast.init Episode On Software Ethics Jupyter Notebook

Podcast Episode

Word2Vec cookiecutter data science Logistic Regression

The intro and outro music is

Improving The Performance Of Cloud-Native Big Data At Netflix Using The Iceberg Table Format with Ryan Blue - Episode 52

2018-10-15 · Data Engineering Podcast Listen

podcast_episode

by Ryan Blue (Tabular) , Tobias Macey

Avro Big Data Cloud Computing Data Engineering Data Lake Data Management Data Modelling Git GitHub Hadoop HDFS Hive +6 more

Summary

With the growth of the Hadoop ecosystem came a proliferation of implementations for the Hive table format. Unfortunately, with no formal specification, each project works slightly differently, which increases the difficulty of integration across systems. The Hive format is also built with the assumptions of a local filesystem, which results in painful edge cases when leveraging cloud object storage for a data lake. In this episode Ryan Blue explains how his work on the Iceberg table format specification and reference implementation has allowed Netflix to improve the performance and simplify operations for their S3 data lake. This is a highly detailed and technical exploration of how a well-engineered metadata layer can improve the speed, accuracy, and utility of large scale, multi-tenant, cloud-native data platforms.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. Your host is Tobias Macey and today I’m interviewing Ryan Blue about Iceberg, a Netflix project to implement a high performance table format for batch workloads.

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what Iceberg is and the motivation for creating it?

Was the project built with open-source in mind or was it necessary to refactor it from an internal project for public use?

How has the use of Iceberg simplified your work at Netflix? How is the reference implementation architected and how has it evolved since you first began work on it?

What is involved in deploying it to a user’s environment?

For someone who is interested in using Iceberg within their own environments, what is involved in integrating it with their existing query engine?

Is there a migration path for pre-existing tables into the Iceberg format?

How is schema evolution managed at the file level?

How do you handle files on disk that don’t contain all of the fields specified in a table definition?

One of the complicated problems in data modeling is managing table partitions. How does Iceberg help in that regard? What are the unique challenges posed by using S3 as the basis for a data lake?

What are the benefits that outweigh the difficulties?

What have been some of the most challenging or contentious details of the specification to define?

What are some things that you have explicitly left out of the specification?

What are your long-term goals for the Iceberg specification?

Do you anticipate the reference implementation continuing to be used and maintained?

Contact Info

rdblue on GitHub LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Iceberg Reference Implementation Iceberg Table Specification Netflix Hadoop Cloudera Avro Parquet Spark S3 HDFS Hive ORC S3mper Git Metacat Presto Pig DDL (Data Definition Language) Cost-Based Optimization

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Combining Transactional And Analytical Workloads On MemSQL with Nikita Shamgunov

2018-10-09 · Data Engineering Podcast Listen

podcast_episode

by Nikita Shamgunov (Neon) , Tobias Macey

AI/ML BI Cloud Computing Data Engineering Data Management Data Science DWH SQL Tableau

Summary

One of the most complex aspects of managing data for analytical workloads is moving it from a transactional database into the data warehouse. What if you didn’t have to do that at all? MemSQL is a distributed database built to support concurrent use by transactional, application oriented, and analytical, high volume, workloads on the same hardware. In this episode the CEO of MemSQL describes how the company and database got started, how it is architected for scale and speed, and how it is being used in production. This was a deep dive on how to build a successful company around a powerful platform, and how that platform simplifies operations for enterprise grade data management.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science. And the team at Metis Machine has shipped a proof-of-concept integration between the Skafos machine learning platform and the Tableau business intelligence tool, meaning that your BI team can now run the machine learning models custom built by your data science team. If you think that sounds awesome (and it is) then join the free webinar with Metis Machine on October 11th at 2 PM ET (11 AM PT). Metis Machine will walk through the architecture of the extension, demonstrate its capabilities in real time, and illustrate the use case for empowering your BI team to modify and run machine learning models directly from Tableau. Go to metismachine.com/webinars now to register. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. Your host is Tobias Macey and today I’m interviewing Nikita Shamgunov about MemSQL, a newSQL database built for simultaneous transactional and analytic workloads.

Interview

Introduction How did you get involved in the area of data management? Can you start by describing what MemSQL is and how the product and business first got started? What are the typical use cases for customers running MemSQL? What are the benefits of integrating the ingestion pipeline with the database engine? What are some typical ways that the ingest capability is leveraged by customers? How is MemSQL architected and how has the internal design evolved from when you first started working on it? Where does it fall on the axes of the CAP theorem? How much processing overhead is involved in the conversion from the column oriented data stored on disk to the row oriented data stored in memory? Can you describe the lifecycle of a write transaction? Can you discuss the techniques that are used in MemSQL to optimize for speed and overall system performance? How do you mitigate the impact of network latency throughout the cluster during query planning and execution? How much of the implementation of MemSQL is using custom built code vs. open source projects? What are some of the common difficulties that your customers encounter when building on top of or migrating to MemSQL? What have been some of the most challenging aspects of building and growing the technical and business implementation of MemSQL? When is MemSQL the wrong choice for a data platform? What do you have planned for the future of MemSQL?

Contact Info

@nikitashamgunov on Twitter LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

MemSQL NewSQL Microsoft SQL Server St. Petersburg University of Fine Mechanics And Optics C C++ In-Memory Database RAM (Random Access Memory) Flash Storage Oracle DB PostgreSQL

Podcast Episode

Kafka Kinesis Wealth Management Data Warehouse ODBC S3 HDFS Avro Parquet Data Serialization

Podcast Episode

Broadcast Join Shuffle Join CAP Theorem Apache Arrow LZ4 S2 Geospatial Library Sybase SAP Hana Kubernetes

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Understanding #BigData for #BigCities with Maksim ( @MrMaksimize @CityofSanDiego )

2018-10-04 · The Future of Data Podcast | conversation with leaders, influencers, and change makers in the World of Data & Analytics Listen

podcast_episode

by Maksim Pecherskiy (City of San Diego)

Agile/Scrum AI/ML Analytics Big Data CI/CD Data Science DevOps

In this podcast, Maksim, CDO @ City of San Diego, discusses the nuances of running big data for big cities. He shares his perspective on effectively building a central data office in a complex and extremely collaborative environment like a big city, along with his thoughts on how to prioritize which projects to pursue. He also explains how leadership and execution can blend to solve civic issues in big and small cities. A great practitioner podcast for folks seeking to build a robust data science practice across a large and collaborative ecosystem.

Timeline: 0:28 Maksim's journey. 6:45 Maksim's current role. 11:46 Collaboration process in creating a data inventory. 14:52 Working with the bureaucracy. 18:35 Dealing with unforeseen circumstances at work. 20:22 Prioritization at work. 22:58 Qualities of a good data leader. 26:15 Collaboration with other cities. 27:40 Cool data projects in other cities. 30:55 Shortcomings of other city representatives. 36:54 Use cases in AI 39:00 What would Maksim change about himself? 40:50 Future cities and data 43:55 Opportunities for private investors in the public sector. 45:53 Maksim's success mantra. 50:19 Closing remark.

Maksim's Book Recommendation: The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win by Gene Kim, Kevin Behr, George Spafford amzn.to/2MAu5Xv

Podcast Link: https://futureofdata.org/understanding-bigdata-for-bigcities-with-maksim-mrmaksimize-cityofsandiego-futureofdata-podcast/

Maksim's BIO: Maksim Pecherskiy: As the CDO for the City of San Diego, working in the Performance & Analytics Department, Maksim strives to bring the necessary components together to allow the City's residents to benefit from a more efficient, agile government that is as innovative as the community around it. He has been solving complex problems with technology for nearly a decade. He spent 2014 working as a Code For America fellow in Puerto Rico, focusing on economic development. His team delivered a product called PrimerPeso that provides business owners and residents a tool to search, and apply for, government programs for which they may be eligible.

Before moving to California, Maksim was a Solutions Architect at Promet Source in Chicago, where he built large web applications and designed complex integrations. He shaped workflow, configuration management, and continuous integration processes while leading and training international development teams. Before his work at Promet, he was a software engineer at AllPlayers, where he was instrumental in the design and architecture of its APIs and in the development and documentation of supporting client libraries in various languages.

Maksim graduated from DePaul University with a bachelor of science degree in information systems and from Linköping University, Sweden, with a bachelor of science degree in international business. He is also certified as a Lean Six Sigma Green Belt.

About #Podcast:

FutureOfData podcast is a conversation starter that brings leaders, influencers, and lead practitioners onto the show to discuss their journey in creating the data driven future.

Wanna Join? If you or anyone you know wants to join in, register your interest by mailing us @ [email protected]

Want to sponsor? Email us @ [email protected]

Keywords: FutureOfData, DataAnalytics, Leadership, Futurist, Podcast, BigData, Strategy

Building A Knowledge Graph From Public Data At Enigma With Chris Groskopf - Episode 50

2018-10-01 · Data Engineering Podcast Listen

podcast_episode

by Chris Groskopf (Enigma) , Tobias Macey

AI/ML Cloud Computing Data Engineering Data Management Data Science ETL/ELT

Summary

There are countless sources of data that are publicly available for use. Unfortunately, combining those sources and making them useful in aggregate is a time consuming and challenging process. The team at Enigma builds a knowledge graph for use in your own data projects. In this episode Chris Groskopf explains the platform they have built to consume large varieties and volumes of public data for constructing a graph for serving to their customers. He discusses the challenges they are facing to scale the platform and engineering processes, as well as the workflow that they have established to enable testing of their ETL jobs. This is a great episode to listen to for ideas on how to organize a data engineering organization.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. Your host is Tobias Macey and today I’m interviewing Chris Groskopf about Enigma and how they are using public data sources to build a knowledge graph.

Interview

Introduction How did you get involved in the area of data management? Can you give a brief overview of what Enigma has built and what the motivation was for starting the company?

How do you define the concept of a knowledge graph?

What are the processes involved in constructing a knowledge graph? Can you describe the overall architecture of your data platform and the systems that you use for storing and serving your knowledge graph? What are the most challenging or unexpected aspects of building the knowledge graph that you have encountered?

How do you manage the software lifecycle for your ETL code? What kinds of unit, integration, or acceptance tests do you run to ensure that you don’t introduce regressions in your processing logic?

What are the current challenges that you are facing in building and scaling your data infrastructure?

How does the fact that your data sources are primarily public influence your pipeline design and what challenges does it pose? What techniques are you using to manage accuracy and consistency in the data that you ingest?

Can you walk through the lifecycle of the data that you process from acquisition through to delivery to your customers? What are the weak spots in your platform that you are planning to address in upcoming projects?

If you were to start from scratch today, what would you have done differently?

What are some of the most interesting or unexpected uses of your product that you have seen? What is in store for the future of Enigma?

Contact Info

Email Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Enigma Chicago Tribune NPR Quartz CSVKit Aga

Redash v5 Quick Start Guide

2018-09-29 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Yael Leibzon , Alexander Leibzon

BI DataViz Redash Redis SQL data data-engineering nosql-databases

In the 'Redash v5 Quick Start Guide', you'll learn everything you need to master the Redash data visualization platform and confidently create compelling dashboards. This book covers how to connect to different data sources, use SQL to query data, and design and share insightful visualizations.

What this Book will help me do

Understand how to install, configure, and troubleshoot Redash for your data projects. Gain skills in managing user roles and permissions to ensure secure data collaboration. Learn to connect Redash to various data sources and fetch, process, and handle data. Master the creation of advanced visualizations to effectively present complex data. Develop proficiency in utilizing the Redash API for integrating programmatic interactions.

Author(s)

Leibzon is a recognized expert in data visualization and Business Intelligence tools, with years of experience working with data-driven systems. Drawing from his deep practical knowledge of Redash and its applications, Leibzon has crafted this guide to be accessible and highly practical. His goal is to enable learners and professionals to unlock the power of data storytelling through intuitive and actionable visualization.

Who is it for?

If you're a Data Analyst, BI professional, or Data Developer with basic SQL skills, this book is tailored for you. It assumes no prior knowledge of Redash but benefits those who understand fundamental Business Intelligence concepts. Whether you're looking to create your first visualization or streamline data collaboration, this guide will help you achieve your goals.

D3.js Quick Start Guide

2018-09-27 · O'Reilly Data Science Books O'Reilly Amazon

book

by Matthew Huntington

DataViz JavaScript d3 data data-science data-science-tasks data-visualization

D3.js Quick Start Guide is your go-to resource for mastering D3.js, a powerful JavaScript library for creating interactive visualizations in the browser. This book walks you through core concepts, from building scatter plots to creating force-directed graphs, helping you go from beginner to creating stunning visual data representations.

What this Book will help me do

Create interactive scatter plots showcasing data relationships. Implement bar graphs that dynamically update from API data. Design animated pie charts for visually appealing representations. Develop force-directed graphs to represent networked data. Leverage GeoJSON data for building informative interactive maps.

Author(s)

Matthew Huntington is an experienced web developer with a clear knack for turning complex topics into understandable concepts. With expertise in data visualization and web technologies, Huntington explains technical subject matter in a friendly and approachable manner, ensuring learners grasp both theoretical and practical aspects effectively.

Who is it for?

This book is ideal for web developers and data enthusiasts eager to learn how to represent data via interactive visualizations using D3.js. If you have a basic understanding of JavaScript and are looking to enhance your web development skillset with dynamic visualization techniques, this guide is perfect for you. Through easy-to-follow examples, you'll get up to speed quickly and start building professional-looking visualizations right away. Whether you're a data scientist, interactive news developer, or just interested in bringing data to life, this book is your key to mastering D3.js.

A Primer On Enterprise Data Curation with Todd Walter - Episode 49

2018-09-24 · Data Engineering Podcast Listen

podcast_episode

by Todd Walter , Tobias Macey

AI/ML Big Data Cloud Computing Data Engineering Data Lake Data Management Data Science DWH ETL/ELT

Summary As your data needs scale across an organization the need for a carefully considered approach to collection, storage, organization, and access becomes increasingly critical. In this episode Todd Walter shares his considerable experience in data curation to clarify the many aspects that are necessary for a successful platform for your business. Using the metaphor of a museum curator carefully managing the precious resources on display and in the vaults, he discusses the various layers of an enterprise data strategy. This includes modeling the lifecycle of your information as a pipeline from the raw, messy, loosely structured records in your data lake, through a series of transformations and ultimately to your data warehouse. He also explains which layers are useful for the different members of the business, and which pitfalls to look out for along the path to a mature and flexible data platform.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. Your host is Tobias Macey and today I’m interviewing Todd Walter about data curation and how to architect your data systems to support high quality, maintainable intelligence.

Interview

Introduction How did you get involved in the area of data management? How do you define data curation?

What are some of the high level concerns that are encapsulated in that effort?

How does the size and maturity of a company affect the ways that they architect and interact with their data systems? Can you walk through the stages of an ideal lifecycle for data within the context of an organization’s uses for it? What are some of the common mistakes that are made when designing a data architecture and how do they lead to failure? What has changed in terms of complexity and scope for data architecture and curation since you first started working in this space? As “big data” became more widely discussed the common mantra was to store everything because you never know when you’ll need the data that might get thrown away. As the industry is reaching a greater degree of maturity and more regulations are implemented there has been a shift to being more considerate as to what information gets stored and for how long. What are your views on that evolution and what is your litmus test for determining which data to keep? In terms of infrastructure, what are the components of a modern data architecture and how has that changed over the years?

What is your opinion on the relative merits of a data warehouse vs a data lake and are they mutually exclusive?

Once an architecture has been established, how do you allow for continued evolution to prevent stagnation and eventual failure? ETL has long been the default approac
