talk-data.com talk-data.com

Topic

AWS

Amazon Web Services (AWS)

cloud cloud provider infrastructure services

837

tagged

Activity Trend

190 peak/qtr
2020-Q1 2026-Q1

Activities

837 activities · Newest first

Multicloud Storage as a Service using vRealize Automation and IBM Spectrum Storage

This document is intended to facilitate the deployment of the Multicloud Solution for Business Continuity and Storage as service by using IBM Spectrum Virtualize for Public Cloud on Amazon Web Services (AWS). To complete the tasks it describes, you must understand IBM FlashSystem 9100, IBM Spectrum Virtualize for Public Cloud, IBM Spectrum Connect, VMware vRealize Orchestrator, and vRealize Automation and AWS Cloud. The information in this document is distributed on an "as is" basis without any warranty that is either expressed or implied. Support assistance for the use of this material is limited to situations where IBM Storwize or IBM FlashSystem storage devices are supported and entitled and where the issues are specific to a blueprint implementation.

Applied Supervised Learning with R

Applied Supervised Learning with R equips you with the essential knowledge and practical skills to leverage machine learning techniques for solving business problems using R. With this book, you'll gain hands-on experience in implementing various supervised learning models, assessing their performance, and selecting the best-suited method for your objectives. What this Book will help me do Gain expertise in identifying and framing business problems suitable for supervised learning. Acquire skills in data wrangling and visualization using R packages like dplyr and ggplot2. Master techniques for tuning hyperparameters to optimize machine learning models. Understand methods for feature selection and dimensionality reduction to enhance model performance. Learn how to deploy machine learning models to production environments, such as AWS Lambda. Author(s) Karthik Ramasubramanian and Jojo Moolayil are both seasoned data science practitioners and educators who bring a wealth of experience in machine learning and analytics. With a deep understanding of R and its applications in real-world scenarios, they offer practical insights and actionable examples to their readers. Their teaching style focuses on clarity and practical application. Who is it for? This book is ideal for data analysts, data scientists, and data engineers at a beginner to intermediate level who aim to master supervised machine learning with R. Readers should have basic knowledge of statistics, probabilities, and R programming. It is designed for those eager to apply machine learning techniques to real-world problems and improve their decision-making capabilities.

Summary Some problems in data are well defined and benefit from a ready-made set of tools. For everything else, there’s Pachyderm, the platform for data science that is built to scale. In this episode Joe Doliner, CEO and co-founder, explains how Pachyderm started as an attempt to make data provenance easier to track, how the platform is architected and used today, and examples of how the underlying principles manifest in the workflows of data engineers and data scientists as they collaborate on data projects. In addition to all of that he also shares his thoughts on their recent round of fund-raising and where the future will take them. If you are looking for a set of tools for building your data science workflows then Pachyderm is a solid choice, featuring data versioning, first class tracking of data lineage, and language agnostic data pipelines.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support. Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit. You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show please leave

Summary In recent years the traditional approach to building data warehouses has shifted from transforming records before loading, to transforming them afterwards. As a result, the tooling for those transformations needs to be reimagined. The data build tool (dbt) is designed to bring battle tested engineering practices to your analytics pipelines. By providing an opinionated set of best practices it simplifies collaboration and boosts confidence in your data teams. In this episode Drew Banin, creator of dbt, explains how it got started, how it is designed, and how you can start using it today to create reliable and well-tested reports in your favorite data warehouse.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit. You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Drew Banin about DBT, the Data Build Tool, a toolkit for building analytics the way that developers build applications

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what DBT is and your motivation for creating it? Where does it fit in the overall landscape of data tools and the lifecycle of data in an analytics pipeline? Can you talk through the workflow for someone using DBT? One of the useful features of DBT for stability of analytics is the ability to write and execute tests. Can you explain how those are implemented? The packaging capabilities are beneficial for enabling collaboration. Can you talk through how the packaging system is implemented?

Are these packages driven by Fishtown Analytics or the dbt community?

What are the limitations of modeling everything as a SELECT statement? Making SQL code reusable is notoriously difficult. How does the Jinja templating of DBT address this issue and what are the shortcomings?

What are your thoughts on higher level approaches to SQL that compile down to the specific statements?

Can you explain how DBT is implemented and how the design has evolved since you first began working on it? What are some of the features of DBT that are often overlooked which you find particularly useful? What are some of the most interesting/unexpected/innovative ways that you have seen DBT used? What are the additional features that the commercial version of DBT provides? What are some of the most useful or challenging lessons that you have learned in the process of building and maintaining DBT? When is it the wrong choice? What do you have planned for the future of DBT?

Contact Info

Email @drebanin on Twitter drebanin on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

DBT Fishtown Analytics 8Tracks Internet Radio Redshift Magento Stitch Data Fivetran Airflow Business Intelligence Jinja template language BigQuery Snowflake Version Control Git Continuous Integration Test Driven Development Snowplow Analytics

Podcast Episode

dbt-utils We Can Do Better Than SQL blog post from EdgeDB EdgeDB Looker LookML

Podcast Interview

Presto DB

Podcast Interview

Spark SQL Hive Azure SQL Data Warehouse Data Warehouse Data Lake Data Council Conference Slowly Changing Dimensions dbt Archival Mode Analytics Periscope BI dbt docs dbt repository

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Summary The database market continues to expand, offering systems that are suited to virtually every use case. But what happens if you need something customized to your application? FoundationDB is a distributed key-value store that provides the primitives that you need to build a custom database platform. In this episode Ryan Worl explains how it is architected, how to use it for your applications, and provides examples of system design patterns that can be built on top of it. If you need a foundation for your distributed systems, then FoundationDB is definitely worth a closer look.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support. Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit. You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Ryan Worl about FoundationDB, a distributed key/value store that gives you t

Summary Kubernetes is a driving force in the renaissance around deploying and running applications. However, managing the database layer is still a separate concern. The KubeDB project was created as a way of providing a simple mechanism for running your storage system in the same platform as your application. In this episode Tamal Saha explains how the KubeDB project got started, why you might want to run your database with Kubernetes, and how to get started. He also covers some of the challenges of managing stateful services in Kubernetes and how the fast pace of the community has contributed to the evolution of KubeDB. If you are at any stage of a Kubernetes implementation, or just thinking about it, this is definitely worth a listen to get some perspective on how to leverage it for your entire application stack.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support. Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit. You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show please leave a review on iTunes and tell your fri

Summary One of the biggest challenges for any business trying to grow and reach customers globally is how to scale their data storage. FaunaDB is a cloud native database built by the engineers behind Twitter’s infrastructure and designed to serve the needs of modern systems. Evan Weaver is the co-founder and CEO of Fauna and in this episode he explains the unique capabilities of Fauna, compares the consensus and transaction algorithm to that used in other NewSQL systems, and describes the ways that it allows for new application design patterns. One of the unique aspects of Fauna that is worth drawing attention to is the first class support for temporality that simplifies querying of historical states of the data. It is definitely worth a good look for anyone building a platform that needs a simple to manage data layer that will scale with your business.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support. Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit. You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Evan Weaver about FaunaDB, a modern operational data platform built for your cloud

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what FaunaDB is and how it got started? What are some of the main use cases that FaunaDB is targeting?

How does it compare to some of the other global scale databases that have been built in recent years such as CockroachDB?

Can you describe the architecture of FaunaDB and how it has evolved? The consensus and replication protocol in Fauna is intriguing. Can you talk through how it works?

What are some of the edge cases that users should be aware of? How are conflicts managed in Fauna?

What is the underlying storage layer?

How is the query layer designed to allow for different query patterns and model representations?

How does data modeling in Fauna compare to that of relational or document databases?

Can you describe the query format? What are some of the common difficulties or points of confusion around interacting with data in Fauna?

What are some application design patterns that are enabled by using Fauna as the storage layer? Given the ability to replicate globally, how do you mitigate latency when interacting with the database? What are some of the most interesting or unexpected ways that you have seen Fauna used? When is it the wrong choice? What have been some of the most interesting/unexpected/challenging aspects of building the Fauna database and company? What do you have in store for the future of Fauna?

Contact Info

@evan on Twitter LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Fauna Ruby on Rails CNET GitHub Twitter NoSQL Cassandra InnoDB Redis Memcached Timeseries Spanner Paper DynamoDB Paper Percolator ACID Calvin Protocol Daniel Abadi LINQ LSM Tree (Log-structured Merge-tree) Scala Change Data Capture GraphQL

Podcast.init Interview About Graphene

Fauna Query Language (FQL) CQL == Cassandra Query Language Object-Relational Databases LDAP == Lightweight Directory Access Protocol Auth0 OLAP == Online Analytical Processing Jepsen distributed systems safety research

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Summary Database indexes are critical to ensure fast lookups of your data, but they are inherently tied to the database engine. Pilosa is rewriting that equation by providing a flexible, scalable, performant engine for building an index of your data to enable high-speed aggregate analysis. In this episode Seebs explains how Pilosa fits in the broader data landscape, how it is architected, and how you can start using it for your own analysis. This was an interesting exploration of a different way to look at what a database can be.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support. Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit. You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Seebs about Pilosa, an open source, distributed bitmap index

Interview

Introduction How did you get involved in the area of data

Summary

The past year has been an active one for the timeseries market. New products have been launched, more businesses have moved to streaming analytics, and the team at Timescale has been keeping busy. In this episode the TimescaleDB CEO Ajay Kulkarni and CTO Michael Freedman stop by to talk about their 1.0 release, how the use cases for timeseries data have proliferated, and how they are continuing to simplify the task of processing your time oriented events.

Introduction

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m welcoming Ajay Kulkarni and Mike Freedman back to talk about how TimescaleDB has grown and changed over the past year

Interview

Introduction How did you get involved in the area of data management? Can you refresh our memory about what TimescaleDB is? How has the market for timeseries databases changed since we last spoke? What has changed in the focus and features of the TimescaleDB project and company? Toward the end of 2018 you launched the 1.0 release of Timescale. What were your criteria for establishing that milestone?

What were the most challenging aspects of reaching that goal?

In terms of timeseries workloads, what are some of the factors that differ across varying use cases?

How do those differences impact the ways in which Timescale is used by the end user, and built by your team?

What are some of the initial assumptions that you made while first launching Timescale that have held true, and which have been disproven? How have the improvements and new features in the recent releases of PostgreSQL impacted the Timescale product?

Have you been able to leverage some of the native improvements to simplify your implementation? Are there any use cases for Timescale that would have been previously impractical in vanilla Postgres that would now be reasonable without the help of Timescale?

What is in store for the future of the Timescale product and organization?

Contact Info

Ajay

@acoustik on Twitter LinkedIn

Mike

LinkedIn Website @michaelfreedman on Twitter

Timescale

Website Documentation Careers timescaledb on GitHub @timescaledb on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

TimescaleDB Original Appearance on the Data Engineering Podcast 1.0 Release Blog Post PostgreSQL

Podcast Interview

RDS DB-Engines MongoDB IOT (Internet Of Things) AWS Timestream Kafka Pulsar

Podcast Episode

Spark

Podcast Episode

Flink

Podcast Episode

Hadoop DevOps PipelineDB

Podcast Interview

Grafana Tableau Prometheus OLTP (Online Transaction Processing) Oracle DB Data Lake

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

podcast_episode
by Val Kroll , Julie Hoyer , Simo Ahava (NetBooster, Helsinki - Finland) , Tim Wilson (Analytics Power Hour - Columbus (OH) , Moe Kiss (Canva) , Michael Helbling (Search Discovery)

Are you deeply knowledgable in JavaScript, R, the DOM, Python, AWS, jQuery, Google Cloud Platform, and SQL? Good for you! If you're not, should you be? What does "technical" mean, anyway? And, is it even possible for an analyst to dive into all of these different areas? English philosophy expert The Notorious C.M.O. (aka, Simo Ahava) returns to the show to share his thoughts on the subject in this episode. For complete show notes, including links to items mentioned in this episode and a transcript of the show, visit the show page.

Summary

Every business with a website needs some way to keep track of how much traffic they are getting, where it is coming from, and which actions are being taken. The default in most cases is Google Analytics, but this can be limiting when you wish to perform detailed analysis of the captured data. To address this problem, Alex Dean co-founded Snowplow Analytics to build an open source platform that gives you total control of your website traffic data. In this episode he explains how the project and company got started, how the platform is architected, and how you can start using it today to get a clearer view of how your customers are interacting with your web and mobile applications.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat This is your host Tobias Macey and today I’m interviewing Alexander Dean about Snowplow Analytics

Interview

Introductions How did you get involved in the area of data engineering and data management? What is Snowplow Analytics and what problem were you trying to solve when you started the company? What is unique about customer event data from an ingestion and processing perspective? Challenges with properly matching up data between sources Data collection is one of the more difficult aspects of an analytics pipeline because of the potential for inconsistency or incorrect information. How is the collection portion of the Snowplow stack designed and how do you validate the correctness of the data?

Cleanliness/accuracy

What kinds of metrics should be tracked in an ingestion pipeline and how do you monitor them to ensure that everything is operating properly? Can you describe the overall architecture of the ingest pipeline that Snowplow provides?

How has that architecture evolved from when you first started? What would you do differently if you were to start over today?

Ensuring appropriate use of enrichment sources What have been some of the biggest challenges encountered while building and evolving Snowplow? What are some of the most interesting uses of your platform that you are aware of?

Keep In Touch

Alex

@alexcrdean on Twitter LinkedIn

Snowplow

@snowplowdata on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Snowplow

GitHub

Deloitte Consulting OpenX Hadoop AWS EMR (Elastic Map-Reduce) Business Intelligence Data Warehousing Google Analytics CRM (Customer Relationship Management) S3 GDPR (General Data Protection Regulation) Kinesis Kafka Google Cloud Pub-Sub JSON-Schema Iglu IAB Bots And Spiders List Heap Analytics

Podcast Interview

Redshift SnowflakeDB Snowplow Insights Googl

Summary

Elasticsearch is a powerful tool for storing and analyzing data, but when using it for logs and other time oriented information it can become problematic to keep all of your history. Chaos Search was started to make it easy for you to keep all of your data and make it usable in S3, so that you can have the best of both worlds. In this episode the CTO, Thomas Hazel, and VP of Product, Pete Cheslock, describe how they have built a platform to let you keep all of your history, save money, and reduce your operational overhead. They also explain some of the types of data that you can use with Chaos Search, how to load it into S3, and when you might want to choose it over Amazon Athena for our serverless data analysis.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $/0 credit and launch a new server in under a minute. You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Pete Cheslock and Thomas Hazel about Chaos Search and their effort to bring historical depth to your Elasticsearch data

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what you have built at Chaos Search and the problems that you are trying to solve with it?

What types of data are you focused on supporting? What are the challenges inherent to scaling an elasticsearch infrastructure to large volumes of log or metric data?

Is there any need for an Elasticsearch cluster in addition to Chaos Search? For someone who is using Chaos Search, what mechanisms/formats would they use for loading their data into S3? What are the benefits of implementing the Elasticsearch API on top of your data in S3 as opposed to using systems such as Presto or Drill to interact with the same information via SQL? Given that the S3 API has become a de facto standard for many other object storage platforms, what would be involved in running Chaos Search on data stored outside of AWS? What mechanisms do you use to allow for such drastic space savings of indexed data in S3 versus in an Elasticsearch cluster? What is the system architecture that you have built to allow for querying terabytes of data in S3?

What are the biggest contributors to query latency and what have you done to mitigate them?

What are the options for access control when running queries against the data stored in S3? What are some of the most interesting or unexpected uses of Chaos Search and access to large amounts of historical log information that you have seen? What are your plans for the future of Chaos Search?

Contact Info

Pete Cheslock

@petecheslock on Twitter Website

Thomas Hazel

@thomashazel on Twitter LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tool

Summary

The way that you store your data can have a huge impact on the ways that it can be practically used. For a substantial number of use cases, the optimal format for storing and querying that information is as a graph, however databases architected around that use case have historically been difficult to use at scale or for serving fast, distributed queries. In this episode Manish Jain explains how DGraph is overcoming those limitations, how the project got started, and how you can start using it today. He also discusses the various cases where a graph storage layer is beneficial, and when you would be better off using something else. In addition he talks about the challenges of building a distributed, consistent database and the tradeoffs that were made to make DGraph a reality.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. If you have ever wished that you could use the same tools for versioning and distributing your data that you use for your software then you owe it to yourself to check out what the fine folks at Quilt Data have built. Quilt is an open source platform for building a sane workflow around your data that works for your whole team, including version history, metatdata management, and flexible hosting. Stop by their booth at JupyterCon in New York City on August 22nd through the 24th to say Hi and tell them that the Data Engineering Podcast sent you! After that, keep an eye on the AWS marketplace for a pre-packaged version of Quilt for Teams to deploy into your own environment and stop fighting with your data. Python has quickly become one of the most widely used languages by both data engineers and data scientists, letting everyone on your team understand each other more easily. However, it can be tough learning it when you’re just starting out. Luckily, there’s an easy way to get involved. Written by MIT lecturer Ana Bell and published by Manning Publications, Get Programming: Learn to code with Python is the perfect way to get started working with Python. Ana’s experience

as a teacher of Python really shines through, as you get hands-on with the language without being drowned in confusing jargon or theory. Filled with practical examples and step-by-step lessons to take on, Get Programming is perfect for people who just want to get stuck in with Python. Get your copy of the book with a special 40% discount for Data Engineering Podcast listeners by going to dataengineeringpodcast.com/get-programming and use the discount code PodInit40! Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Manish Jain about DGraph, a low latency, high throughput, native and distributed graph database.

Interview

Introduction How did you get involved in the area of data management? What is DGraph and what motivated you to build it? Graph databases and graph algorithms have been part of the computing landscape for decades. What has changed in recent years to allow for the current proliferation of graph oriented storage systems?

The graph space is becoming crowded in recent years. How does DGraph compare to the current set of offerings?

What are some of the common uses of graph storage systems?

What are some potential uses that are often overlooked?

There are a few ways that graph structures and properties can be implemented, including the ability t

Summary

The theory behind how a tool is supposed to work and the realities of putting it into practice are often at odds with each other. Learning the pitfalls and best practices from someone who has gained that knowledge the hard way can save you from wasted time and frustration. In this episode James Meickle discusses his recent experience building a new installation of Airflow. He points out the strengths, design flaws, and areas of improvement for the framework. He also describes the design patterns and workflows that his team has built to allow them to use Airflow as the basis of their data science platform.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing James Meickle about his experiences building a new Airflow installation

Interview

Introduction How did you get involved in the area of data management? What was your initial project requirement?

What tooling did you consider in addition to Airflow? What aspects of the Airflow platform led you to choose it as your implementation target?

Can you describe your current deployment architecture?

How many engineers are involved in writing tasks for your Airflow installation?

What resources were the most helpful while learning about Airflow design patterns?

How have you architected your DAGs for deployment and extensibility?

What kinds of tests and automation have you put in place to support the ongoing stability of your deployment? What are some of the dead-ends or other pitfalls that you encountered during the course of this project? What aspects of Airflow have you found to be lacking that you would like to see improved? What did you wish someone had told you before you started work on your Airflow installation?

If you were to start over would you make the same choice? If Airflow wasn’t available what would be your second choice?

What are your next steps for improvements and fixes?

Contact Info

@eronarn on Twitter Website eronarn on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Quantopian Harvard Brain Science Initiative DevOps Days Boston Google Maps API Cron ETL (Extract, Transform, Load) Azkaban Luigi AWS Glue Airflow Pachyderm

Podcast Interview

AirBnB Python YAML Ansible REST (Representational State Transfer) SAML (Security Assertion Markup Language) RBAC (Role-Based Access Control) Maxime Beauchemin

Medium Blog

Celery Dask

Podcast Interview

PostgreSQL

Podcast Interview

Redis Cloudformation Jupyter Notebook Qubole Astronomer

Podcast Interview

Gunicorn Kubernetes Airflow Improvement Proposals Python Enhancement Proposals (PEP)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Big Data Analytics with Hadoop 3

Big Data Analytics with Hadoop 3 is your comprehensive guide to understanding and leveraging the power of Apache Hadoop for large-scale data processing and analytics. Through practical examples, it introduces the tools and techniques necessary to integrate Hadoop with other popular frameworks, enabling efficient data handling, processing, and visualization. What this Book will help me do Understand the foundational components and features of Apache Hadoop 3 such as HDFS, YARN, and MapReduce. Gain the ability to integrate Hadoop with programming languages like Python and R for data analysis. Learn the skills to utilize tools such as Apache Spark and Apache Flink for real-time data analytics within the Hadoop ecosystem. Develop expertise in setting up a Hadoop cluster and performing analytics in cloud environments such as AWS. Master the process of building practical big data analytics pipelines for end-to-end data processing. Author(s) Sridhar Alla is a seasoned big data professional with extensive industry experience in building and deploying scalable big data analytics solutions. Known for his expertise in Hadoop and related ecosystems, Sridhar combines technical depth with clear communication in his writing, providing practical insights and hands-on knowledge. Who is it for? This book is tailored for data professionals, software engineers, and data scientists looking to expand their expertise in big data analytics using Hadoop 3. Whether you're an experienced developer or new to the big data ecosystem, this book provides the step-by-step guidance and practical examples needed to advance your skills and achieve your analytical goals.

Python Web Scraping Cookbook

Python Web Scraping Cookbook is your comprehensive guide to building efficient and functional web scraping tools using Python. With practical recipes, you'll learn to overcome the challenges of dynamic content, captcha, and irregular web structures while deploying scalable solutions. What this Book will help me do Master the use of Python libraries like BeautifulSoup and Scrapy for scraping data. Perfect techniques for handling JavaScript-heavy sites using Selenium. Learn to overcome web scraping challenges, such as captchas and rate-limiting. Design scalable scraping pipelines with cloud deployment in AWS. Understand web data extraction techniques with XPath, CSS selectors, and more. Author(s) Michael Heydt is a seasoned software engineer and technical author with a focus on data engineering and cloud solutions. Having worked with Python extensively, he brings real-world insights into web scraping. His practical approach simplifies complex concepts. Who is it for? This book is perfect for Python developers and data enthusiasts keen to master web scraping techniques. If you're a programmer with insights into Python scripting and wish to scrape, analyze, and utilize web data efficiently, this book is for you.

If you work with a media agency (or are one) the first question to ask them is how many data scientists do you have? Do you prefer Amazon Web Services, Microsoft Azure, or the Google Cloud Platform? Come see examples from one of Canada's largest retailers of advertising spending that is wasted from poor targeting, access issues, and lack of big data understanding. We will also dive into examples of broken implementations of Analytics that cause even more issues. If you are not in-sourcing the core components of your Media and Analytics you are almost certainly at risk or already suffering from many of these problems. In this session, Martin and Charles Farina will show you what you need to find the right partner, but more importantly what you also have to provide.

Summary

Buzzfeed needs to be able to understand how its users are interacting with the myriad articles, videos, etc. that they are posting. This lets them produce new content that will continue to be well-received. To surface the insights that they need to grow their business they need a robust data infrastructure to reliably capture all of those interactions. Walter Menendez is a data engineer on their infrastructure team and in this episode he describes how they manage data ingestion from a wide array of sources and create an interface for their data scientists to produce valuable conclusions.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers Your host is Tobias Macey and today I’m interviewing Walter Menendez about the data engineering platform at Buzzfeed

Interview

Introduction How did you get involved in the area of data management? How is the data engineering team at Buzzfeed structured and what kinds of projects are you responsible for? What are some of the types of data inputs and outputs that you work with at Buzzfeed? Is the core of your system using a real-time streaming approach or is it primarily batch-oriented and what are the business needs that drive that decision? What does the architecture of your data platform look like and what are some of the most significant areas of technical debt? Which platforms and languages are most widely leveraged in your team and what are some of the outliers? What are some of the most significant challenges that you face, both technically and organizationally? What are some of the dead ends that you have run into or failed projects that you have tried? What has been the most successful project that you have completed and how do you measure that success?

Contact Info

@hackwalter on Twitter walterm on GitHub

Links

Data Literacy MIT Media Lab Tumblr Data Capital Data Infrastructure Google Analytics Datadog Python Numpy SciPy NLTK Go Language NSQ Tornado PySpark AWS EMR Redshift Tracking Pixel Google Cloud Don’t try to be google Stop Hiring DevOps Engineers and Start Growing Them

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Web Development with MongoDB and Node - Third Edition

Explore the power of combining Node.js and MongoDB to build modern, scalable web applications in 'Web Development with MongoDB and Node.' You'll not only learn how to integrate these two technologies effectively, but you'll also gain practical insights into using modern frameworks like Express and Angular to build feature-rich web apps. What this Book will help me do Master core concepts of Node.js and MongoDB for efficient web development. Learn to build and configure a web server using the Express.js framework. Implement data persistence with MongoDB using the Mongoose ODM library. Automate testing using tools like Mocha and streamline workflows with Gulp. Deploy applications to cloud platforms like Heroku, AWS, and Microsoft Azure. Author(s) Jason Krol and None Joseph D'mello, along with None Satheesh, bring extensive experience in web development and technical writing to this book. The authors have collectively worked on cutting-edge web technologies for years and are passionate about sharing their expertise to help developers create efficient web applications. Who is it for? This book is perfect for JavaScript developers at any proficiency level who are looking to expand their skills into full-stack development with Node.js and MongoDB. Even if you have a basic understanding of JavaScript and HTML, this book will guide you through building complete web applications from scratch. If you're eager to learn and create performant, scalable web apps, this book is for you.

Summary

Building a data pipeline that is reliable and flexible is a difficult task, especially when you have a small team. Astronomer is a platform that lets you skip straight to processing your valuable business data. Ry Walker, the CEO of Astronomer, explains how the company got started, how the platform works, and their commitment to open source.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at www.dataengineeringpodcast.com/linode?utm_source=rss&utm_medium=rss and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers This is your host Tobias Macey and today I’m interviewing Ry Walker, CEO of Astronomer, the platform for data engineering.

Interview

Introduction How did you first get involved in the area of data management? What is Astronomer and how did it get started? Regulatory challenges of processing other people’s data What does your data pipelining architecture look like? What are the most challenging aspects of building a general purpose data management environment? What are some of the most significant sources of technical debt in your platform? Can you share some of the failures that you have encountered while architecting or building your platform and company and how you overcame them? There are certain areas of the overall data engineering workflow that are well defined and have numerous tools to choose from. What are some of the unsolved problems in data management? What are some of the most interesting or unexpected uses of your platform that you are aware of?

Contact Information

Email @rywalker on Twitter

Links

Astronomer Kiss Metrics Segment Marketing tools chart Clickstream HIPAA FERPA PCI Mesos Mesos DC/OS Airflow SSIS Marathon Prometheus Grafana Terraform Kafka Spark ELK Stack React GraphQL PostGreSQL MongoDB Ceph Druid Aries Vault Adapter Pattern Docker Kinesis API Gateway Kong AWS Lambda Flink Redshift NOAA Informatica SnapLogic Meteor

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast