talk-data.com talk-data.com

Topic

SQL

Structured Query Language (SQL)

database_language data_manipulation data_definition programming_language

1751

tagged

Activity Trend

107 peak/qtr
2020-Q1 2026-Q1

Activities

1751 activities · Newest first

Summary Data engineering is a practice that is multi-faceted and requires integration with a large number of systems. This often means working across multiple tools to get the job done which can introduce significant cost to productivity due to the number of context switches. Rivery is a platform designed to reduce this incidental complexity and provide a single system for working across the different stages of the data lifecycle. In this episode CEO and founder Itamar Ben hemo explains how his experiences in the industry led to his vision for the Rivery platform as a single place to build end-to-end analytical workflows, including how it is architected and how you can start using it today for your own work.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold. Are you looking for a structured and battle-tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all your questions? Join Pipeline Academy, the worlds first data engineering bootcamp. Learn in small groups with likeminded professionals for 9 weeks part-time to level up in your career. The course covers the most relevant and essential data and software engineering topics that enable you to start your journey as a professional data engineer or analytics engineer. Plus we have AMAs with world-class guest speakers every week! The next cohort starts in April 2022. Visit dataengineeringpodcast.com/academy and apply now! Your host is Tobias Macey and today I’m interviewing Itamar Ben Hemo about Rivery, a SaaS platform designed to provide an end-to-end solution for Ingestion, Transformation, Orchestration,

Summary The flexibility of software oriented data workflows is useful for fulfilling complex requirements, but for simple and repetitious use cases it adds significant complexity. Coalesce is a platform designed to reduce repetitive work for common workflows by adopting a visual pipeline builder to support your data warehouse transformations. In this episode Satish Jayanthi explains how he is building a framework to allow enterprises to move quickly while maintaining guardrails for data workflows. This allows everyone in the business to participate in data analysis in a sustainable manner.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold. Are you looking for a structured and battle-tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all your questions? Join Pipeline Academy, the worlds first data engineering bootcamp. Learn in small groups with likeminded professionals for 9 weeks part-time to level up in your career. The course covers the most relevant and essential data and software engineering topics that enable you to start your journey as a professional data engineer or analytics engineer. Plus we have AMAs with world-class guest speakers every week! The next cohort starts in April 2022. Visit dataengineeringpodcast.com/academy and apply now! Your host is Tobias Macey and today I’m interviewing Satish Jayanthi about how organizations can use data architectural patterns to stay competitive in today’s data-rich environment

Interview

Introduction How did you get involved in the area of data management? Can you describe what you are building at C

Summary At the foundational layer many databases and data processing engines rely on key/value storage for managing the layout of information on the disk. RocksDB is one of the most popular choices for this component and has been incorporated into popular systems such as ksqlDB. As these systems are scaled to larger volumes of data and higher throughputs the RocksDB engine can become a bottleneck for performance. In this episode Adi Gelvan shares the work that he and his team at SpeeDB have put into building a drop-in replacement for RocksDB that eliminates that bottleneck. He explains how they redesigned the core algorithms and storage management features to deliver ten times faster throughput, how the lower latencies work to reduce the burden on platform engineers, and how they are working toward an open source offering so that you can try it yourself with no friction.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold. TimescaleDB, from your friends at Timescale, is the leading open-source relational database with support for time-series data. Time-series data is time stamped so you can measure how a system is changing. Time-series data is relentless and requires a database like TimescaleDB with speed and petabyte-scale. Understand the past, monitor the present, and predict the future. That’s Timescale. Visit them today at dataengineeringpodcast.com/timescale Your host is Tobias Macey and today I’m interviewing Adi Gelvan about his work on SpeeDB, the "next generation data engine"

Interview

Introduction How did you get involved in the area of data management? Can you describe what SpeeDB is and the story behind it? What is your target market and customer?

What are some of the shortcomings of RocksDB t

Introducing Charticulator for Power BI: Design Vibrant and Customized Visual Representations of Data

Create stunning and complex visualizations using the amazing Charticulator custom visuals in Power BI. Charticulator offers users immense power to generate visuals and graphics. To a beginner, there are myriad settings and options that can be combined in what feels like an unlimited number of combinations, giving it the unfair label, “the DAX of the charting world”. This is not true. This book is your start-to-finish guide to using Charticulator, a custom visualization software that Microsoft integrated into Power BI Desktop so that Power BI users can create incredibly powerful, customized charts and graphs. You will learn the concepts that underpin the software, journeying through every building block of chart design, enabling you to combine these parts to create spectacular visuals that represent the story of your data. Unlike other custom Power BI visuals, Charticulator runs in a separate application window within Power BI with its own interface and requires a different set of interactions and associated knowledge. This book covers the ins and outs of all of them. What You Will Learn Generate inspirational and technically competent visuals with no programming or other specialist technical knowledge Create charts that are not restricted to conventional chart types such as bar, line, or pie Limit the use of diverse Power BI custom visuals to one Charticulator custom visual Alleviate frustrations with the limitations of default chart types in Power BI, such as being able to plot data on only one categorical axis Use a much richer set of options to compare different sets of data Re-use your favorite or most often used chart designs with Charticulator templates Who This Book Is For The average Power BI user. It assumes no prior knowledge on the part of the reader other than being able to open Power BI desktop, import data, and create a simple Power BI visual. User experiences may vary, from people attending a Power BI training course to those with varying skills and abilities, from SQL developers and advanced Excel users to people with limited data analysis experience and technical skills.

Modern Data Engineering with Apache Spark: A Hands-On Guide for Building Mission-Critical Streaming Applications

Leverage Apache Spark within a modern data engineering ecosystem. This hands-on guide will teach you how to write fully functional applications, follow industry best practices, and learn the rationale behind these decisions. With Apache Spark as the foundation, you will follow a step-by-step journey beginning with the basics of data ingestion, processing, and transformation, and ending up with an entire local data platform running Apache Spark, Apache Zeppelin, Apache Kafka, Redis, MySQL, Minio (S3), and Apache Airflow. Apache Spark applications solve a wide range of data problems from traditional data loading and processing to rich SQL-based analysis as well as complex machine learning workloads and even near real-time processing of streaming data. Spark fits well as a central foundation for any data engineering workload. This book will teach you to write interactive Spark applications using Apache Zeppelin notebooks, write and compilereusable applications and modules, and fully test both batch and streaming. You will also learn to containerize your applications using Docker and run and deploy your Spark applications using a variety of tools such as Apache Airflow, Docker and Kubernetes. ​Reading this book will empower you to take advantage of Apache Spark to optimize your data pipelines and teach you to craft modular and testable Spark applications. You will create and deploy mission-critical streaming spark applications in a low-stress environment that paves the way for your own path to production. ​ What You Will Learn Simplify data transformation with Spark Pipelines and Spark SQL Bridge data engineering with machine learning Architect modular data pipeline applications Build reusable application components and libraries Containerize your Spark applications for consistency and reliability Use Docker and Kubernetes to deploy your Spark applications Speed up application experimentation using Apache Zeppelin and Docker Understand serializable structured data and data contracts Harness effective strategies for optimizing data in your data lakes Build end-to-end Spark structured streaming applications using Redis and Apache Kafka Embrace testing for your batch and streaming applications Deploy and monitor your Spark applications Who This Book Is For Professional software engineers who want to take their current skills and apply them to new and exciting opportunities within the data ecosystem, practicing data engineers who are looking for a guiding light while traversing the many challenges of moving from batch to streaming modes, data architects who wish to provide clear and concise direction for how best to harness anduse Apache Spark within their organization, and those interested in the ins and outs of becoming a modern data engineer in today's fast-paced and data-hungry world

Summary Data assets and the pipelines that create them have become critical production infrastructure for companies. This adds a requirement for reliability and management of up-time similar to application infrastructure. In this episode Francisco Alberini and Mei Tao share their insights on what incident management looks like for data platforms and the teams that support them.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder. Are you looking for a structured and battle-tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all your questions? Join Pipeline Academy, the worlds first data engineering bootcamp. Learn in small groups with likeminded professionals for 9 weeks part-time to level up in your career. The course covers the most relevant and essential data and software engineering topics that enable you to start your journey as a professional data engineer or analytics engineer. Plus we have AMAs with world-class guest speakers every week! The next cohort starts in April 2022. Visit dataengineeringpodcast.com/academy and apply now! Your host is Tobias Macey and today I’m interviewing Francisco Alberini and Mei Tao about patterns and practices for incident management in data teams

Interview

Introduction How did you get involved in the area of data management? Can you start by describing some of the ways that an "incident" can manifest in a data system?

At a high level, what are the steps and participants required to bring an incident to resolution?

The principle of incident management is familiar to application/site reliability teams. What is the current state of the art/adoption for these practices among data teams? What are the signals that teams should be monitoring to identify and alert on potential incidents?

Alerting is a subjective and nuanced practice, regardless of the context. What are some useful practices that you have seen and enacted to reduce alert fatigue

Summary Data and analytics are permeating every system, including customer-facing applications. The introduction of embedded analytics to an end-user product creates a significant shift in requirements for your data layer. The Pinot OLAP datastore was created for this purpose, optimizing for low latency queries on rapidly updating datasets with highly concurrent queries. In this episode Kishore Gopalakrishna and Xiang Fu explain how it is able to achieve those characteristics, their work at StarTree to make it more easily available, and how you can start using it for your own high throughput data workloads today.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan. This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Signup for the SaaS product today at dataengineeringpodcast.com/acryl Your host is Tobias Macey and today I’m interviewing Kishore Gopalakrishna and Xiang Fu about Apache Pinot and its applications for powering user-facing analytics

Interview

Introduction How did you get involved in the area of data management? Can you describe what Pinot is and the story behind it? What are the primary use cases that Pinot is designed to support? There are numerous OLAP engines available with varying tradeoffs and optimal use cases. What are the cases where Pinot is the preferred choice?

How does it compare to systems such as Clickhouse (for OLAP) or CubeJS/GoodData (for embedded analytics)?

How do the operational needs of a database engine change as you move from serving internal stakeholders to external end-users? Can you describe how Pinot is architected?

What were the key design elements that were necessary to support low-latency queries with high concurrency?

Can you describe a typical end-to-end architecture where Pinot will be used for embedded analytics?

What are some of the tools/technologies/platforms/design patterns that Pinot might replace or obviate?

What are some of the useful lessons related to data modeling that users of Pinot should consider?

What are some edge cases that they might encounter due to details of how the storage layer is architected? (e.g. data

Summary The modern data stack is a constantly moving target which makes it difficult to adopt without prior experience. In order to accelerate the time to deliver useful insights at organizations of all sizes that are looking to take advantage of these new and evolving architectures Tarush Aggarwal founded 5X Data. In this episode he explains how he works with these companies to deploy the technology stack and pairs them with an experienced engineer who assists with the implementation and training to let them realize the benefits of this architecture. He also shares his thoughts on the current state of the ecosystem for modern data vendors and trends to watch as we move into the future.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy. So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan. Your host is Tobias Macey and today I’m interviewing Tarush Agarwal about how he and his team are helping organizations streamline adoption of the modern data stack

Interview

Introduction How did you get involved in the area of data management? Can you describe what you are doing at 5x and the story behind it? How has your focus and operating model shifted since we spoke a year ago?

What are the biggest shifts in the market for data management that you have seen in that time?

What are the main challenges that your customers are facing when they start working with you? What are the components that you are relying on to build repeatable data platforms for your customers?

What are the sharp edges that you have had to smooth out to scale your implementation of those

Summary Data observability is a term that has been co-opted by numerous vendors with varying ideas of what it should mean. At Acceldata, they view it as a holistic approach to understanding the computational and logical elements that power your analytical capabilities. In this episode Tristan Spaulding, head of product at Acceldata, explains the multi-dimensional nature of gaining visibility into your running data platform and how they have architected their platform to assist in that endeavor.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder. TimescaleDB, from your friends at Timescale, is the leading open-source relational database with support for time-series data. Time-series data is time stamped so you can measure how a system is changing. Time-series data is relentless and requires a database like TimescaleDB with speed and petabyte-scale. Understand the past, monitor the present, and predict the future. That’s Timescale. Visit them today at dataengineeringpodcast.com/timescale Your host is Tobias Macey and today I’m interviewing Tristan Spaulding about Acceldata, a platform offering multidimensional data observability for modern data infrastructure

Interview

Introduction How did you get involved in the area of data? Can you describe what Acceldata is and the story behind it? What does it mean for a data observability platform to be "multidimensional"? How do the architectural characteristics of the "modern data stack" influence the requirements and implementation of data observability strategies? The data observability ecosystem has seen a lot of activity over the past ~2-3 years. What are the unique capabilities/use cases that Acceldata supports? Who are your target users and how does that focus influence the way that you have approached feature and design priorities? What are some of the ways that you are using the Acceldata platform to run Acceldata? Can you describe how the Acceldata platform is implemented?

How have the design and goals of the system changed or evolved since you started working on it?

How are you man

Getting Started with CockroachDB

"Getting Started with CockroachDB" provides an in-depth introduction to CockroachDB, a modern, distributed SQL database designed for cloud-native applications. Through this guide, you'll learn how to deploy, manage, and optimize CockroachDB to build highly reliable, scalable database solutions tailored for demanding and distributed workloads. What this Book will help me do Understand the architecture and design principles of CockroachDB and its fault-tolerant model. Learn how to set up and manage CockroachDB clusters for high availability and automatic scaling. Discover the concepts of data distribution and geo-partitioning to achieve low-latency global interactions. Explore indexing mechanisms in CockroachDB to optimize query performance for fast data retrieval. Master operational strategies, security configuration, and troubleshooting techniques for database management. Author(s) Kishen Das Kondabagilu Rajanna is an experienced software developer and database expert with a deep interest in distributed architectures. With hands-on experience working with CockroachDB and other database technologies, Kishen is passionate about sharing actionable insights with readers. His approach focuses on equipping developers with practical skills to excel in building and managing scalable, efficient database services. Who is it for? This book is ideal for software developers, database administrators, and database engineers seeking to learn CockroachDB for building robust, scalable database systems. If you're new to CockroachDB but possess basic database knowledge, this guide will equip you with the practical skills to leverage CockroachDB's capabilities effectively.

Summary When you think about selecting a database engine for your project you typically consider options focused on serving multiple concurrent users. Sometimes what you really need is an embedded database that is blazing fast for single user workloads. DuckDB is an in-process database engine optimized for OLAP applications to speed up your analytical queries that meets you where you are, whether that’s Python, R, Java, even the web. In this episode, Hannes Mühleisen, co-creator and CEO of DuckDB Labs, shares the motivations for creating the project, the myriad ways that it can be used to speed up your data projects, and the detailed engineering efforts that go into making it adaptable to any environment. This is a fascinating and humorous exploration of a truly useful piece of technology.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder. Your host is Tobias Macey and today I’m interviewing Hannes Mühleisen about DuckDB, an in-process embedded database engine for columnar analytics

Interview

Introduction How did you get involved in the area of data management? Can you describe what DuckDB is and the story behind it? Where did the name come from? What are some of the use cases that DuckDB is designed to support? The interface for DuckDB is similar (at least in spirit) to SQLite. What are the deciding factors for when to use one vs. the other?

How might they be used in concert to take advantage of their relative strengths?

What are some of the ways that DuckDB can be used to better effect than options provided by different language ecosystems? Can you describe how DuckDB is implemented?

How has the design and goals of the project changed or evolved since you began working on it? What are some of the optimizations that you have had to make in order to support performant access to data that exceeds available memory?

Can you describe a typical workflow of incorporating DuckDB into an analytical project? What are some of the libraries/tools/systems that DuckDB might replace in the scope of a project or team? What are some of the

Summary Databases are an important component of application architectures, but they are often difficult to work with. HarperDB was created with the core goal of being a developer friendly database engine. In the process they ended up creating a scalable distributed engine that works across edge and datacenter environments to support a variety of novel use cases. In this episode co-founder and CEO Stephen Goldberg shares the history of the project, how it is architected to achieve their goals, and how you can start using it today.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy. So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan. Are you looking for a structured and battle-tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all your questions? Join Pipeline Academy, the worlds first data engineering bootcamp. Learn in small groups with likeminded professionals for 9 weeks part-time to level up in your career. The course covers the most relevant and essential data and software engineering topics that enable you to start your journey as a professional data engineer or analytics engineer. Plus we have AMAs with world-class guest speakers every week! The next cohort starts in April 2022. Visit dataengineeringpodcast.com/academy and apply now! Your host is Tobias Macey and today I’m interviewing Stephen Goldberg about HarperDB, a developer-friendly distributed database engine designed to scale acros

Practical SQL, 2nd Edition

Practical SQL is an approachable and fast-paced guide to SQL (Structured Query Language), the standard programming language for defining, organizing, and exploring data in relational databases. Anthony DeBarros, a journalist and data analyst, focuses on using SQL to find the story within your data. The examples and code use the open-source database PostgreSQL and its companion pgAdmin interface, and the concepts you learn will apply to most database management systems, including MySQL, Oracle, SQLite, and others.* You’ll first cover the fundamentals of databases and the SQL language, then build skills by analyzing data from real-world datasets such as US Census demographics, New York City taxi rides, and earthquakes from US Geological Survey. Each chapter includes exercises and examples that teach even those who have never programmed before all the tools necessary to build powerful databases and access information quickly and efficiently. You’ll learn how to: •Create databases and related tables using your own data •Aggregate, sort, and filter data to find patterns •Use functions for basic math and advanced statistical operations •Identify errors in data and clean them up •Analyze spatial data with a geographic information system (PostGIS) •Create advanced queries and automate tasks This updated second edition has been thoroughly revised to reflect the latest in SQL features, including additional advanced query techniques for wrangling data. This edition also has two new chapters: an expanded set of instructions on for setting up your system plus a chapter on using PostgreSQL with the popular JSON data interchange format. Learning SQL doesn’t have to be dry and complicated. Practical SQL delivers clear examples with an easy-to-follow approach to teach you the tools you need to build and manage your own databases. * Microsoft SQL Server employs a variant of the language called T-SQL, which is not covered by Practical SQL.

Summary There are a wealth of options for managing structured and textual data, but unstructured binary data assets are not as well supported across the ecosystem. As organizations start to adopt cloud technologies they need a way to manage the distribution, discovery, and collaboration of data across their operating environments. To help solve this complicated challenge Krishna Subramanian and her co-founders at Komprise built a system that allows you to treat use and secure your data wherever it lives, and track copies across environments without requiring manual intervention. In this episode she explains the difficulties that everyone faces as they scale beyond a single operating environment, and how the Komprise platform reduces the burden of managing large and heterogeneous collections of unstructured files.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy. So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan. Your host is Tobias Macey and today I’m interviewing Krishna Subramanian about her work at Komprise to generate value from unstructured file and object data across storage formats and locations

Interview

Introduction How did you get involved in the area of data management? Can you describe what Komprise is and the story behind it? Who are the target customers of the Komprise platform?

What are the core use cases that you are focused on supporting?

How would you characterize the common approaches to managing file storage solutions for hybrid cloud environments?

What are some of the shortcomings of the enterprise storage providers’ met

Summary Building a data platform is a complex journey that requires a significant amount of planning to do well. It requires knowledge of the available technologies, the requirements of the operating environment, and the expectations of the stakeholders. In this episode Tobias Macey, the host of the show, reflects on his plans for building a data platform and what he has learned from running the podcast that is influencing his choices.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription TimescaleDB, from your friends at Timescale, is the leading open-source relational database with support for time-series data. Time-series data is time stamped so you can measure how a system is changing. Time-series data is relentless and requires a database like TimescaleDB with speed and petabyte-scale. Understand the past, monitor the present, and predict the future. That’s Timescale. Visit them today at dataengineeringpodcast.com/timescale RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder. I’m your host, Tobias Macey, and today I’m sharing the approach that I’m taking while designing a data platform

Interview

Introduction How did you get involved in the area of data management? What are the components that need to be considered when designing a solution?

Data integration (extract and load)

What are your data sources? Batch or streaming (acceptable latencies)

Data storage (lake or warehouse)

How is the data going to be used? What other tools/systems will need to integrate with it? The warehouse (Bigquery, Snowflake, Redshift) has become the focal point of the "modern data stack"

Data orchestration

Who will be managing the workflow logic?

Metadata repository

Types of metadata (catalog, lineage, access, queries, etc.)

Semantic layer/reporting Data applications

Implementation phases

Build a single end-to-end workflow of a data application using a single category of data across sources Validate the ability for an analyst/data scientist to self-serve a notebook powered analysis Iterate

Risks/unknowns

Data modeling requirements Specific implementation details as integrations acros

Analytics Optimization with Columnstore Indexes in Microsoft SQL Server: Optimizing OLAP Workloads

Meet the challenge of storing and accessing analytic data in SQL Server in a fast and performant manner. This book illustrates how columnstore indexes can provide an ideal solution for storing analytic data that leads to faster performing analytic queries and the ability to ask and answer business intelligence questions with alacrity. The book provides a complete walk through of columnstore indexing that encompasses an introduction, best practices, hands-on demonstrations, explanations of common mistakes, and presents a detailed architecture that is suitable for professionals of all skill levels. With little or no knowledge of columnstore indexing you can become proficient with columnstore indexes as used in SQL Server, and apply that knowledge in development, test, and production environments. This book serves as a comprehensive guide to the use of columnstore indexes and provides definitive guidelines. You will learn when columnstore indexes shouldbe used, and the performance gains that you can expect. You will also become familiar with best practices around architecture, implementation, and maintenance. Finally, you will know the limitations and common pitfalls to be aware of and avoid. As analytic data can become quite large, the expense to manage it or migrate it can be high. This book shows that columnstore indexing represents an effective storage solution that saves time, money, and improves performance for any applications that use it. You will see that columnstore indexes are an effective performance solution that is included in all versions of SQL Server, with no additional costs or licensing required. What You Will Learn Implement columnstore indexes in SQL Server Know best practices for the use and maintenance of analytic data in SQL Server Use metadata to fully understand the size and shape of data stored in columnstore indexes Employ optimal ways to load, maintain, and delete data from large analytic tables Know how columnstore compression saves storage, memory, and time Understand when a columnstore index should be used instead of a rowstore index Be familiar with advanced features and analytics Who This Book Is For Database developers, administrators, and architects who are responsible for analytic data, especially for those working with very large data sets who are looking for new ways to achieve high performance in their queries, and those with immediate or future challenges to analytic data and query performance who want a methodical and effective solution

What Is Distributed SQL?

Globally available resources have become the status quo. They're accessible, distributed, and resilient. Our traditional SQL database options haven't kept up. Centralized SQL databases, even those with read replicas in the cloud, put all the transactional load on a central system. The further away that a transaction happens from the user, the more the user experience suffers. If the transactional data powering the application is greatly slowed down, fast-loading web pages mean nothing. In this report, Paul Modderman, Jim Walker, and Charles Custer explain how distributed SQL fits all applications and eliminates complex challenges like sharding from traditional RDBMS systems. You'll learn how distributed SQL databases can reach global scale without introducing the consistency trade-offs found in NoSQL solutions. These databases come to life through cloud computing, while legacy databases simply can't rise to meet the elastic and ubiquitous new paradigm. You'll learn: Key concepts driving this new technology, including the CAP theorem, the Raft consensus algorithm, multiversion concurrency control, and Google Spanner How distributed SQL databases meet enterprise requirements, including management, security, integration, and Everything as a Service (XaaS) The impact that distributed SQL has already made in the telecom, retail, and gaming industries Why serverless computing is an ideal fit for distributed SQL How distributed SQL can help you expand your company's strategic plan

Summary Python has grown to be one of the top languages used for all aspects of data, from collection and cleaning, to analysis and machine learning. Along with that growth has come an explosion of tools and engines that help power these workflows, which introduces a great deal of complexity when scaling from single machines and exploratory development to massively parallel distributed computation. In answer to that challenge the Fugue project offers an interface to automatically translate across Pandas, Spark, and Dask execution environments without having to modify your logic. In this episode core contributor Kevin Kho explains how the slight differences in the underlying engines can lead to big problems, how Fugue works to hide those differences from the developer, and how you can start using it in your own work today.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye let’s data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses. Every data project starts with collecting the information that will provide answers to your questions or inputs to your models. The web is the largest trove of information on the planet and Oxylabs helps you unlock its potential. With the Oxylabs scraper APIs you can extract data from even javascript heavy websites. Combined with their residential proxies you can be sure that you’ll have reliable and high quality data whenever you need it. Go to dataengineeringpodcast.com/oxylabs today and use code DEP25 to get your special discount on residential proxies. Your host is Tobias Macey and today I’m interviewing Kevin Kho about Fugue, a library that offers a unified interface for distributed computing that lets users execute Python, pandas, and SQL code on Spark and Dask without rewrites

Interview

Introduction How did you get involved in the area of data management? Can you describe what Fugue is and the story behind it? What are the core goals of the Fugue project? Who are the target users for Fugue and how does that influence the feature priorities and API design? How does Fugue compare to projects such as Modin, etc. for abst

Summary Streaming data sources are becoming more widely available as tools to handle their storage and distribution mature. However it is still a challenge to analyze this data as it arrives, while supporting integration with static data in a unified syntax. Deephaven is a project that was designed from the ground up to offer an intuitive way for you to bring your code to your data, whether it is streaming or static without having to know which is which. In this episode Pete Goddard, founder and CEO of Deephaven shares his journey with the technology that powers the platform, how he and his team are pouring their energy into the community edition of the technology so that you can use it freely in your own work.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription StreamSets DataOps Platform is the world’s first single platform for building smart data pipelines across hybrid and multi-cloud architectures. Build, run, monitor and manage data pipelines confidently with an end-to-end data integration platform that’s built for constant change. Amp up your productivity with an easy-to-navigate interface and 100s of pre-built connectors. And, get pipelines and new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you’re up and running, your smart data pipelines are resilient to data drift. Those ongoing and unexpected changes in schema, semantics, and infrastructure. Finally, one single pane of glass for operating and monitoring all your data pipelines. The full transparency and control you desire for your data operations. Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners of the podcast that subscribe to StreamSets’ Professional Tier, receive 2 months free after their first month. Your host is Tobias Macey and today I’m interviewing Pete Goddard about his work at Deephaven, a query engine optimized for manipulating and merging streaming and static data

Interview

Introduction How did you get involved in the area of data management? Can you describe what Deephaven is and the story behind it? What is the role of Deephaven in the context of an organization’s data platform?

What are the upstream and downstream systems and teams that it is likely to be integrated with?

Who are the target users of Deephaven and how does that influence the feature priorities and design of the platform? comparison of use cases/experience with Materialize What are the different components that comprise the suite of functionality in Deephaven? How have you architected the system?

What are some of the ways t

PHP & MySQL: Novice to Ninja, 7th Edition

PHP & MySQL: Novice to Ninja, 7th Edition is a hands-on guide to learning all the tools, principles, and techniques needed to build a professional web application using PHP & MySQL. Comprehensively updated to cover PHP 8 and modern best practice, this highly practical and fun book covers everything from installation through to creating a complete online content management system. Gain a thorough understanding of PHP syntax Master database design principles and SQL Write robust, maintainable, best practice code Build a working content management system (CMS) And much more!