postgresql

Power BI Data Analysis and Visualization

2018-09-10 · O'Reilly Data Visualization Books O'Reilly Amazon

book

by Suren Machiraju , Suraj Gaurav

Azure BI CRM Dashboard DataViz ERP Microsoft Power BI SQL SQL Server Data Streaming business-intelligence +4 more

Power BI Data Analysis and Visualization provides a roadmap to vendor choices and highlights why Microsoft’s Power BI is a viable, cost-effective option for data visualization. The book covers the fundamentals and most commonly used features of Power BI, and also includes an in-depth discussion of advanced Power BI features such as natural language queries, embedding Power BI dashboards, and live streaming data. It discusses real solutions to extract data from the ERP application, Microsoft Dynamics CRM, and also offers ways to host the Power BI Dashboard as an Azure application, extracting data from popular data sources like Microsoft SQL Server and open-source PostgreSQL. Authored by Microsoft experts, this book uses real-world coding samples and screenshots to spotlight how to create reports, embed them in a webpage, view them across multiple platforms, and more. Business owners, IT professionals, data scientists, and analysts will benefit from this thorough presentation of Power BI and its functions.

Putting Airflow Into Production With James Meickle - Episode 43

2018-08-13 · Data Engineering Podcast Listen

podcast_episode

by James Meickle , Tobias Macey

Airflow Ansible API Astronomer AWS CloudFormation AWS Glue Data Engineering Data Management Data Science DevOps ETL/ELT +7 more

Summary

The theory behind how a tool is supposed to work and the realities of putting it into practice are often at odds with each other. Learning the pitfalls and best practices from someone who has gained that knowledge the hard way can save you from wasted time and frustration. In this episode James Meickle discusses his recent experience building a new installation of Airflow. He points out the strengths, design flaws, and areas of improvement for the framework. He also describes the design patterns and workflows that his team has built to allow them to use Airflow as the basis of their data science platform.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. Your host is Tobias Macey and today I’m interviewing James Meickle about his experiences building a new Airflow installation.

Interview

Introduction How did you get involved in the area of data management? What was your initial project requirement?

What tooling did you consider in addition to Airflow? What aspects of the Airflow platform led you to choose it as your implementation target?

Can you describe your current deployment architecture?

How many engineers are involved in writing tasks for your Airflow installation?

What resources were the most helpful while learning about Airflow design patterns?

How have you architected your DAGs for deployment and extensibility?

What kinds of tests and automation have you put in place to support the ongoing stability of your deployment? What are some of the dead-ends or other pitfalls that you encountered during the course of this project? What aspects of Airflow have you found to be lacking that you would like to see improved? What did you wish someone had told you before you started work on your Airflow installation?

If you were to start over would you make the same choice? If Airflow wasn’t available what would be your second choice?

What are your next steps for improvements and fixes?

Contact Info

@eronarn on Twitter Website eronarn on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Quantopian Harvard Brain Science Initiative DevOps Days Boston Google Maps API Cron ETL (Extract, Transform, Load) Azkaban Luigi AWS Glue Airflow Pachyderm

Podcast Interview

AirBnB Python YAML Ansible REST (Representational State Transfer) SAML (Security Assertion Markup Language) RBAC (Role-Based Access Control) Maxime Beauchemin

Medium Blog

Celery Dask

Podcast Interview

PostgreSQL

Podcast Interview

Redis Cloudformation Jupyter Notebook Qubole Astronomer

Podcast Interview

Gunicorn Kubernetes Airflow Improvement Proposals Python Enhancement Proposals (PEP)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Taking A Tour Of PostgreSQL with Jonathan Katz - Episode 42

2018-08-06 · Data Engineering Podcast Listen

podcast_episode

by Jonathan Katz (Amazon Redshift) , Tobias Macey

API Chef Data Engineering Data Management DataOps GitHub MySQL Oracle SQL

Summary

One of the longest running and most popular open source database projects is PostgreSQL. Because of its extensibility and a community focus on stability it has stayed relevant as the ecosystem of development environments and data requirements has changed and evolved over its lifetime. It is difficult to capture any single facet of this database in a single conversation, let alone the entire surface area, but in this episode Jonathan Katz does an admirable job of it. He explains how Postgres started and how it has grown over the years, highlights the fundamental features that make it such a popular choice for application developers, and describes the ongoing efforts to add the complex features needed by the demanding workloads of today’s data layer. To cap it off he reviews some of the exciting features that the community is working on building into future releases.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end-to-end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that, learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. Your host is Tobias Macey and today I’m interviewing Jonathan Katz about a high level view of PostgreSQL and the unique capabilities that it offers.

Interview

Introduction How did you get involved in the area of data management? How did you get involved in the Postgres project? For anyone who hasn’t used it, can you describe what PostgreSQL is?

Where did Postgres get started and how has it evolved over the intervening years?

What are some of the primary characteristics of Postgres that would lead someone to choose it for a given project?

What are some cases where Postgres is the wrong choice?

What are some of the common points of confusion for new users of PostgreSQL (particularly if they have prior database experience)? The recent releases of Postgres have had some fairly substantial improvements and new features. How does the community manage to balance stability and reliability against the need to add new capabilities? What are the aspects of Postgres that allow it to remain relevant in the current landscape of rapid evolution at the data layer? Are there any plans to incorporate a distributed transaction layer into the core of the project along the lines of what has been done with Citus or CockroachDB? What is in store for the future of Postgres?

Contact Info

@jkatz05 on Twitter jkatz on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

PostgreSQL Crunchy Data Venuebook Paperless Post LAMP Stack MySQL PHP SQL ORDBMS Edgar Codd A Relational Model of Data for Large Shared Data Banks Relational Algebra Oracle DB UC Berkeley Dr. Michae

Mobile Data Collection And Analysis Using Ona And Canopy With Peter Lubell-Doughtie - Episode 41

2018-07-30 · Data Engineering Podcast Listen

podcast_episode

by Peter Lubell-Doughtie (Ona) , Tobias Macey

Ansible API Chef Data Collection Data Engineering Data Management DataOps Docker Druid DWH GitHub Kafka +2 more

Summary

With the attention being paid to the systems that power large volumes of high velocity data it is easy to forget about the value of data collection at human scales. Ona is a company that is building technologies to support mobile data collection, analysis of the aggregated information, and user-friendly presentations. In this episode CTO Peter Lubell-Doughtie describes the architecture of the platform, the types of environments and use cases where it is being employed, and the value of small data.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end-to-end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that, learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. Your host is Tobias Macey and today I’m interviewing Peter Lubell-Doughtie about using Ona for collecting data and processing it with Canopy.

Interview

Introduction How did you get involved in the area of data management? What is Ona and how did the company get started?

What are some examples of the types of customers that you work with?

What types of data do you support in your collection platform? What are some of the mechanisms that you use to ensure the accuracy of the data that is being collected by users? Does your mobile collection platform allow for anyone to submit data without having to be associated with a given account or organization? What are some of the integration challenges that are unique to the types of data that get collected by mobile field workers? Can you describe the flow of the data from collection through to analysis? To help improve the utility of the data being collected you have started building Canopy. What was the tipping point where it became worth the time and effort to start that project?

What are the architectural considerations that you factored in when designing it? What have you found to be the most challenging or unexpected aspects of building an enterprise data warehouse for general users?

What are your plans for the future of Ona and Canopy?

Contact Info

Email pld on Github Website

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

OpenSRP Ona Canopy Open Data Kit Earth Institute at Columbia University Sustainable Engineering Lab WHO Bill and Melinda Gates Foundation XLSForms PostGIS Kafka Druid Superset Postgres Ansible Docker Terraform


Package Management And Distribution For Your Data Using Quilt with Kevin Moore - Episode 37

2018-06-25 · Data Engineering Podcast Listen

podcast_episode

by Kevin Moore (Quilt Data) , Tobias Macey

AI/ML Airflow API Arrow Chef Data Engineering Data Management DataOps Docker GitHub Hierarchical Data Format Hive +7 more

Summary

Collaboration, distribution, and installation of software projects is largely a solved problem, but the same cannot be said of data. Every data team has a bespoke means of sharing data sets, versioning them, tracking related metadata and changes, and publishing them for use in the software systems that rely on them. The CEO and founder of Quilt Data, Kevin Moore, was sufficiently frustrated by this problem to create a platform that attempts to be the means by which data can be as collaborative and easy to work with as GitHub and your favorite programming language. In this episode he explains how the project came to be, how it works, and the many ways that you can start using it today.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end-to-end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that, learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Kevin Moore about Quilt Data, a platform and tooling for packaging, distributing, and versioning data.

Interview

Introduction How did you get involved in the area of data management? What is the intended use case for Quilt and how did the project get started? Can you step through a typical workflow of someone using Quilt?

How does that change as you go from a single user to a team of data engineers and data scientists?

Can you describe the elements of what a data package consists of?

What were your criteria for the file formats that you chose?

How is Quilt architected and what have been the most significant changes or evolutions since you first started? How is the data registry implemented?

What are the limitations or edge cases that you have run into? What optimizations have you made to accelerate synchronization of the data to and from the repository?

What are the limitations in terms of data volume, format, or usage? What is your goal with the business that you have built around the project? What are your plans for the future of Quilt?

Contact Info

Email LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Quilt Data GitHub Jobs Reproducible Data Dependencies in Jupyter Reproducible Machine Learning with Jupyter and Quilt Allen Institute: Programmatic Data Access with Quilt Quilt Example: MissingNo Oracle Pandas Jupyter Ycombinator Data.World

Podcast Episode with CTO Bryon Jacob

Kaggle Parquet HDF5 Arrow PySpark Excel Scala Binder Merkle Tree Allen Institute for Cell Science Flask PostgreSQL Docker Airflow Quilt Teams Hive Hive Metastore PrestoDB

Podcast Episode

Netflix Iceberg Kubernetes Helm


SQL Primer: An Accelerated Introduction to SQL Basics

2018-06-15 · O'Reilly SQL Books O'Reilly Amazon

book

by Rahul Batra

Data Management Data Science SQL

Build a core level of competency in SQL so you can recognize the parts of queries and write simple SQL statements. SQL knowledge is essential for anyone involved in programming, data science, and data management. This book covers features of SQL that are standardized and common across most database vendors. You will gain a base of knowledge that will prepare you to go deeper into the specifics of any database product you might encounter. Examples in the book are worked in PostgreSQL and SQLite, but the bulk of the examples are platform agnostic and will work on any database platform supporting SQL. Early in the book you learn about table design, the importance of keys as row identifiers, and essential query operations. You then move into more advanced topics such as grouping and summarizing, creating calculated fields, joining data from multiple tables when it makes business sense to do so, and more. Throughout the book, you are exposed to a set-based approach to the language and are provided a good grounding in subtle but important topics such as the effects of null values on query results. With the explosion of data science, SQL has regained its prominence as a top skill to have for technologists and decision makers worldwide. SQL Primer will guide you from the very basics of SQL through to the mainstream features you need to have a solid, working knowledge of this important, data-oriented language. What You'll Learn Create and populate your own database tables Read SQL queries and understand what they are doing Execute queries that get correct results Bring together related rows from multiple tables Group and sort data in support of reporting applications Get a grip on nulls, normalization, and other key concepts Employ subqueries, unions, and other advanced features Who This Book Is For Anyone new to SQL who is looking for step-by-step guidance toward understanding and writing SQL queries. The book is aimed at those who encounter SQL statements often in their work, and provides a sound baseline useful across all SQL database systems. Programmers, database managers, data scientists, and business analysts all can benefit from the baseline of SQL knowledge provided in this book.

CockroachDB In Depth with Peter Mattis - Episode 35

2018-06-11 · Data Engineering Podcast Listen

podcast_episode

by Peter Mattis (Cockroach Labs) , Tobias Macey

API Cloud Computing Data Engineering Data Management Datadog Docker GDPR/CCPA GitHub Go Kubernetes NoSQL RDBMS +4 more

Summary

With the increased ease of gaining access to servers in data centers across the world has come the need for supporting globally distributed data storage. With the first wave of cloud era databases the ability to replicate information geographically came at the expense of transactions and familiar query languages. To address these shortcomings the engineers at Cockroach Labs have built a globally distributed SQL database with full ACID semantics in CockroachDB. In this episode Peter Mattis, the co-founder and VP of Engineering at Cockroach Labs, describes the architecture that underlies the database, the challenges they have faced along the way, and the ways that you can use it in your own environments today.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. For complete visibility into the health of your pipeline, including deployment tracking and powerful alerting driven by machine learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14-day trial and get a sweet new T-Shirt. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Peter Mattis about CockroachDB, the SQL database for global cloud services.

Interview

Introduction How did you get involved in the area of data management? What was the motivation for creating CockroachDB and building a business around it? Can you describe the architecture of CockroachDB and how it supports distributed ACID transactions?

What are some of the tradeoffs that are necessary to allow for georeplicated data with distributed transactions? What are some of the problems that you have had to work around in the RAFT protocol to provide reliable operation of the clustering mechanism?

Go is an unconventional language for building a database. What are the pros and cons of that choice? What are some of the common points of confusion that users of CockroachDB have when operating or interacting with it?

What are the edge cases and failure modes that users should be aware of?

I know that your SQL syntax is PostgreSQL compatible, so is it possible to use existing ORMs unmodified with CockroachDB?

What are some examples of extensions that are specific to CockroachDB?

What are some of the most interesting uses of CockroachDB that you have seen? When is CockroachDB the wrong choice? What do you have planned for the future of CockroachDB?

Contact Info

Peter

LinkedIn petermattis on GitHub @petermattis on Twitter

Cockroach Labs

@CockroachDB on Twitter Website cockroachdb on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

CockroachDB Cockroach Labs SQL Google Bigtable Spanner NoSQL RDBMS (Relational Database Management System) “Big Iron” (colloquial term for mainframe computers) RAFT Consensus Algorithm Consensus MVCC (Multiversion Concurrency Control) Isolation Etcd GDPR Golang C++ Garbage Collection Metaprogramming Rust Static Linking Docker Kubernetes CAP Theorem PostgreSQL ORM (Object Relational Mapping) Information Schema PG Catalog Interleaved Tables Vertica Spark Change Data Capture


ArangoDB: Fast, Scalable, and Multi-Model Data Storage with Jan Steeman and Jan Stücke - Episode 34

2018-06-04 · Data Engineering Podcast Listen

podcast_episode

by Jan Steeman (ArangoDB) , Jan Stücke (ArangoDB) , Tobias Macey

API Data Engineering Data Management Data Modelling GitHub JSON JSON Schema Cyber Security

Summary

Using a multi-model database in your applications can greatly reduce the amount of infrastructure and complexity required. ArangoDB is a storage engine that supports document, key/value, and graph data formats, as well as being fast and scalable. In this episode Jan Steeman and Jan Stücke explain where Arango fits in the crowded database market, how it works under the hood, and how you can start working with it today.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Jan Stücke and Jan Steeman about ArangoDB, a multi-model distributed database for graph, document, and key/value storage.

Interview

Introduction How did you get involved in the area of data management? Can you give a high level description of what ArangoDB is and the motivation for creating it?

What is the story behind the name?

How is ArangoDB constructed?

How does the underlying engine store the data to allow for the different ways of viewing it?

What are some of the benefits of multi-model data storage?

When does it become problematic?

For users who are accustomed to a relational engine, how do they need to adjust their approach to data modeling when working with Arango? How does it compare to OrientDB? What are the options for scaling a running system?

What are the limitations in terms of network architecture or data volumes?

One of the unique aspects of ArangoDB is the Foxx framework for embedding microservices in the data layer. What benefits does that provide over a three tier architecture?

What mechanisms do you have in place to prevent data breaches from security vulnerabilities in the Foxx code? What are some of the most interesting or surprising uses of this functionality that you have seen?

What are some of the most challenging technical and business aspects of building and promoting ArangoDB? What do you have planned for the future of ArangoDB?

Contact Info

Jan Steemann

jsteemann on GitHub @steemann on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

ArangoDB Köln Multi-model Database Graph Algorithms Apache 2 C++ ArangoDB Foxx Raft Protocol Target Partners RocksDB AQL (ArangoDB Query Language) OrientDB PostgreSQL OrientDB Studio Google Spanner 3-Tier Architecture Thomson-Reuters Arango Search Dell EMC Google S2 Index ArangoDB Geographic Functionality JSON Schema


PrestoDB and Starburst Data with Kamil Bajda-Pawlikowski - Episode 32

2018-05-21 · Data Engineering Podcast Listen

podcast_episode

by Kamil Bajda-Pawlikowski (Starburst Data) , Tobias Macey

Analytics API Cassandra Data Engineering Data Management DWH Hadoop Hive Kafka Presto Redis SQL +1 more

Summary

Most businesses end up with data in a myriad of places with varying levels of structure. This makes it difficult to gain insights from across departments, projects, or people. Presto is a distributed SQL engine that allows you to tie all of your information together without having to first aggregate it all into a data warehouse. Kamil Bajda-Pawlikowski co-founded Starburst Data to provide support and tooling for Presto, as well as contributing advanced features back to the project. In this episode he describes how Presto is architected, how you can use it for your analytics, and the work that he is doing at Starburst Data.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Kamil Bajda-Pawlikowski about Presto and his experiences with supporting it at Starburst Data.

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what Presto is?

What are some of the common use cases and deployment patterns for Presto?

How does Presto compare to Drill or Impala? What is it about Presto that led you to building a business around it? What are some of the most challenging aspects of running and scaling Presto? For someone who is using the Presto SQL interface, what are some of the considerations that they should keep in mind to avoid writing poorly performing queries?

How does Presto represent data for translating between its SQL dialect and the API of the data stores that it interfaces with?

What are some cases in which Presto is not the right solution? What types of support have you found to be the most commonly requested? What are some of the types of tooling or improvements that you have made to Presto in your distribution?

What are some of the notable changes that your team has contributed upstream to Presto?

Contact Info

Website E-mail Twitter – @starburstdata Twitter – @prestodb

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Starburst Data Presto Hadapt Hadoop Hive Teradata PrestoCare Cost Based Optimizer ANSI SQL Spill To Disk Tempto Benchto Geospatial Functions Cassandra Accumulo Kafka Redis PostgreSQL


PostgreSQL 10 Administration Cookbook - Fourth Edition

2018-05-18 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Simon Riggs , Gianni Ciolli

data data-engineering relational-databases

This book offers an extensive collection of practical recipes for administering PostgreSQL 10, covering everything from configuring servers to optimizing performance. By working through these structured solutions, you will develop the skills necessary to manage PostgreSQL databases effectively, making your systems reliable and responsive. What this Book will help me do Implement and leverage the latest PostgreSQL 10 features for better databases. Master techniques for performance tuning and optimization in PostgreSQL. Develop strategies for comprehensive backup and recovery processes. Learn best practices for ensuring replication and high availability. Understand how to diagnose and resolve common PostgreSQL challenges effectively. Author(s) The authors of this book are experienced database professionals with deep knowledge of PostgreSQL. They bring their practical insights and expertise to help administrators and developers achieve the most out of PostgreSQL. They are dedicated to making complex topics approachable and relevant. Who is it for? This book is for current or aspiring database administrators and developers who work with PostgreSQL. It suits those who are familiar with databases and want to gain practical skills in PostgreSQL administration. It is ideal for individuals aiming to improve performance and reliability of their PostgreSQL systems.

Practical SQL

2018-05-01 · O'Reilly SQL Books O'Reilly Amazon

book

by Anthony DeBarros

GIS Microsoft MySQL RDBMS SQL SQL Server

"Practical SQL is an approachable and fast-paced guide to SQL (Structured Query Language), the standard programming language for defining, organizing, and exploring data in relational databases. The book focuses on using SQL to find the story your data tells, with the popular open-source database PostgreSQL and the pgAdmin interface as its primary tools. You’ll first cover the fundamentals of databases and the SQL language, then build skills by analyzing data from the U.S. Census and other federal and state government agencies. With exercises and real-world examples in each chapter, this book will teach even those who have never programmed before all the tools necessary to build powerful databases and access information quickly and efficiently. You’ll learn how to: • Create databases and related tables using your own data• Define the right data types for your information• Aggregate, sort, and filter data to find patterns• Use basic math and advanced statistical functions• Identify errors in data and clean them up• Import and export data using delimited text files• Write queries for geographic information systems (GIS)• Create advanced queries and automate tasks Learning SQL doesn’t have to be dry and complicated. Practical SQL delivers clear examples with an easy-to-follow approach to teach you the tools you need to build and manage your own databases. This book uses PostgreSQL, but the SQL syntax is applicable to many database applications, including Microsoft SQL Server and MySQL."

PostgreSQL 10 High Performance - Third Edition

2018-04-30 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Enrico Pirozzi

SQL data data-engineering relational-databases

PostgreSQL 10 High Performance provides you with all the tools to maximize the efficiency and reliability of your PostgreSQL 10 database. Written for database admins and architects, this book offers deep insights into optimizing queries, configuring hardware, and managing complex setups. By integrating these best practices, you'll ensure scalability and stability in your systems.

What this Book will help me do

Optimize PostgreSQL 10 queries for improved performance and efficiency.
Implement database monitoring systems to identify and resolve issues proactively.
Scale your database by implementing partitioning, replication, and caching strategies.
Understand PostgreSQL hardware compatibility and configuration for maximum throughput.
Learn how to design high-performance solutions tailored for large and demanding applications.

Author(s)

Enrico Pirozzi is a seasoned database professional with extensive experience in PostgreSQL management and optimization. Having worked on large-scale database infrastructures, Enrico shares his hands-on knowledge and practical advice for achieving high performance with PostgreSQL. His approachable style makes complex topics accessible to every reader.

Who is it for?

This book is intended for database administrators and system architects who are working with or planning to adopt PostgreSQL 10. Readers should have a foundational knowledge of SQL and some prior exposure to PostgreSQL. If you're aiming to design efficient, scalable database solutions while ensuring high availability, this book is for you.

PostGIS Cookbook - Second Edition

2018-03-28 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Bborie Park , Paolo Corti , Thomas Kraft , Pedro Wightman , Mayra Zurbarán , Stephen Vincent Mather

Data Management GIS Python data data-engineering geographic-information-system-gis location-data postgis

PostGIS Cookbook provides a thorough introduction to working with spatial data in the PostgreSQL environment using PostGIS. The book covers topics such as importing and exporting geographic data, analyzing vector and raster data, database optimization, and building GIS web applications. By the end, you'll be equipped to fully leverage PostGIS for spatial data projects.

What this Book will help me do

Efficiently import and export geographic data between PostGIS and other platforms.
Apply PostGIS functions for advanced vector data analysis and visualization.
Manipulate and optimize spatial data for better performance and robustness.
Integrate PostGIS with Python for spatial data scripting.
Develop GIS web applications leveraging PostGIS and Open Geospatial standards.

Author(s)

The authors of PostGIS Cookbook are experienced professionals and active contributors to the spatial database community. Stephen Vincent Mather, Pedro Wightman, Thomas Kraft, and their co-authors bring extensive software engineering and geo-computing expertise to the text. Their hands-on approach ensures practicality and relevance to current technologies.

Who is it for?

This book is ideal for developers and GIS professionals who want to enhance their spatial data handling skills using PostGIS. Whether you're a beginner to spatial databases or looking to extend your PostgreSQL knowledge, this book offers practical solutions and advanced techniques for spatial data management and analysis.

Beginning PostgreSQL on the Cloud: Simplifying Database as a Service on Cloud Platforms

2018-03-19 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Avinash Vallarapu , Baji Shaik (Amazon RDS)

Azure Cloud Computing data data-engineering relational-databases

Get started with PostgreSQL on the cloud and discover the advantages, disadvantages, and limitations of the cloud services from Amazon, Rackspace, Google, and Azure. Once you have chosen your cloud service, you will focus on securing it and developing a back-up strategy for your PostgreSQL instance as part of your long-term plan. Beginning PostgreSQL on the Cloud covers other essential topics such as setting up replication and high availability; encrypting your saved cloud data; creating a connection pooler for your database; and monitoring PostgreSQL on the cloud. The book concludes by showing you how to install and configure some of the tools that will help you get started with PostgreSQL on the cloud.

This book shows you how database as a service enables you to spread your data across multiple data centers, ensuring that it is always accessible. You’ll discover that this model does not expect you to install and maintain databases yourself because the database service provider does it for you. You no longer have to worry about the scalability and high availability of your database.

What You Will Learn

Migrate PostgreSQL to the cloud
Choose the best configuration and specifications of cloud instances
Set up a backup strategy that enables point-in-time recovery
Use connection pooling and load balancing on cloud environments
Monitor database environments on the cloud

Who This Book Is For

Those who are looking to migrate to PostgreSQL on the Cloud. It will also help database administrators in setting up a cloud environment in an optimized way and help them with their day-to-day tasks.

TimescaleDB: Fast And Scalable Timeseries with Ajay Kulkarni and Mike Freedman - Episode 18

2018-02-11 · Data Engineering Podcast Listen

podcast_episode

by Mike Freedman (Timescale) , Ajay Kulkarni (Timescale) , Tobias Macey

Amazon RDS Azure Cloud Computing Cloudflare Data Engineering Data Management Databricks DevOps Docker ELK GCP GitHub +14 more

Summary

As communications between machines become more commonplace the need to store the generated data in a time-oriented manner increases. The market for timeseries data stores has many contenders, but they are not all built to solve the same problems or to scale in the same manner. In this episode the founders of TimescaleDB, Ajay Kulkarni and Mike Freedman, discuss how Timescale was started, the problems that it solves, and how it works under the covers. They also explain how you can start using it in your infrastructure and their plans for the future.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers. Your host is Tobias Macey and today I’m interviewing Ajay Kulkarni and Mike Freedman about TimescaleDB, a scalable timeseries database built on top of PostgreSQL.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by explaining what Timescale is and how the project got started?
The landscape of time series databases is extensive and oftentimes difficult to navigate. How do you view your position in that market and what makes Timescale stand out from the other options?
In your blog post that explains the design decisions for how Timescale is implemented you call out the fact that the inserted data is largely append only which simplifies the index management. How does Timescale handle out of order timestamps, such as from infrequently connected sensors or mobile devices?
How is Timescale implemented and how has the internal architecture evolved since you first started working on it?

What impact has the 10.0 release of PostgreSQL had on the design of the project? Is Timescale compatible with systems such as Amazon RDS or Google Cloud SQL?

For someone who wants to start using Timescale what is involved in deploying and maintaining it? What are the axes for scaling Timescale and what are the points where that scalability breaks down?

Are you aware of anyone who has deployed it on top of Citus for scaling horizontally across instances?

What has been the most challenging aspect of building and marketing Timescale?
When is Timescale the wrong tool to use for time series data?
One of the use cases that you call out on your website is for systems metrics and monitoring. How does Timescale fit into that ecosystem and can it be used along with tools such as Graphite or Prometheus?
What are some of the most interesting uses of Timescale that you have seen?
Which came first, Timescale the business or Timescale the database, and what is your strategy for ensuring that the open source project and the company around it both maintain their health?
What features or improvements do you have planned for future releases of Timescale?

Contact Info

Ajay

LinkedIn @acoustik on Twitter Timescale Blog

Mike

Website LinkedIn @michaelfreedman on Twitter Timescale Blog

Timescale

Website @timescaledb on Twitter GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Timescale PostGreSQL Citus Timescale Design Blog Post MIT NYU Stanford SDN Princeton Machine Data Timeseries Data List of Timeseries Databases NoSQL Online Transaction Processing (OLTP) Object Relational Mapper (ORM) Grafana Tableau Kafka When Boring Is Awesome PostGreSQL RDS Google Cloud SQL Azure DB Docker Continuous Aggregates Streaming Replication PGPool II Kubernetes Docker Swarm Citus Data

Website Data Engineering Podcast Interview

Database Indexing B-Tree Index GIN Index GIST Index STE Energy Redis Graphite Prometheus pg_prometheus OpenMetrics Standard Proposal Timescale Parallel Copy Hadoop PostGIS KDB+ DevOps Internet of Things MongoDB Elastic DataBricks Apache Spark Confluent New Enterprise Associates MapD Benchmark Ventures Hortonworks 2σ Ventures CockroachDB Cloudflare EMC Timescale Blog: Why SQL is beating NoSQL, and what this means for the future of data

The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra …

SQL Queries for Mere Mortals: A Hands-On Guide to Data Manipulation in SQL, 4th Edition

2018-02-09 · O'Reilly SQL Books O'Reilly Amazon

book

by John L. Viescas

Microsoft MySQL Oracle RDBMS SQL SQL Server

The #1 Easy, Common-Sense Guide to SQL Queries, Updated with More Advanced Techniques and Solutions

Foreword by Keith W. Hare, Vice Chair, USA SQL Standards Committee

SQL Queries for Mere Mortals has earned worldwide praise as the clearest, simplest tutorial on writing effective queries with the latest SQL standards and database applications. Now, author John L. Viescas has updated this hands-on classic with even more advanced and valuable techniques. Step by step, Viescas guides you through creating reliable queries for virtually any current SQL-based database. He demystifies all aspects of SQL query writing, from simple data selection and filtering to joining multiple tables and modifying sets of data. Building on the basics, Viescas shows how to solve challenging real-world problems, including applying multiple complex conditions on one table, performing sophisticated logical evaluations, and using unlinked tables to think “outside the box.” In two brand-new chapters, you learn how to perform complex calculations on groups for sophisticated reporting, and how to partition data into windows for more flexible aggregation. Practice all you want with downloadable sample databases for today’s versions of Microsoft Office Access, Microsoft SQL Server, and the open source MySQL and PostgreSQL databases. Whether you’re a DBA, developer, user, or student, there’s no better way to master SQL.

Coverage includes:

Getting started: understanding what relational databases are, and ensuring that your database structures are sound
SQL basics: using SELECT statements, creating expressions, sorting information with ORDER BY, and filtering data using WHERE
Summarizing and grouping data with GROUP BY and HAVING clauses
Drawing data from multiple tables: using INNER JOIN, OUTER JOIN, and UNION operators, and working with subqueries
Modifying data sets with UPDATE, INSERT, and DELETE statements
Advanced queries: complex NOT and AND conditions, if-then-else using CASE, unlinked tables, driver tables, and more
NEW! Using advanced GROUP BY keywords to create subtotals, roll-ups, and more
NEW! Applying window functions to answer more sophisticated questions, and gain deeper insight into your data

Software-Independent Approach! If you work with database software such as Access, MS SQL Server, Oracle, DB2, MySQL, Ingres, or any other SQL-based program, this book could save you hours of time and aggravation before you write a single query!

Mastering PostgreSQL 10

2018-01-31 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Hans-Jürgen Schönig

Cyber Security SQL data data-engineering relational-databases

Mastering PostgreSQL 10 delves into the depths of PostgreSQL development and administration, guiding readers through advanced functionalities of the database. Covering topics such as query optimization, replication, high availability, and migration, this book equips you with the skills needed to harness the full power of PostgreSQL 10.

What this Book will help me do

Learn to optimize database queries to enhance performance in PostgreSQL 10.
Understand advanced replication techniques and how to implement high availability.
Gain expertise in managing security, backups and performing data migrations effectively.
Explore query tuning and indexing strategies to speed up your database applications.
Handle troubleshooting challenges by understanding problems and their solutions.

Author(s)

The authors of Mastering PostgreSQL 10 are experts in the field of databases, with years of experience in designing, developing, and managing PostgreSQL systems. They are passionate educators dedicated to helping professionals maximize their potential with PostgreSQL. Their practical and approachable style ensures that even complex topics are clearly explained.

Who is it for?

This book is ideal for PostgreSQL data architects and administrators who want to master advanced features of PostgreSQL 10. It is best suited for individuals who have prior database administration experience and a working knowledge of SQL. Readers aiming to enhance performance and implement transformations in their PostgreSQL setups will benefit immensely. Those tasked with ensuring high availability, migration, and recovery of PostgreSQL will find this book invaluable.

Citus Data: Distributed PostGreSQL for Big Data with Ozgun Erdogan and Craig Kerstiens - Episode 13

2018-01-08 · Data Engineering Podcast Listen

podcast_episode

by Ozgun Erdogan (Citus Data) , Craig Kerstiens (Citus Data) , Tobias Macey

Analytics Aurora Amazon RDS Big Data CI/CD Data Engineering Data Management GitHub Linux NoSQL SQL Data Streaming

Summary

PostgreSQL has become one of the most popular and widely used databases, and for good reason. The level of extensibility that it supports has allowed it to be used in virtually every environment. At Citus Data they have built an extension to support running it in a distributed fashion across large volumes of data with parallelized queries for improved performance. In this episode Ozgun Erdogan, the CTO of Citus, and Craig Kerstiens, Citus Product Manager, discuss how the company got started, the work that they are doing to scale out PostgreSQL, and how you can start using it in your environment.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production, and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers. Your host is Tobias Macey and today I’m interviewing Ozgun Erdogan and Craig Kerstiens about Citus, worry-free PostgreSQL.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Citus is and how the project got started?
Why did you start with Postgres vs. building something from the ground up?
What was the reasoning behind converting Citus from a fork of Postgres to being an extension and releasing an open source version?
How well does Citus work with other Postgres extensions, such as PostGIS, PipelineDB, or Timescale?
How does Citus compare to options such as Postgres-XL or the Postgres-compatible Aurora service from Amazon?
How does Citus operate under the covers to enable clustering and replication across multiple hosts?
What are the failure modes of Citus and how does it handle loss of nodes in the cluster?
For someone who is interested in migrating to Citus, what is involved in getting it deployed and moving the data out of an existing system?
How do the different options for leveraging Citus compare to each other and how do you determine which features to release or withhold in the open source version?
Are there any use cases that Citus enables which would be impractical to attempt in native Postgres?
What have been some of the most challenging aspects of building the Citus extension?
What are the situations where you would advise against using Citus?
What are some of the most interesting or impressive uses of Citus that you have seen?
What are some of the features that you have planned for future releases of Citus?

Contact Info

Citus Data

citusdata.com @citusdata on Twitter citusdata on GitHub

Craig

Email Website @craigkerstiens on Twitter

Ozgun

Email ozgune on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Citus Data PostGreSQL NoSQL Timescale SQL blog post PostGIS PostGreSQL Graph Database JSONB Data Type PipelineDB Timescale PostGres-XL Aurora PostGres Amazon RDS Streaming Replication CitusMX CTE (Common Table Expression) HipMunk Citus Sharding Blog Post Wal-e Wal-g Heap Analytics HyperLogLog C-Store


Learning PostgreSQL 10 - Second Edition

2017-12-01 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Isabel Maria Duarte Rosa , Salahaldin Juba , Sheldon Strauch , Andrey Volkov

Python RDBMS SQL data data-engineering relational-databases

Dive into the world of PostgreSQL 10, one of the most widely used open-source database systems. This comprehensive guide will teach you the essential features and functionalities of PostgreSQL, enabling you to develop, manage, and optimize database systems with confidence and efficiency.

What this Book will help me do

Gain a foundational understanding of relational databases and PostgreSQL.
Learn how to install, set up, and configure a PostgreSQL database system.
Master SQL query writing, data manipulation, and advanced queries with PostgreSQL.
Understand server-side programming with PL/pgSQL and define advanced schema objects.
Optimize database performance, leverage advanced data types, and connect PostgreSQL with Python applications.

Author(s)

Salahaldin Juba and Andrey Volkov are seasoned experts in database management and software development. Their extensive experience with PostgreSQL ensures that each concept is explained practically and effectively. They aim to simplify complex topics for beginners and provide tips that are valuable for practitioners at various levels.

Who is it for?

This book is ideal for students, developers, and IT professionals who are new to PostgreSQL or wish to deepen their understanding of database technology. It caters to beginners looking to acquire foundational skills and database enthusiasts aiming to master PostgreSQL functionalities. Whether you're exploring database management for the first time or refining your existing skills, this guide is tailored for your needs.

PostgreSQL: Up and Running, 3rd Edition

2017-10-10 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Leo S. Hsu , Regina Obe

SQL XML data data-engineering relational-databases

Thinking of migrating to PostgreSQL? This clear, fast-paced introduction helps you understand and use this open source database system. Not only will you learn about the enterprise class features in versions 9.5 to 10, you’ll also discover that PostgreSQL is more than a database system: it’s an impressive application platform as well. With examples throughout, this book shows you how to achieve tasks that are difficult or impossible in other databases.

This third edition covers new features, such as ANSI-SQL constructs found only in proprietary databases until now; foreign data wrapper (FDW) enhancements; new full text functions and operator syntax introduced in version 9.6; XML constructs new in version 10; query parallelization features introduced in 9.6 and enhanced in 10; and built-in logical replication introduced in version 10. If you’re a current PostgreSQL user, you’ll pick up gems you may have missed before.

Learn basic administration tasks such as role management, database creation, backup, and restore
Apply the psql command-line utility and the pgAdmin graphical administration tool
Explore PostgreSQL tables, constraints, and indexes
Learn powerful SQL constructs not generally found in other databases
Use several different languages to write database functions
Tune your queries to run as fast as your hardware will allow
Query external and variegated data sources with foreign data wrappers
Learn how to use built-in replication to replicate data
