SQL

Teradata Cookbook

2018-02-15 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Abhinav Khandelwal , Rajsekhar Bhamidipati

Analytics DWH RDBMS Cyber Security Teradata data data-engineering relational-databases

Are you ready to master Teradata, one of the leading relational database management systems for data warehousing? In the "Teradata Cookbook," you will find over 85 recipes covering vital tasks like querying, performance tuning, and administrative operations. With clear and practical instructions, this book will equip you with the skills necessary to optimize data storage and analytics in your organization.

What this Book will help me do

Master Teradata's advanced features for efficient data warehousing applications.
Understand and employ Teradata SQL for effective data manipulation and analytics.
Explore practical solutions for Teradata administration tasks, including user and security management.
Learn performance tuning techniques to enhance the efficiency of your queries and processes.
Acquire detailed knowledge about Teradata's architecture and its unique capabilities.

Author(s)

The authors of "Teradata Cookbook" are experienced professionals in database management and data warehousing. With a deep understanding of Teradata's architecture and use in real-world applications, they bring a wealth of knowledge to each of the book's recipes. Their focus is to provide practical, actionable insights to help you tackle challenges you may face.

Who is it for?

This book is ideal for database administrators, data analysts, and professionals working with data warehousing who want to leverage the power of Teradata. Whether you are new to this database management system or looking to enhance your expertise, this cookbook provides practical solutions and in-depth insights, making it an essential resource.

TimescaleDB: Fast And Scalable Timeseries with Ajay Kulkarni and Mike Freedman - Episode 18

2018-02-11 · Data Engineering Podcast Listen

podcast_episode

by Mike Freedman (Timescale) , Ajay Kulkarni (Timescale) , Tobias Macey

Amazon RDS Azure Cloud Computing Cloudflare Data Engineering Data Management Databricks DevOps Docker ELK GCP GitHub +14 more

Summary

As communications between machines become more commonplace the need to store the generated data in a time-oriented manner increases. The market for timeseries data stores has many contenders, but they are not all built to solve the same problems or to scale in the same manner. In this episode the founders of TimescaleDB, Ajay Kulkarni and Mike Freedman, discuss how Timescale was started, the problems that it solves, and how it works under the covers. They also explain how you can start using it in your infrastructure and their plans for the future.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers. Your host is Tobias Macey and today I’m interviewing Ajay Kulkarni and Mike Freedman about TimescaleDB, a scalable timeseries database built on top of PostgreSQL.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by explaining what Timescale is and how the project got started?
The landscape of time series databases is extensive and oftentimes difficult to navigate. How do you view your position in that market and what makes Timescale stand out from the other options?
In your blog post that explains the design decisions for how Timescale is implemented you call out the fact that the inserted data is largely append only, which simplifies the index management. How does Timescale handle out of order timestamps, such as from infrequently connected sensors or mobile devices?
How is Timescale implemented and how has the internal architecture evolved since you first started working on it?

What impact has the 10.0 release of PostgreSQL had on the design of the project?
Is Timescale compatible with systems such as Amazon RDS or Google Cloud SQL?

For someone who wants to start using Timescale what is involved in deploying and maintaining it?
What are the axes for scaling Timescale and what are the points where that scalability breaks down?

Are you aware of anyone who has deployed it on top of Citus for scaling horizontally across instances?

What has been the most challenging aspect of building and marketing Timescale?
When is Timescale the wrong tool to use for time series data?
One of the use cases that you call out on your website is for systems metrics and monitoring. How does Timescale fit into that ecosystem and can it be used along with tools such as Graphite or Prometheus?
What are some of the most interesting uses of Timescale that you have seen?
Which came first, Timescale the business or Timescale the database, and what is your strategy for ensuring that the open source project and the company around it both maintain their health?
What features or improvements do you have planned for future releases of Timescale?

Contact Info

Ajay

LinkedIn @acoustik on Twitter Timescale Blog

Mike

Website LinkedIn @michaelfreedman on Twitter Timescale Blog

Timescale

Website @timescaledb on Twitter GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Timescale PostGreSQL Citus Timescale Design Blog Post MIT NYU Stanford SDN Princeton Machine Data Timeseries Data List of Timeseries Databases NoSQL Online Transaction Processing (OLTP) Object Relational Mapper (ORM) Grafana Tableau Kafka When Boring Is Awesome PostGreSQL RDS Google Cloud SQL Azure DB Docker Continuous Aggregates Streaming Replication PGPool II Kubernetes Docker Swarm Citus Data

Website Data Engineering Podcast Interview

Database Indexing B-Tree Index GIN Index GIST Index STE Energy Redis Graphite Prometheus pg_prometheus OpenMetrics Standard Proposal Timescale Parallel Copy Hadoop PostGIS KDB+ DevOps Internet of Things MongoDB Elastic DataBricks Apache Spark Confluent New Enterprise Associates MapD Benchmark Ventures Hortonworks 2σ Ventures CockroachDB Cloudflare EMC Timescale Blog: Why SQL is beating NoSQL, and what this means for the future of data

The intro and outro music is from The Hug by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug?utm_source=rss&utm_medium=rss)…

SQL Queries for Mere Mortals: A Hands-On Guide to Data Manipulation in SQL, 4th Edition

2018-02-09 · O'Reilly SQL Books O'Reilly Amazon

book

by John L. Viescas

Microsoft MySQL Oracle RDBMS SQL Server postgresql

The #1 Easy, Common-Sense Guide to SQL Queries, Updated with More Advanced Techniques and Solutions

Foreword by Keith W. Hare, Vice Chair, USA SQL Standards Committee

SQL Queries for Mere Mortals has earned worldwide praise as the clearest, simplest tutorial on writing effective queries with the latest SQL standards and database applications. Now, author John L. Viescas has updated this hands-on classic with even more advanced and valuable techniques. Step by step, Viescas guides you through creating reliable queries for virtually any current SQL-based database. He demystifies all aspects of SQL query writing, from simple data selection and filtering to joining multiple tables and modifying sets of data. Building on the basics, Viescas shows how to solve challenging real-world problems, including applying multiple complex conditions on one table, performing sophisticated logical evaluations, and using unlinked tables to think “outside the box.” In two brand-new chapters, you learn how to perform complex calculations on groups for sophisticated reporting, and how to partition data into windows for more flexible aggregation. Practice all you want with downloadable sample databases for today’s versions of Microsoft Office Access, Microsoft SQL Server, and the open source MySQL and PostgreSQL databases. Whether you’re a DBA, developer, user, or student, there’s no better way to master SQL.

Coverage includes:

Getting started: understanding what relational databases are, and ensuring that your database structures are sound
SQL basics: using SELECT statements, creating expressions, sorting information with ORDER BY, and filtering data using WHERE
Summarizing and grouping data with GROUP BY and HAVING clauses
Drawing data from multiple tables: using INNER JOIN, OUTER JOIN, and UNION operators, and working with subqueries
Modifying data sets with UPDATE, INSERT, and DELETE statements
Advanced queries: complex NOT and AND conditions, if-then-else using CASE, unlinked tables, driver tables, and more
NEW! Using advanced GROUP BY keywords to create subtotals, roll-ups, and more
NEW! Applying window functions to answer more sophisticated questions, and gain deeper insight into your data

Software-Independent Approach! If you work with database software such as Access, MS SQL Server, Oracle, DB2, MySQL, Ingres, or any other SQL-based program, this book could save you hours of time and aggravation before you write a single query!

Mastering PostgreSQL 10

2018-01-31 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Hans-Jürgen Schönig

Cyber Security data data-engineering postgresql relational-databases

Mastering PostgreSQL 10 delves into the depths of PostgreSQL development and administration, guiding readers through advanced functionalities of the database. Covering topics such as query optimization, replication, high availability, and migration, this book equips you with the skills needed to harness the full power of PostgreSQL 10.

What this Book will help me do

Learn to optimize database queries to enhance performance in PostgreSQL 10.
Understand advanced replication techniques and how to implement high availability.
Gain expertise in managing security, backups and performing data migrations effectively.
Explore query tuning and indexing strategies to speed up your database applications.
Handle troubleshooting challenges by understanding problems and their solutions.

Author(s)

The authors of Mastering PostgreSQL 10 are experts in the field of databases, with years of experience in designing, developing, and managing PostgreSQL systems. They are passionate educators dedicated to helping professionals maximize their potential with PostgreSQL. Their practical and approachable style ensures that even complex topics are clearly explained.

Who is it for?

This book is ideal for PostgreSQL data architects and administrators who want to master advanced features of PostgreSQL 10. It is best suited for individuals who have prior database administration experience and a working knowledge of SQL. Readers aiming to enhance performance and implement transformations in their PostgreSQL setups will benefit immensely. Those tasked with ensuring high availability, migration, and recovery of PostgreSQL will find this book invaluable.

MySQL 8 Cookbook

2018-01-25 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Karthik Appigatla

Linux MySQL Cyber Security data data-engineering relational-databases

With "MySQL 8 Cookbook," dive into over 150 practical recipes tailored for database professionals aiming to master MySQL 8. You will explore setup, querying, and advanced features like security and performance tuning. This book is your comprehensive guide to efficient database handling in MySQL 8.

What this Book will help me do

Efficiently set up and configure a MySQL 8 environment.
Master advanced querying techniques using new MySQL features such as CTEs and window functions.
Execute robust data backup and recovery strategies with MySQL 8.
Implement performance improvements with tools and features like descending indexes and query optimizers.
Secure, manage, and optimize databases to support scalable, high-performance applications.

Author(s)

Karthik Appigatla is a seasoned database administrator and developer with extensive expertise in MySQL and relational database systems. With years of industry experience, he brings a practical perspective to database solutions. His passion is to empower learners by simplifying complex database concepts with a hands-on approach.

Who is it for?

This book is tailored for MySQL developers or administrators who seek ready solutions for their MySQL challenges. Whether you're upgrading to MySQL 8 or want to leverage its latest features, this cookbook is for you. Ideal for those with basic Linux and SQL experience aiming to build advanced MySQL knowledge and skills.

Citus Data: Distributed PostGreSQL for Big Data with Ozgun Erdogan and Craig Kerstiens - Episode 13

2018-01-08 · Data Engineering Podcast Listen

podcast_episode

by Ozgun Erdogan (Citus Data) , Craig Kerstiens (Citus Data) , Tobias Macey

Analytics Aurora Amazon RDS Big Data CI/CD Data Engineering Data Management GitHub Linux NoSQL Data Streaming postgresql

Summary

PostgreSQL has become one of the most popular and widely used databases, and for good reason. The level of extensibility that it supports has allowed it to be used in virtually every environment. At Citus Data they have built an extension to support running it in a distributed fashion across large volumes of data with parallelized queries for improved performance. In this episode Ozgun Erdogan, the CTO of Citus, and Craig Kerstiens, Citus Product Manager, discuss how the company got started, the work that they are doing to scale out PostgreSQL, and how you can start using it in your environment.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers. Your host is Tobias Macey and today I’m interviewing Ozgun Erdogan and Craig Kerstiens about Citus, worry free PostgreSQL.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Citus is and how the project got started?
Why did you start with Postgres vs. building something from the ground up?
What was the reasoning behind converting Citus from a fork of Postgres to being an extension and releasing an open source version?
How well does Citus work with other Postgres extensions, such as PostGIS, PipelineDB, or Timescale?
How does Citus compare to options such as Postgres-XL or the Postgres compatible Aurora service from Amazon?
How does Citus operate under the covers to enable clustering and replication across multiple hosts?
What are the failure modes of Citus and how does it handle loss of nodes in the cluster?
For someone who is interested in migrating to Citus, what is involved in getting it deployed and moving the data out of an existing system?
How do the different options for leveraging Citus compare to each other and how do you determine which features to release or withhold in the open source version?
Are there any use cases that Citus enables which would be impractical to attempt in native Postgres?
What have been some of the most challenging aspects of building the Citus extension?
What are the situations where you would advise against using Citus?
What are some of the most interesting or impressive uses of Citus that you have seen?
What are some of the features that you have planned for future releases of Citus?

Contact Info

Citus Data

citusdata.com @citusdata on Twitter citusdata on GitHub

Craig

Email Website @craigkerstiens on Twitter

Ozgun

Email ozgune on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Citus Data PostGreSQL NoSQL Timescale SQL blog post PostGIS PostGreSQL Graph Database JSONB Data Type PipelineDB Timescale PostGres-XL Aurora PostGres Amazon RDS Streaming Replication CitusMX CTE (Common Table Expression) HipMunk Citus Sharding Blog Post Wal-e Wal-g Heap Analytics HyperLogLog C-Store

The intro and outro musi

Learning Google BigQuery

2017-12-22 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Eric Brown , Thirukkumaran Haridass

Analytics API Big Data BigQuery Cloud Computing Data Analytics Data Science GCP Python Tableau data data-engineering +1 more

If you're ready to tap the potential of data analytics in the cloud, 'Learning Google BigQuery' will take you from understanding foundational concepts to mastering advanced techniques of this powerful platform. Through hands-on examples, you'll learn how to query and analyze massive datasets efficiently, develop custom applications, and integrate your results seamlessly with other tools.

What this Book will help me do

Understand the fundamentals of Google Cloud Platform and how BigQuery operates within it.
Migrate enterprise-scale data seamlessly into BigQuery for further analytics.
Master SQL techniques for querying large-scale datasets in BigQuery.
Enable real-time data analytics and visualization with tools like Tableau and Python.
Learn to create dynamic datasets, manage partition tables and use BigQuery APIs effectively.

Author(s)

The authors (Berlyant, Haridass, and Brown) are specialists with years of experience in data science, big data platforms, and cloud technologies. They bring their expertise in data analytics and teaching to make advanced concepts accessible. Their hands-on approach and real-world examples ensure readers can directly apply the skills they acquire to practical scenarios.

Who is it for?

This book is tailored for developers, analysts, and data scientists eager to leverage cloud-based tools for handling and analyzing large-scale datasets. If you seek to gain hands-on proficiency in working with BigQuery or want to enhance your organization's data capabilities, this book is a fit. No prior BigQuery knowledge is needed, just a willingness to learn.

XML and JSON Recipes for SQL Server: A Problem-Solution Approach

2017-12-18 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Alex Grinberg

BI DWH JSON Microsoft Oracle SSAS SSIS XML data data-engineering storage-formats

Quickly find solutions to dozens of common problems encountered while using XML and JSON features that are built into SQL Server. Content is presented in the popular problem-solution format. Look up the problem that you want to solve. Read the solution. Apply the solution directly in your own code. Problem solved! This book shows how to take advantage of XML and JSON to share data and automate tasks. JSON is commonly used to move data back and forth between the database and front-end applications, often running in a browser. This book shows all you need to know about transforming query results into JSON format, and back again. Also covered are the processes and techniques for moving data into and out of XML format for business intelligence and other purposes, such as when transferring data from a reporting system into a data warehouse, or between different database brands such as between SQL Server and Oracle. Microsoft uses XML extensively in SQL Server and in many related products. Execution plans are generated in XML format, and this book shows you how to parse those plans and automate the detection of performance problems. The relatively new Extended Events feature writes tracing data into XML files, and the recipes in this book help in parsing those files. XML is also used in SQL Server's BI tool set, including in SSIS, SSRS, and SSAS. XML is used in many configuration files, and is even behind the construction of DDL triggers. In reading this book you’ll dive deeply into the features that allow you to build and parse XML, and also JSON, a lightweight text format used to transmit objects in a web-friendly way between a database and its front-end applications.

What You Will Learn

Build XML and JSON objects in support of automation and data transfer
Import and parse XML and JSON from operating system files
Build appropriate indexes on XML objects to improve query performance
Move data from query result sets into JSON format, and back again
Automate the detection of database performance problems by querying and parsing the database’s own execution plans
Replace external and manual JSON processes with SQL Server's internal JSON functionality

Who This Book Is For

Database administrators, .NET developers, business intelligence developers, and other professionals who want a deep and detailed skill set around working with XML and JSON in a SQL Server database environment. Web developers will particularly find the book useful for its coverage of transforming database result sets into JSON text that can be transmitted to front-end web applications.

SQL Server 2017 Administrator's Guide

2017-12-12 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Marek Chmel

Azure Cloud Computing data data-engineering microsoft-sql-server relational-databases

Dive into 'SQL Server 2017 Administrator's Guide' to master the administrative and maintenance aspects of SQL Server 2017. This comprehensive guide provides expert strategies and best practices to design, secure, and manage robust SQL Server systems effectively.

What this Book will help me do

Understand the new features and capabilities of SQL Server 2017 to enhance your database systems.
Learn step-by-step how to configure, optimize, and troubleshoot SQL Server environments for maximum performance.
Gain expertise in creating reliable backup and recovery solutions that minimize downtime and protect data.
Develop skills in securing SQL Server instances against threats and maintaining system health.
Explore integrating SQL Server 2017 with Azure and leveraging cloud capabilities for enhanced functionality.

Author(s)

The authors of 'SQL Server 2017 Administrator's Guide' are seasoned database administrators and experts in SQL Server technology. With years of practical experience, they have tackled challenges across various industries and bring a wealth of know-how to this book. They aim to provide clear, actionable guidance to help readers succeed.

Who is it for?

This book is ideal for database administrators who want to deepen their knowledge of SQL Server 2017 administration. It is especially suitable for professionals with some experience in earlier versions of SQL Server who wish to apply their skills to the latest edition. Whether you're an aspiring DBA or an experienced professional seeking to refine your strategies, this guide offers substantial value.

Exam Ref 70-765 Provisioning SQL Databases, First Edition

2017-12-06 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Joseph D'Antoni , Scott Klein

Azure Cloud Computing Microsoft data data-engineering microsoft-sql-server relational-databases transact-sql

Prepare for Microsoft Exam 70-765 and help demonstrate your real-world mastery of provisioning SQL Server databases both on premises and in SQL Azure. Designed for experienced IT professionals ready to advance their status, Exam Ref focuses on the critical thinking and decision-making acumen needed for success at the MCSA level.

Focus on the expertise measured by these objectives:
• Implement SQL in Azure
• Manage databases and instances
• Manage storage

This Microsoft Exam Ref:
• Organizes its coverage by exam objectives
• Features strategic, what-if scenarios to challenge you
• Assumes you have working knowledge of SQL Server administration and maintenance, as well as Azure skills

Provisioning SQL Databases

About the Exam

Exam 70-765 focuses on skills and knowledge for provisioning, upgrading, and configuring SQL Server; managing databases and files; and provisioning, migrating, and managing databases in the Microsoft Azure cloud.

About Microsoft Certification

Passing this exam as well as Exam 70-764: Administering a SQL Database Infrastructure earns you MCSA: SQL 2016 Database Administration certification, qualifying you for a position as a database administrator or infrastructure specialist. See full details at: microsoft.com/learning

Learning PostgreSQL 10 - Second Edition

2017-12-01 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Isabel Maria Duarte Rosa , Salahaldin Juba , Sheldon Strauch , Andrey Volkov

Python RDBMS data data-engineering postgresql relational-databases

Dive into the world of PostgreSQL 10, one of the most widely used open-source database systems. This comprehensive guide will teach you the essential features and functionalities of PostgreSQL, enabling you to develop, manage, and optimize database systems with confidence and efficiency.

What this Book will help me do

Gain a foundational understanding of relational databases and PostgreSQL.
Learn how to install, set up, and configure a PostgreSQL database system.
Master SQL query writing, data manipulation, and advanced queries with PostgreSQL.
Understand server-side programming with PL/pgSQL and define advanced schema objects.
Optimize database performance, leverage advanced data types, and connect PostgreSQL with Python applications.

Author(s)

Salahaldin Juba and Andrey Volkov are seasoned experts in database management and software development. Their extensive experience with PostgreSQL ensures that each concept is explained practically and effectively. They aim to simplify complex topics for beginners and provide tips that are valuable for practitioners at various levels.

Who is it for?

This book is ideal for students, developers, and IT professionals who are new to PostgreSQL or wish to deepen their understanding of database technology. It caters to beginners looking to acquire foundational skills and database enthusiasts aiming to master PostgreSQL functionalities. Whether you're exploring database management for the first time or refining your existing skills, this guide is tailored for your needs.

Beginning XML with C# 7: XML Processing and Data Access for C# Developers

2017-11-27 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Bipin Joshi

API C#/.NET Microsoft XML data data-engineering storage-formats

Master the basics of XML as well as the namespaces and objects you need to know in order to work efficiently with XML. You’ll learn about the extensive support for XML in everything from data access to configuration, from raw parsing to code documentation. You will see clear, practical examples that illustrate best practices in implementing XML APIs and services as part of your C#-based Windows 10 applications. Beginning XML with C# 7 is completely revised to cover the XML features of .NET Framework 4.7 using the C# 7 programming language. In this update, you’ll discover the tight integration of XML with ADO.NET and LINQ as well as additional .NET support for today’s RESTful web services and Web API. Written by a Microsoft Most Valuable Professional and developer, this book demystifies everything to do with XML and C# 7.

What You Will Learn

Discover how XML works with the .NET Framework
Read, write, access, validate, and manipulate XML documents
Transform XML with XSLT
Use XML serialization and web services
Combine XML in ADO.NET and SQL Server
Create services using Windows Communication Foundation
Work with LINQ
Use XML with Web API and more

Who This Book Is For

Those with experience in C# and .NET who are new to the nuances of using XML. Some XML experience is helpful.

Data Serialization Formats with Doug Cutting and Julien Le Dem - Episode 8

2017-11-22 · Data Engineering Podcast Listen

podcast_episode

by Doug Cutting , Julien Le Dem (Astronomer) , Tobias Macey

Arrow Avro CI/CD CSV Data Engineering Data Management GitHub Hadoop Hive Linux Parquet Presto +3 more

Summary

With the wealth of formats for sending and storing data it can be difficult to determine which one to use. In this episode Doug Cutting, creator of Avro, and Julien Le Dem, creator of Parquet, dig into the different classes of serialization formats, what their strengths are, and how to choose one for your workload. They also discuss the role of Arrow as a mechanism for in-memory data sharing and how hardware evolution will influence the state of the art for data formats.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers. This is your host Tobias Macey and today I’m interviewing Julien Le Dem and Doug Cutting about data serialization formats and how to pick the right one for your systems.

Interview

Introduction
How did you first get involved in the area of data management?
What are the main serialization formats used for data storage and analysis?
What are the tradeoffs that are offered by the different formats?
How have the different storage and analysis tools influenced the types of storage formats that are available?
You’ve each developed a new on-disk data format, Avro and Parquet respectively. What were your motivations for investing that time and effort?
Why is it important for data engineers to carefully consider the format in which they transfer their data between systems?

What are the switching costs involved in moving from one format to another after you have started using it in a production system?

What are some of the new or upcoming formats that you are each excited about?
How do you anticipate the evolving hardware, patterns, and tools for processing data to influence the types of storage formats that maintain or grow their popularity?

Contact Information

Doug:

cutting on GitHub Blog @cutting on Twitter

Julien

Email @J_ on Twitter Blog julienledem on GitHub

Links

Apache Avro
Apache Parquet
Apache Arrow
Hadoop
Apache Pig
Xerox PARC
Excite
Nutch
Vertica
Dremel White Paper
Twitter Blog on Release of Parquet
CSV
XML
Hive
Impala
Presto
Spark SQL
Brotli
ZStandard
Apache Drill
Trevni
Apache Calcite

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Exam Ref 70-767 Implementing a SQL Data Warehouse

2017-11-13 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Raj Uchhana, Jose Chinchilla

BI Data Quality DWH ETL/ELT Modern Data Stack Microsoft SSIS data data-engineering data-warehouse storage-repositories

Prepare for Microsoft Exam 70-767, and help demonstrate your real-world mastery of skills for managing data warehouses. This exam is intended for Extract, Transform, Load (ETL) data warehouse developers who create business intelligence (BI) solutions. Their responsibilities include data cleansing as well as ETL and data warehouse implementation. The reader should have experience installing and implementing a Master Data Services (MDS) model, using MDS tools, and creating a Master Data Manager database and web application. The reader should understand how to design and implement ETL control flow elements and work with a SQL Server Integration Services (SSIS) package.

Focus on the expertise measured by these objectives:
• Design, implement, and maintain a data warehouse
• Extract, transform, and load data
• Build data quality solutions

This Microsoft Exam Ref:
• Organizes its coverage by exam objectives
• Features strategic, what-if scenarios to challenge you
• Assumes you have working knowledge of relational database technology and incremental database extraction, as well as experience with designing ETL control flows, using and debugging SSIS packages, accessing and importing or exporting data from multiple sources, and managing a SQL data warehouse.

Implementing a SQL Data Warehouse

About the Exam
Exam 70-767 focuses on skills and knowledge required for working with relational database technology.

About Microsoft Certification
Passing this exam earns you credit toward a Microsoft Certified Professional (MCP) or Microsoft Certified Solutions Associate (MCSA) certification that demonstrates your mastery of data warehouse management. Passing this exam as well as Exam 70-768 (Developing SQL Data Models) earns you credit toward a Microsoft Certified Solutions Associate (MCSA) SQL 2016 Business Intelligence (BI) Development certification. See full details at: microsoft.com/learning

Pro MySQL NDB Cluster

2017-11-03 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Jesper Wisborg Krogh, Mikiya Okuno

API Java MySQL data data-engineering relational-databases

Create and run a real-time, highly-available, and high-redundancy version of the world's most popular open-source database, MySQL. You will understand the advantages and disadvantages of the MySQL NDB Cluster solution, and when MySQL NDB Cluster is the right choice. Pro MySQL NDB Cluster walks you through the full lifecycle of a MySQL Cluster installation: starting with the installation and initial configuration, moving through online configuration and schema changes, and completing with online upgrades. Along the way, you will learn to monitor your cluster, make decisions about schema design, implement geographic replication, troubleshoot and optimize performance, and much more. This book covers the many programming APIs that are supported by MySQL NDB Cluster. There's also robust coverage of connecting to MySQL NDB Cluster from Java, SQL, memcached, and even from C++. From any of these languages, you'll be able to connect and store and retrieve data as your applications demand.

The book:
Covers MySQL NDB Cluster concepts and architecture
Takes you through the MySQL NDB Cluster lifecycle from installation to upgrades
Guides you through DBA and Developer decisions when working with MySQL NDB Cluster

What You'll Learn
Understand the shared-nothing architecture behind MySQL NDB Cluster
Plan, install, and configure a MySQL NDB Cluster environment
Perform everyday tasks such as backing up, restoring, and upgrading
Develop applications from Java, memcached, C++, and SQL
Troubleshoot and resolve application performance problems
Master enterprise-level features such as the MySQL NDB Cluster Manager

Who This Book Is For
Database administrators and developers who are looking into deploying MySQL NDB Cluster, or who already have a cluster in production and want to increase their knowledge and ability to handle routine administrative tasks and troubleshooting. The book is also for developers who want to use MySQL NDB Cluster as their chosen storage engine from Java, memcached, and C++ applications.

MariaDB and MySQL Common Table Expressions and Window Functions Revealed

2017-11-01 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Daniel Bartholomew

BI MariaDB MySQL data data-engineering relational-databases

Walk away from old-fashioned and cumbersome query approaches and answer your business intelligence questions through simple and powerful queries built on common table expressions (CTEs) and window functions. These new features in MariaDB and MySQL help you to write queries without having to wade through a quagmire of brittle self-joins and other crazy techniques from the past. Your queries will generate correct results, be more readable and less brittle in the face of unexpected data, and you’ll be able to adapt them quickly in the face of changing business requirements. MariaDB and MySQL Common Table Expressions and Window Functions Revealed introduces and explains CTEs and window functions, newly available in MariaDB 10.2 and MySQL 8.0, and helps you understand why and how every MariaDB and MySQL database programmer should learn and apply these features in their daily work. CTEs and especially window functions enable easy solutions to many query challenges that in prior releases have been difficult and sometimes impossible to surmount. Mastering these features opens the door to query solutions that are more robust, execute faster, and are easier to maintain over time than prior solutions using older techniques. 
The book:
Takes you step-by-step through the workings of common table expressions and window functions
Provides easy-to-follow examples of the new syntax
Helps you answer business questions faster and easier than ever

What You'll Learn
Answer business questions using simple queries that don’t break in the face of unexpected data
Avoid writing queries that are a difficult-to-maintain quagmire of self-joins and nested subqueries
Recognize situations that call for window functions, and learn when to use these features
Reduce the need for performance-robbing self-joins
Simplify and speed the execution of analytical queries
Create queries that finish in seconds instead of hours

Who This Book Is For
Database administrators and application developers who want to quickly get up to speed on important features in MariaDB and MySQL for writing business intelligence queries. Any developer writing SQL against MariaDB and MySQL databases will benefit tremendously from the knowledge and techniques this book provides.

The Biml Book: Business Intelligence and Data Warehouse Automation

2017-10-30 · O'Reilly Business Intelligence Books O'Reilly Amazon

book

by Cathrine Wilhelmsen, Simon Peck, Reeves Smith, Benjamin Weissman, Bill Fellows, Scott Currie, Peter Avenant, Andy Leonard, Jacob Alley, Raymond Sondak, Martin Andersson

BI DevOps DWH Microsoft SSAS SSIS business-intelligence data data-science

Learn Business Intelligence Markup Language (Biml) for automating much of the repetitive, manual labor involved in data integration. We teach you how to build frameworks and use advanced Biml features to get more out of SQL Server Integration Services (SSIS), Transact-SQL (T-SQL), and SQL Server Analysis Services (SSAS) than you ever thought possible. The first part of the book starts with the basics: getting your development environment configured, Biml syntax, and scripting essentials. Whether you are a beginner or a seasoned Biml expert, the next part of the book guides you through the process of using Biml to build a framework that captures both your design patterns and execution management. Design patterns are reusable code blocks that standardize the approach you use to perform certain types of data integration, logging, and other key data functions. Design patterns solve common problems encountered when developing data integration solutions. Because you do not have to build the code from scratch each time, design patterns improve your efficiency as a Biml developer. In addition to leveraging design patterns in your framework, you will learn how to build a robust metadata store and how to package your framework into Biml bundles for deployment within your enterprise. In the last part of the book, we teach you more advanced Biml features and capabilities, such as SSAS development, T-SQL recipes, documentation autogeneration, and Biml troubleshooting.

The Biml Book:
Provides practical and applicable examples
Teaches you how to use Biml to reduce development time while improving quality
Takes you through solutions to common data integration and BI challenges

What You'll Learn
Master the basics of Business Intelligence Markup Language (Biml)
Study patterns for automating SSIS package generation
Build a Biml Framework
Import and transform database schemas
Automate generation of scripts and projects

Who This Book Is For
BI developers wishing to quickly locate previously tested solutions, Microsoft BI specialists, those seeking more information about solution automation and code generation, and practitioners of Data Integration Lifecycle Management (DILM) in the DevOps enterprise.

PHP & MySQL: Novice to Ninja, 6th Edition

2017-10-26 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Kevin Yank, Tom Butler

Linux MySQL data data-engineering relational-databases

PHP & MySQL: Novice to Ninja, 6th Edition is a hands-on guide to learning all the tools, principles, and techniques needed to build a fully functional application using PHP & MySQL. Comprehensively updated to cover PHP 7 and modern best practice, this practical and fun book covers everything from installing PHP and MySQL through to creating a complete online content management system.

You'll learn how to:
Install PHP & MySQL on Windows, Mac OS X, or Linux
Gain a thorough understanding of PHP syntax
Use object oriented programming techniques
Master database design principles and SQL
Develop robust websites that can handle high levels of traffic
Build a working content management system (CMS)
And much more!

Pandas Cookbook

2017-10-23 · O'Reilly Data Science Books O'Reilly Amazon

book

by Theodore Petrou, Kuntal Ganguly

Data Science Matplotlib Pandas Python Seaborn data data-science data-science-tools

The Pandas Cookbook offers a collection of practical recipes for mastering data manipulation, analysis, and visualization tasks using pandas. Through a methodical and hands-on approach, you will learn to utilize pandas for handling real-world datasets efficiently. By the end of this book, you will be able to solve complex data science problems and create insightful visual representations in Python.

What this Book will help me do
Understand the core functionalities of pandas 0.20 for exploring datasets effectively.
Master filtering, selecting, and transforming data for targeted analysis.
Leverage pandas' features for aggregating and transforming grouped data.
Restructure data for analysis and create professional visualizations using integration with Seaborn and Matplotlib.
Gain expertise in handling time series data and SQL-like merging operations.

Author(s)
Theodore Petrou, the author of the Pandas Cookbook, is a data scientist and Python expert with extensive experience teaching and using pandas in professional settings. Known for his practical approach, he meticulously explains each recipe and includes comprehensive examples and datasets in Jupyter notebooks to enhance your learning experience.

Who is it for?
This book is aimed at data scientists, Python developers, and analysts seeking an in-depth, practical guide to mastering data analysis with pandas. Whether you're a beginner with some knowledge of Python or an experienced analyst looking to refine your skills, this cookbook provides valuable insights and techniques for your data-driven tasks.

PostgreSQL: Up and Running, 3rd Edition

2017-10-10 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Leo S. Hsu, Regina Obe

XML data data-engineering postgresql relational-databases

Thinking of migrating to PostgreSQL? This clear, fast-paced introduction helps you understand and use this open source database system. Not only will you learn about the enterprise-class features in versions 9.5 to 10, you’ll also discover that PostgreSQL is more than a database system: it’s an impressive application platform as well. With examples throughout, this book shows you how to achieve tasks that are difficult or impossible in other databases. This third edition covers new features, such as ANSI-SQL constructs found only in proprietary databases until now; foreign data wrapper (FDW) enhancements; new full text functions and operator syntax introduced in version 9.6; XML constructs new in version 10; query parallelization features introduced in 9.6 and enhanced in 10; and built-in logical replication introduced in version 10. If you’re a current PostgreSQL user, you’ll pick up gems you may have missed before.

Learn basic administration tasks such as role management, database creation, backup, and restore
Apply the psql command-line utility and the pgAdmin graphical administration tool
Explore PostgreSQL tables, constraints, and indexes
Learn powerful SQL constructs not generally found in other databases
Use several different languages to write database functions
Tune your queries to run as fast as your hardware will allow
Query external and variegated data sources with foreign data wrappers
Learn how to use built-in replication to replicate data

