talk-data.com

Topic

SQL

Structured Query Language (SQL)

database_language data_manipulation data_definition programming_language

1751 tagged activities

Activity Trend

Peak of 107 activities per quarter, 2020-Q1 through 2026-Q1

Activities

1751 activities · Newest first

Pro DAX with Power BI: Business Intelligence with PowerPivot and SQL Server Analysis Services Tabular

Learn the intricate workings of DAX and the mechanics necessary to solve advanced Power BI challenges. This book is all about DAX (Data Analysis Expressions), the formula language used in Power BI—Microsoft’s leading self-service business intelligence application—and covers other products such as PowerPivot and SQL Server Analysis Services Tabular. You will learn how to leverage the advanced applications of DAX to solve complex tasks. Often a task seems complex due to a lack of understanding, or a misunderstanding, of core principles and how certain components interact with each other. The authors of this book use solutions and examples to teach you how to solve complex problems. They explain the intricate workings of important concepts such as Filter Context and Context Transition. You will learn how Power BI, through combining DAX building blocks (such as measures, table filtering, and data lineage), can yield extraordinary analytical power. Throughout Pro DAX with Power BI these building blocks are used to create and compose solutions for advanced DAX problems, so you can independently build solutions to your own complex problems and gain valuable insight from your data.

What You Will Learn
Understand the intricate workings of DAX to solve advanced problems
Deconstruct problems into manageable parts in order to create your own recipes
Apply predefined solutions for addressing problems, and link back step by step to the mechanics of DAX, to know the foundation of this powerful query language
Get fully on board with DAX, a new and evolving language, by learning best practices

Who This Book Is For
Anyone who wants to use Power BI to build advanced and complex models. Some experience writing DAX is helpful, but not essential if you have experience with other data query languages such as MDX or SQL.

Summary Data warehouses have gone through many transformations, from standard relational databases on powerful hardware, to column-oriented storage engines, to the current generation of cloud-native analytical engines. SnowflakeDB has been leading the charge to take advantage of cloud services that simplify the separation of compute and storage. In this episode Kent Graziano, chief technical evangelist for SnowflakeDB, explains how it is differentiated from other managed platforms and traditional data warehouse engines, the features that allow you to scale your usage dynamically, and how it allows for a shift in your workflow from ETL to ELT. If you are evaluating your options for building or migrating a data platform, then this is definitely worth a listen.
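
To make the ELT shift described above concrete, here is a minimal sketch in Snowflake SQL: raw JSON is landed in a VARIANT column and then transformed inside the warehouse with plain SQL. The stage, table, and field names (my_stage, raw_events, user_id, and so on) are hypothetical.

    -- Land raw, semi-structured events without pre-processing (the "E" and "L" steps).
    CREATE TABLE IF NOT EXISTS raw_events (
        payload   VARIANT,
        loaded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
    );

    COPY INTO raw_events (payload)
      FROM @my_stage/events/              -- hypothetical external stage
      FILE_FORMAT = (TYPE = 'JSON');

    -- Transform inside the warehouse (the "T" step) using SQL over the VARIANT column.
    CREATE OR REPLACE TABLE events_clean AS
    SELECT
        payload:user_id::NUMBER        AS user_id,
        payload:event_type::STRING     AS event_type,
        payload:ts::TIMESTAMP_NTZ      AS event_time
    FROM raw_events;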

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media and the Python Software Foundation. Upcoming events include the Software Architecture Conference in NYC and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Kent Graziano about SnowflakeDB, the cloud-native data warehouse.

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what SnowflakeDB is for anyone who isn’t familiar with it?

How does it compare to the other available platforms for data warehousing? How does it differ from traditional data warehouses?

How does the performance and flexibility affect the data modeling requirements?

Snowflake is one of the data stores that is enabling the shift from an ETL to an ELT workflow. What are the features that allow for that approach and what are some of the challenges that it introduces? Can you describe how the platform is architected and some of the ways that it has evolved as it has grown in popularity?

What are some of the current limitations that you are struggling with?

For someone getting started with Snowflake what is involved with loading data into the platform?

What is their workflow for allocating and scaling compute capacity and running analyses?

One of the interesting features enabled by your architecture is data sharing. What are some of the most interesting or unexpected uses of that capability that you have seen? What are some other features or use cases for Snowflake that are not as well known or publicized which you think users should know about? When is SnowflakeDB the wrong choice? What are some of the plans for the future of SnowflakeDB?

Contact Info

LinkedIn Website @KentGraziano on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

SnowflakeDB

Free Trial Stack Overflow

Data Warehouse Oracle DB MPP == Massively Parallel Processing Shared Nothing Architecture Multi-Cluster Shared Data Architecture Google BigQuery AWS Redshift AWS Redshift Spectrum Presto

Podcast Episode

SnowflakeDB Semi-Structured Data Types Hive ACID == Atomicity, Consistency, Isolation, Durability 3rd Normal Form Data Vault Modeling Dimensional Modeling JSON AVRO Parquet SnowflakeDB Virtual Warehouses CRM == Customer Relationship Management Master Data Management

Podcast Episode

FoundationDB

Podcast Episode

Apache Spark

Podcast Episode

SSIS == SQL Server Integration Services Talend Informatica Fivetran

Podcast Episode

Matillion Apache Kafka Snowpipe Snowflake Data Exchange OLTP == Online Transaction Processing GeoJSON Snowflake Documentation SnowAlert Splunk Data Catalog

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Mastering PostgreSQL 12 - Third Edition

Mastering PostgreSQL 12 delves into advanced features of PostgreSQL to help database professionals optimize, secure, and scale their database systems. Through practical examples, this book equips you with the necessary skills to address challenges in modern PostgreSQL environments.

What this Book will help me do
Gain expertise in PostgreSQL 12's advanced SQL functions and features.
Master replication and backup techniques for scalable and fault-tolerant databases.
Effectively optimize PostgreSQL queries and index utilization for performance gains.
Enhance the security of PostgreSQL servers to ensure data integrity.
Acquire hands-on experience in troubleshooting and resolving PostgreSQL-related issues.

Author(s)
Hans-Jürgen Schönig is a renowned database expert specializing in PostgreSQL. With years of experience in both database administration and development, he brings clarity to complex technical topics. His teaching approach emphasizes practical applications, making PostgreSQL's advanced features accessible for professionals.

Who is it for?
This book is ideal for PostgreSQL developers, administrators, and database professionals who have foundational knowledge and intend to enhance their expertise. Readers should be familiar with general database concepts and aim to master PostgreSQL's advanced functionalities. Whether you are handling enterprise environments or exploring data topology, this book serves as a vital resource.
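
As a small, hedged illustration of the query-optimization theme, the following PostgreSQL sketch uses invented table and column names: build an index, then check whether the planner actually uses it.

    -- Hypothetical orders table; the index targets a commonly filtered column.
    CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);

    -- Inspect the execution plan to confirm an index scan is chosen.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT order_id, total_amount
    FROM orders
    WHERE customer_id = 42;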

Expert Performance Indexing in SQL Server 2019: Toward Faster Results and Lower Maintenance

Take a deep dive into perhaps the single most important facet of good performance: indexes, and how to best use them. Recent updates to SQL Server have made it possible to create indexes in situations that in the past would have prevented their use. Other improvements covered in this book include new dynamic management views, the ability to pause and resume index maintenance, and the ability to more easily recover from failures during index creation and maintenance operations. This new edition also brings new content around the indexing of columnstore and in-memory tables, showing how these new types of tables and the queries that execute against them can also benefit from good indexing practices. The book begins with explanations of the types of indexes and how they are stored in databases. Moving deeper into the topic, and further into the book, you will look at the statistics that are accumulated both by indexes and on indexes. You will better understand what indexes are doing in the database and what can be done to mitigate and improve their effect on performance. You will get a look at the Index Advisor now available in Azure SQL Database, and learn how to review and maintain the health of your indexes. The final chapters present a guided tour through a number of scenarios showing approaches you can take to investigate, mitigate, and improve the performance of your database.

What You Will Learn
Properly index row store, columnstore, and in-memory tables
Review statistics to understand indexing choices made by the optimizer
Apply indexing strategies such as covering indexes, included columns, and index intersections
Recognize and remove unnecessary indexes
Design effective indexes for full-text, spatial, and XML data types
Manage the big picture: encompass all indexes in a database, and all database instances on a server

Who This Book Is For
Database administrators and developers who are ready to lift the performance of their database environment by thoughtfully building indexes to speed up queries that matter the most and make a difference to the business
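
A minimal T-SQL sketch of the covering-index and resumable-maintenance ideas mentioned above; the table, columns, and index name are invented, and RESUMABLE = ON assumes a SQL Server version and edition that support resumable online index operations.

    -- Covering index: the key column supports the seek, INCLUDE columns avoid lookups.
    CREATE NONCLUSTERED INDEX IX_Sales_CustomerID
        ON dbo.Sales (CustomerID)
        INCLUDE (OrderDate, TotalDue)
        WITH (ONLINE = ON, RESUMABLE = ON);

    -- While a resumable operation is in progress it can be paused and later resumed
    -- instead of being restarted from scratch.
    ALTER INDEX IX_Sales_CustomerID ON dbo.Sales PAUSE;
    ALTER INDEX IX_Sales_CustomerID ON dbo.Sales RESUME;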

SQL Server Big Data Clusters: Early First Edition Based on Release Candidate 1

Get a head-start on learning one of SQL Server 2019’s latest and most impactful features—Big Data Clusters—that combines large volumes of non-relational data for analysis along with data stored relationally inside a SQL Server database. This book provides a first look at Big Data Clusters based upon SQL Server 2019 Release Candidate 1. Start now and get a jump on your competition in learning this important new feature. Big Data Clusters is a feature set covering data virtualization, distributed computing, and relational databases and provides a complete AI platform across the entire cluster environment. This book shows you how to deploy, manage, and use Big Data Clusters. For example, you will learn how to combine data stored on the HDFS file system together with data stored inside the SQL Server instances that make up the Big Data Cluster. Filled with clear examples and use cases, this book provides everything necessary to get started working with Big Data Clusters in SQL Server 2019 using Release Candidate 1. You will learn about the architectural foundations that are made up from Kubernetes, Spark, HDFS, and SQL Server on Linux. You then are shown how to configure and deploy Big Data Clusters in on-premises environments or in the cloud. Next, you are taught about querying. You will learn to write queries in Transact-SQL—taking advantage of skills you have honed for years—and with those queries you will be able to examine and analyze data from a wide variety of sources such as Apache Spark. Through the theoretical foundation provided in this book and easy-to-follow example scripts and notebooks, you will be ready to use and unveil the full potential of SQL Server 2019: combining different types of data spread across widely disparate sources into a single view that is useful for business intelligence and machine learning analysis.

What You Will Learn
Install, manage, and troubleshoot Big Data Clusters in cloud or on-premises environments
Analyze large volumes of data directly from SQL Server and/or Apache Spark
Manage data stored in HDFS from SQL Server as if it were relational data
Implement advanced analytics solutions through machine learning and AI
Expose different data sources as a single logical source using data virtualization

Who This Book Is For
Data engineers, data scientists, data architects, and database administrators who want to employ data virtualization and big data analytics in their environment
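
As a rough sketch of how HDFS-resident data might be queried next to relational tables, the following T-SQL defines an external table; the data source location, file format, table names, and columns are placeholders, and the exact options depend on how the Big Data Cluster or PolyBase environment is configured.

    -- Hypothetical external file format and data source pointing at HDFS storage.
    CREATE EXTERNAL FILE FORMAT ParquetFormat
        WITH (FORMAT_TYPE = PARQUET);

    CREATE EXTERNAL DATA SOURCE HdfsStorage
        WITH (LOCATION = 'sqlhdfs://controller-svc/default');  -- placeholder location

    CREATE EXTERNAL TABLE dbo.WebClicks_Hdfs (
        ClickId    BIGINT,
        CustomerId BIGINT,
        Url        NVARCHAR(400),
        ClickedAt  DATETIME2
    )
    WITH (LOCATION = '/clickstream/', DATA_SOURCE = HdfsStorage, FILE_FORMAT = ParquetFormat);

    -- Join the external (HDFS) data with an ordinary relational table.
    SELECT c.CustomerName, COUNT(*) AS Clicks
    FROM dbo.WebClicks_Hdfs AS w
    JOIN dbo.Customers      AS c ON c.CustomerId = w.CustomerId
    GROUP BY c.CustomerName;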

Summary With the constant evolution of technology for data management it can seem impossible to make an informed decision about whether to build a data warehouse, or a data lake, or just leave your data wherever it currently rests. What’s worse is that any time you have to migrate to a new architecture, all of your analytical code has to change too. Thankfully it’s possible to add an abstraction layer to eliminate the churn in your client code, allowing you to evolve your data platform without disrupting your downstream data users. In this episode AtScale co-founder and CTO Matthew Baird describes how the data virtualization and data engineering automation capabilities that are built into the platform free up your engineers to focus on your business needs without having to waste cycles on premature optimization. This was a great conversation about the power of abstractions and appreciating the value of increasing the efficiency of your data team.
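
Setting AtScale's own mechanics aside, one way to picture an abstraction layer in plain SQL is a view that downstream code queries, so the physical table behind it can move or be rewritten without touching consumers; all names below are illustrative only.

    -- Consumers query the stable view, never the physical table directly.
    CREATE VIEW analytics.daily_revenue AS
    SELECT order_date, SUM(total_amount) AS revenue
    FROM warehouse.orders_v2            -- the physical table can change behind the view
    GROUP BY order_date;

    -- Client code stays the same even if warehouse.orders_v2 is later replaced.
    SELECT order_date, revenue
    FROM analytics.daily_revenue
    WHERE order_date >= DATE '2019-01-01';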

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! This week’s episode is also sponsored by Datacoral, an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more. Having all of your logs and event data in one place makes your life easier when something breaks, unless that something is your Elastic Search cluster because it’s storing too much data. CHAOSSEARCH frees you from having to worry about data retention, unexpected failures, and expanding operating costs. They give you a fully managed service to search and analyze all of your logs in S3, entirely under your control, all for half the cost of running your own Elastic Search cluster or using a hosted platform. Try it out for yourself at dataengineeringpodcast.com/chaossearch and don’t forget to thank them for supporting the show! You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Matt Baird about AtScale, a platform that

Interview

Introduction How did you get involved in the area of data management? Can you start by describing the AtScale platform and how it fits in the ecosystem of data tools? What was your motivation for building the platform and what were some of the early challenges that you faced in achieving your current level of success? How is the AtScale platform architected and what have been some of the main areas of evolution and change since you first began building it?

How has the surrounding data ecosystem changed since AtScale was founded? How are current industry trends influencing your product focus?

Can you talk through the workflow for someone implementing AtScale? What are some of the main use cases that benefit from data virtualization capabilities?

How does it influence the relevancy of data warehouses or data lakes?

What are some of the types of tools or patterns that AtScale replaces in a data platform? What are some of the most interesting or unexpected ways that you have seen AtScale used? What have been some of the most challenging aspects of building and growing the platform? When is AtScale the wrong choice? What do you have planned for the future of the platform and business?

Contact Info

LinkedIn @zetty on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.init to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

AtScale PeopleSoft Oracle Hadoop PrestoDB Impala Apache Kylin Apache Druid Go Language Scala

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

T-SQL Window Functions: For data analysis and beyond, 2nd Edition

Use window functions to write simpler, better, more efficient T-SQL queries. Most T-SQL developers recognize the value of window functions for data analysis calculations. But they can do far more, and recent optimizations make them even more powerful. In T-SQL Window Functions, renowned T-SQL expert Itzik Ben-Gan introduces breakthrough techniques for using them to handle many common T-SQL querying tasks with unprecedented elegance and power. Using extensive code examples, he guides you through window aggregate, ranking, distribution, offset, and ordered set functions. You'll find a detailed section on optimization, plus an extensive collection of business solutions — including novel techniques available in no other book.

Microsoft MVP Itzik Ben-Gan shows how to:
• Use window functions to improve queries you previously built with predicates
• Master essential SQL windowing concepts, and efficiently design window functions
• Effectively utilize partitioning, ordering, and framing
• Gain practical in-depth insight into window aggregate, ranking, offset, and statistical functions
• Understand how the SQL standard supports ordered set functions, and find working solutions for functions not yet available in the language
• Preview advanced Row Pattern Recognition (RPR) data analysis techniques
• Optimize window functions in SQL Server and Azure SQL Database, making the most of indexing, parallelism, and more
• Discover a full library of window function solutions for common business problems

About This Book
• For developers, DBAs, data analysts, data scientists, BI professionals, and power users familiar with T-SQL queries
• Addresses any edition of the SQL Server 2019 database engine or later, as well as Azure SQL Database

Get all code samples at: MicrosoftPressStore.com/TSQLWindowFunctions/downloads
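
A minimal, hedged T-SQL illustration of the windowing style the book covers, with invented table and column names: a running total and a ranking computed in a single pass, with no self-joins.

    SELECT
        CustomerId,
        OrderDate,
        Amount,
        -- Running total per customer, ordered by date; an explicit ROWS frame avoids RANGE ties.
        SUM(Amount) OVER (PARTITION BY CustomerId
                          ORDER BY OrderDate
                          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS RunningTotal,
        -- Rank each customer's orders by value.
        ROW_NUMBER() OVER (PARTITION BY CustomerId ORDER BY Amount DESC) AS AmountRank
    FROM dbo.Orders;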

Summary The practice of data management is one that requires technical acumen, but there are also many policy and regulatory issues that inform and influence the design of our systems. With the introduction of legal frameworks such as the EU GDPR and California’s CCPA it is necessary to consider how to implement data protection and data privacy principles in the technical and policy controls that govern our data platforms. In this episode Karen Heaton and Mark Sherwood-Edwards share their experience and expertise in helping organizations achieve compliance. Even if you aren’t subject to specific rules regarding data protection it is definitely worth listening to get an overview of what you should be thinking about while building and running data pipelines.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! This week’s episode is also sponsored by Datacoral, an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more. Having all of your logs and event data in one place makes your life easier when something breaks, unless that something is your Elastic Search cluster because it’s storing too much data. CHAOSSEARCH frees you from having to worry about data retention, unexpected failures, and expanding operating costs. They give you a fully managed service to search and analyze all of your logs in S3, entirely under your control, all for half the cost of running your own Elastic Search cluster or using a hosted platform. Try it out for yourself at dataengineeringpodcast.com/chaossearch and don’t forget to thank them for supporting the show! You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Karen Heaton and Mark Sherwood-Edwards about the idea of data protection, why you might need it, and how to include the principles in your data pipelines.

Interview

Introduction How did you get involved in the area of data management?

Summary As data engineers the health of our pipelines is our highest priority. Unfortunately, there are countless ways that our dataflows can break or degrade that have nothing to do with the business logic or data transformations that we write and maintain. Sean Knapp founded Ascend to address the operational challenges of running a production grade and scalable Spark infrastructure, allowing data engineers to focus on the problems that power their business. In this episode he explains the technical implementation of the Ascend platform, the challenges that he has faced in the process, and how you can use it to simplify your dataflow automation. This is a great conversation to get an understanding of all of the incidental engineering that is necessary to make your data reliable.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! This week’s episode is also sponsored by Datacoral, an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com today to find out more. Having all of your logs and event data in one place makes your life easier when something breaks, unless that something is your Elastic Search cluster because it’s storing too much data. CHAOSSEARCH frees you from having to worry about data retention, unexpected failures, and expanding operating costs. They give you a fully managed service to search and analyze all of your logs in S3, entirely under your control, all for half the cost of running your own Elastic Search cluster or using a hosted platform. Try it out for yourself at dataengineeringpodcast.com/chaossearch and don’t forget to thank them for supporting the show! You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Sean Knapp about Ascend, which he is billing as an autonomous dataflow service

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what the Ascend platform is?

Oracle Database Application Security: With Oracle Internet Directory, Oracle Access Manager, and Oracle Identity Manager

Focus on the security aspects of designing, building, and maintaining a secure Oracle Database application. Starting with data encryption, you will learn to work with transparent data, back-up, and networks. You will then go through the key principles of audits, where you will get to know more about identity preservation, policies and fine-grained audits. Moving on to virtual private databases, you’ll set up and configure a VPD to work in concert with other security features in Oracle, followed by tips on managing configuration drift, profiles, and default users. Shifting focus to coding, you will take a look at secure coding standards, multi-schema database models, code-based access control, and SQL injection. Finally, you’ll cover single sign-on (SSO), and will be introduced to Oracle Internet Directory (OID), Oracle Access Manager (OAM), and Oracle Identity Management (OIM) by installing and configuring them to meet your needs. Oracle databases hold the majority of the world’s relational data, and are attractive targets for attackers seeking high-value targets for data theft. Compromise of a single Oracle Database can result in tens of millions of breached records costing millions in breach-mitigation activity. This book gets you ready to avoid that nightmare scenario.

What You Will Learn
Work with Oracle Internet Directory using the command line and the console
Integrate Oracle Access Manager with different applications
Work with the Oracle Identity Manager console and connectors, while creating your own custom one
Troubleshoot issues with OID, OAM, and OIM
Dive deep into file system and network security concepts

Who This Book Is For
Oracle DBAs and developers. Readers will need a basic understanding of Oracle RDBMS and Oracle Application Server to take complete advantage of this book.
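
As one hedged illustration of the secure-coding topics above, this PL/SQL fragment contrasts an injection-prone dynamic statement with a bind-variable version; the employees table and the variable values are assumed purely for the example.

    DECLARE
        p_email  employees.email%TYPE := 'alice@example.com';  -- pretend this came from user input
        v_salary employees.salary%TYPE;
    BEGIN
        -- Vulnerable: the input is concatenated straight into the statement text.
        EXECUTE IMMEDIATE
            'SELECT salary FROM employees WHERE email = ''' || p_email || ''''
            INTO v_salary;

        -- Safer: a bind variable keeps the input as data, not executable SQL text.
        EXECUTE IMMEDIATE
            'SELECT salary FROM employees WHERE email = :email'
            INTO v_salary USING p_email;
    END;
    /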

Summary Despite the fact that businesses have relied on useful and accurate data to succeed for decades now, the state of the art for obtaining and maintaining that information still leaves much to be desired. In an effort to create a better abstraction for building data applications Nick Schrock created Dagster. In this episode he explains his motivation for creating a product for data management, how the programming model simplifies the work of building testable and maintainable pipelines, and his vision for the future of data programming. If you are building dataflows then Dagster is definitely worth exploring.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! This week’s episode is also sponsored by Datacoral, an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more. You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Nick Schrock about Dagster, an open source system for building modern data applications

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what Dagster is and the origin story for the project? In the tagline for Dagster you describe it as "a system for building modern data applications". There are a lot of contending terms that one might use in this context, such as ETL, data pipelines, etc. Can you describe your thinking as to what the term "data application" means, and the types of use cases that Dagster is well suited for? Can you talk through how Dagster is architected and some of the ways that it has evolved since you first began working on it?

What do you see as the current industry trends that are leading us away from full stack frameworks such as Airflow and Oozie for ETL and into an abstracted programming environment that is composable with different execution contexts? What are some of the initial assumptions that you had going into the project?

Summary The scale and complexity of the systems that we build to satisfy business requirements is increasing as the available tools become more sophisticated. In order to bridge the gap between legacy infrastructure and evolving use cases it is necessary to create a unifying set of components. In this episode Dipti Borkar explains how the emerging category of data orchestration tools fills this need, some of the existing projects that fit in this space, and some of the ways that they can work together to simplify projects such as cloud migration and hybrid cloud environments. It is always useful to get a broad view of new trends in the industry and this was a helpful perspective on the need to provide mechanisms to decouple physical storage from computing capacity.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! This week’s episode is also sponsored by Datacoral, an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more. You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Dipti Borkar about data orchestration and how it helps in migrating data workloads to the cloud.

Interview

Introduction How did you get involved in the area of data management? Can you start by describing what you mean by the term "Data Orchestration"?

How does it compare to the concept of "Data Virtualization"? What are some of the tools and platforms that fit under that umbrella?

What are some of the motivations for organizations to use the cloud for their data oriented workloads?

What are they giving up by using cloud resources in place of on-premises compute?

For businesses that have invested heavily in their own datacenters, what are some ways that they can begin to replicate some of the benefits of cloud environments? What are some of the common patterns for cloud migration projects and what challenges do they present?

Do you have advice on useful metrics to track for determining project completion or success criteria?

How do businesses approach employee education for designing and implementing effective systems for achieving their migration goals? Can you talk through some of the ways that different data orchestration tools can be composed together for a cloud migration effort?

What are some of the common pain points that organizations encounter when working on hybrid implementations?

What are some of the missing pieces in the data orchestration landscape?

Are there any efforts that you are aware of that are aiming to fill those gaps?

Where is the data orchestration market heading, and what are some industry trends that are driving it?

What projects are you most interested in or excited by?

For someone who wants to learn more about data orchestration and the benefits the technologies can provide, what are some resources that you would recommend?

Contact Info

LinkedIn @dborkar on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.init to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

Alluxio

Podcast Episode

UC San Diego Couchbase Presto

Podcast Episode

Spark SQL Data Orchestration Data Virtualization PyTorch

Podcast.init Episode

Rook storage orchestration PySpark MinIO

Podcast Episode

Kubernetes Openstack Hadoop HDFS Parquet Files

Podcast Episode

ORC Files Hive Metastore Iceberg Table Format

Podcast Episode

Data Orchestration Summit Star Schema Snowflake Schema Data Warehouse Data Lake Teradata

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Expert T-SQL Window Functions in SQL Server 2019: The Hidden Secret to Fast Analytic and Reporting Queries

Become an expert who can use window functions to solve T-SQL query problems. Replace slow cursors and self-joins with queries that are easy to write and perform better. This new edition provides expanded examples, including a chapter from the world of sports, and covers the latest performance enhancements through SQL Server 2019. Window functions are useful in analytics and business intelligence reporting. They came into full blossom with SQL Server 2012, yet they are not as well known and used as often as they ought to be. This group of functions is one of the most notable developments in SQL, and this book shows how every developer and DBA can benefit from their expressive power in solving day-to-day business problems. Once you begin using window functions, such as ROW_NUMBER and LAG, you will discover many ways to use them. You will approach SQL Server queries in a different way, thinking about sets of data instead of individual rows. Your queries will run faster, be easier to write, and easier to deconstruct, maintain, and enhance in the future. Just knowing and using these functions is not enough. You also need to understand how to tune the queries. Expert T-SQL Window Functions in SQL Server clearly explains how to get the best performance. The book also covers the rare cases when older techniques are the best bet.

What You Will Learn
Solve complex query problems without cumbersome self-joins that run slowly and are difficult to read
Create sliding windows in a result set for computations such as running totals and moving averages
Return aggregate and detail data simultaneously from the same SELECT statement
Compute lag and lead and other values that access data from multiple rows in a result set
Understand the OVER clause syntax and how to control the window
Avoid framing errors that can lead to unexpected results

Who This Book Is For
Anyone who writes T-SQL queries, including database administrators, developers, business analysts, and data scientists. Before reading this book, you should understand how to join tables, write WHERE clauses, and build aggregate queries.
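
Here is a short hedged T-SQL sketch of LAG and an explicit ROWS frame, with made-up table and column names; the explicit frame is one way to avoid the framing surprises the description mentions.

    SELECT
        SaleDate,
        Amount,
        -- Previous day's amount without a self-join.
        LAG(Amount, 1, 0) OVER (ORDER BY SaleDate) AS PrevAmount,
        -- Three-row moving average; the explicit ROWS frame makes the window unambiguous.
        AVG(Amount) OVER (ORDER BY SaleDate
                          ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS MovingAvg3
    FROM dbo.DailySales;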

Pro SQL Server 2019 Administration: A Guide for the Modern DBA

Use this comprehensive guide for the SQL Server DBA, covering all that practicing database administrators need to know to get their daily work done. Updated for SQL Server 2019, this edition includes coverage of new features such as Memory-optimized TempDB Metadata, and Always Encrypted with Secure Enclaves. Other new content includes coverage of Query Store, resumable index operations, installation on Linux, and containerized SQL. Pro SQL Server 2019 Administration takes DBAs on a journey that begins with planning their SQL Server deployment and runs through installing and configuring the instance, administering and optimizing database objects, and ensuring that data is secure and highly available. Finally, readers will learn how to perform advanced maintenance and tuning techniques. This book teaches you to make the most of new SQL Server 2019 functionality, including Data Discovery and Classification. The book promotes best-practice installation, shows how to configure for scalability and high workloads, and demonstrates the gamut of database-level maintenance tasks such as index maintenance, database consistency checks, and table optimizations.

What You Will Learn
Install and configure SQL Server on Windows through the GUI and with PowerShell
Install and configure SQL Server on Linux and in containers
Optimize tables through in-memory OLTP, table partitioning, and the creation of indexes
Secure and encrypt data to protect against embarrassing data breaches
Ensure 24x7x365 access through high-availability and disaster recovery features
Back up your data to ensure against loss, and recover data when needed
Perform routine maintenance tasks such as database consistency checks
Troubleshoot and solve performance problems in SQL queries and in the database engine

Who This Book Is For
SQL Server DBAs who manage on-premises installations of SQL Server. This book is also useful for DBAs who wish to learn advanced features such as Query Store, Extended Events, Distributed Replay, and Policy-Based Management, or those who need to install SQL Server in a variety of environments.
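
A few routine-maintenance commands of the kind the book walks through, sketched in T-SQL with placeholder database, table, index, and path names; options and thresholds will vary by environment.

    -- Verify database integrity.
    DBCC CHECKDB (N'SalesDb') WITH NO_INFOMSGS;

    -- Rebuild a fragmented index online (an edition that supports ONLINE = ON is assumed).
    ALTER INDEX IX_Sales_CustomerID ON dbo.Sales REBUILD WITH (ONLINE = ON);

    -- Take a compressed full backup with checksums.
    BACKUP DATABASE SalesDb
        TO DISK = N'/var/opt/mssql/backups/SalesDb.bak'   -- placeholder path
        WITH COMPRESSION, CHECKSUM;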

SAS Certified Professional Prep Guide

The official guide by the SAS Global Certification Program, SAS Certified Professional Prep Guide: Advanced Programming Using SAS 9.4 prepares you to take the new SAS 9.4 Advanced Programming Performance-Based Exam. New in this edition is a workbook whose sample scenarios require you to write code to solve problems and answer questions. Answers to the chapter quizzes and solutions to the sample scenarios in the workbook are included. You will also find links to exam objectives, practice exams, and other resources such as the Base SAS Glossary and a list of practice data sets. Major topics include SQL processing, SAS macro language processing, and advanced SAS programming techniques. All exam topics are covered in the following chapters:

SQL Processing with SAS
PROC SQL Fundamentals
Creating and Managing Tables
Joining Tables Using PROC SQL
Joining Tables Using Set Operators
Using Subqueries
Advanced SQL Techniques

SAS Macro Language Processing
Creating and Using Macro Variables
Storing and Processing Text
Working with Macro Programs
Advanced Macro Techniques

Advanced SAS Programming Techniques
Defining and Processing Arrays
Processing Data Using Hash Objects
Using SAS Utility Procedures
Using Advanced Functions

Practice Programming Scenarios (Workbook)
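
Since PROC SQL is the first major exam topic, here is a tiny hedged example of ordinary SQL embedded in a SAS PROC SQL step; the libref and dataset names are invented.

    proc sql;
        /* Join two hypothetical tables and summarize, using ordinary SQL inside SAS. */
        create table work.sales_summary as
        select c.region,
               sum(o.amount) as total_sales
        from mylib.customers as c
             inner join mylib.orders as o
             on c.customer_id = o.customer_id
        group by c.region
        order by total_sales desc;
    quit;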

SQL Server 2019 Revealed: Including Big Data Clusters and Machine Learning

Get up to speed on the game-changing developments in SQL Server 2019. No longer just a database engine, SQL Server 2019 is cutting edge with support for machine learning (ML), big data analytics, Linux, containers, Kubernetes, Java, and data virtualization to Azure. This is not a book on traditional database administration for SQL Server. It focuses on all that is new for one of the most successful modernized data platforms in the industry. It is a book for data professionals who already know the fundamentals of SQL Server and want to up their game by building their skills in some of the hottest new areas in technology. SQL Server 2019 Revealed begins with a look at the project team’s goal to integrate the world of big data with SQL Server into a major product release. The book then dives into the details of key new capabilities in SQL Server 2019 using a “learn by example” approach for Intelligent Performance, security, mission-critical availability, and features for the modern developer. Also covered are enhancements to SQL Server 2019 for Linux, along with a comprehensive look at SQL Server using containers and Kubernetes clusters. The book concludes by showing you how to virtualize your data access with PolyBase to Oracle, MongoDB, Hadoop, and Azure, allowing you to reduce the need for expensive extract, transform, and load (ETL) applications. You will then learn how to take your knowledge of containers, Kubernetes, and PolyBase to build a comprehensive solution called Big Data Clusters, which is a marquee feature of 2019. You will also learn how to gain access to Spark, SQL Server, and HDFS to build intelligence over your own data lake and deploy end-to-end machine learning applications.

What You Will Learn
Implement Big Data Clusters with SQL Server, Spark, and HDFS
Create a Data Hub with connections to Oracle, Azure, Hadoop, and other sources
Combine SQL and Spark to build a machine learning platform for AI applications
Boost your performance with no application changes using Intelligent Performance
Increase security of your SQL Server through Secure Enclaves and Data Classification
Maximize database uptime through online indexing and Accelerated Database Recovery
Build new modern applications with Graph, ML Services, and T-SQL Extensibility with Java
Improve your ability to deploy SQL Server on Linux
Gain in-depth knowledge to run SQL Server with containers and Kubernetes
Know all the new database engine features for performance, usability, and diagnostics
Use the latest tools and methods to migrate your database to SQL Server 2019
Apply your knowledge of SQL Server 2019 to Azure

Who This Book Is For
IT professionals and developers who understand the fundamentals of SQL Server and wish to focus on learning about the new, modern capabilities of SQL Server 2019. The book is for those who want to learn about SQL Server 2019 and the new Big Data Clusters and AI feature set, support for machine learning and Java, how to run SQL Server with containers and Kubernetes, and increased capabilities around Intelligent Performance, advanced security, and high availability.

Summary Managing a data warehouse can be challenging, especially when trying to maintain a common set of patterns. Dataform is a platform that helps you apply engineering principles to your data transformations and table definitions, including unit testing SQL scripts, defining repeatable pipelines, and adding metadata to your warehouse to improve your team’s communication. In this episode CTO and co-founder of Dataform Lewis Hemens joins the show to explain his motivation for creating the platform and company, how it works under the covers, and how you can start using it today to get your data warehouse under control.
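
Dataform's own configuration format aside, the underlying idea of testing warehouse tables can be sketched as an assertion query: a SQL statement that should return zero rows when the data meets expectations. The schema, table, and column names here are illustrative only.

    -- Assertion-style data test: any rows returned indicate a problem.
    SELECT order_id
    FROM analytics.orders
    WHERE order_id IS NULL
       OR total_amount < 0;
    -- An orchestrator (Dataform, or any scheduler) can fail the pipeline if this returns rows.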

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! This week’s episode is also sponsored by Datacoral. They provide an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure. Datacoral’s customers report that their data engineers are able to spend 80% of their work time invested in data transformations, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo! and Facebook, scaling from mere terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit Datacoral.com today to find out more. Are you working on data, analytics, or AI using platforms such as Presto, Spark, or Tensorflow? Check out the Data Orchestration Summit on November 7 at the Computer History Museum in Mountain View. This one-day conference is focused on the key data engineering challenges and solutions around building analytics and AI platforms. Attendees will hear from companies including Walmart, Netflix, Google, and DBS Bank on how they leveraged technologies such as Alluxio, Presto, Spark, Tensorflow, and you will also hear from creators of open source projects including Alluxio, Presto, Airflow, Iceberg, and more! Use discount code PODCAST for 25% off of your ticket, and the first five people to register get free tickets! Register now as early bird tickets are ending this week! Attendees will take away learnings, swag, a free voucher to visit the museum, and a chance to win the latest iPad Pro! You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Lewis Hemens about DataForm, a platform that helps analysts apply software engineering practices to managing their data warehouse.

The Little SAS Book, 6th Edition

A classic that just keeps getting better, The Little SAS Book is essential for anyone learning SAS programming. Lora Delwiche and Susan Slaughter offer a user-friendly approach so that readers can quickly and easily learn the most commonly used features of the SAS language. Each topic is presented in a self-contained, two-page layout complete with examples and graphics. Nearly every section has been revised to ensure that the sixth edition is fully up-to-date. This edition is also interface-independent, written for all SAS programmers whether they use SAS Studio, SAS Enterprise Guide, or the SAS windowing environment. New sections have been added covering PROC SQL, iterative DO loops, DO WHILE and DO UNTIL statements, %DO statements, using variable names with special characters, the ODS EXCEL destination, and the XLSX LIBNAME engine. This title belongs on every SAS programmer's bookshelf. It's a resource not just to get you started, but one you will return to as you continue to improve your programming skills.

Summary The process of exposing your data through a SQL interface has many possible pathways, each with their own complications and tradeoffs. One of the recent options is Rockset, a serverless platform for fast SQL analytics on semi-structured and structured data. In this episode CEO Venkat Venkataramani and SVP of Product Shruti Bhat explain the origins of Rockset, how it is architected to allow for fast and flexible SQL analytics on your data, and how their serverless platform can save you the time and effort of implementing portions of your own infrastructure.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! This week’s episode is also sponsored by Datacoral. They provide an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure. Datacoral’s customers report that their data engineers are able to spend 80% of their work time invested in data transformations, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from mere terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit Datacoral.com today to find out more. You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Shruti Bhat and Venkat Venkataramani about Rockset, a serverless platform for enabling fast SQL queries across all of your data

Interview

Introduction How did you get involved in the area of data management? Can you start by describing what Rockset is and your motivation for creating it?

What are some of the use cases that it enables which would otherwise be impractical or intractable?

How does Rockset fit into the infrastructure and workflow of data teams and what portions of a typical stack does it replace? Can you describe how the Rockset platform is architected and how it has evolved as you onboard more customers? Can you describe the flow of a piece of data as it traverses the full lifecycle in Rockset? How is your storage backend implemented to allow for speed and flexibility in the query layer?

How does it manage distribution, balancing, and durability of the data? What are your strategies for handling node and region failure in the cloud?

You have a whitepaper describing your architecture.

Query Store for SQL Server 2019: Identify and Fix Poorly Performing Queries

Apply the new Query Store feature to identify and fix poorly performing queries in SQL Server. Query Store is an important and recent feature in SQL Server that provides insight into the details of query execution and how that execution has changed over time. Query Store helps to identify queries that aren’t performing well, or that have regressed in their performance. Query Store provides detailed information such as wait stats that you need to resolve root causes, and it allows you to force the use of a known good execution plan. With SQL Server 2017 and later you can automate the correction of regressions in performance. Query Store for SQL Server 2019 helps you protect your database’s performance during upgrades of applications or versions of SQL Server. The book provides fundamental information on how Query Store works and best practices for implementation and use. You will learn to run and interpret built-in reports, configure automatic plan correction, and troubleshoot queries using Query Store when needed. Query Store for SQL Server 2019 helps you master Query Store and bring value to your organization through consistent query execution times and automated correction of regressions.

What You'll Learn
Apply best practices in implementing Query Store on production servers
Detect and correct regressions in query performance
Lower the risk of performance degradation following an upgrade
Use tools and techniques to get the most from Query Store
Automate regression correction and other uses of Query Store

Who This Book Is For
SQL Server developers and administrators responsible for query performance on SQL Server. Anyone responsible for identifying poorly performing queries will be able to use Query Store to find these queries and resolve the underlying issues.
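
A brief hedged T-SQL sketch of the Query Store workflow the book describes: enable the feature, look at captured runtime statistics, and force a known good plan. The database name and the query and plan identifiers are placeholders.

    -- Enable Query Store on a database.
    ALTER DATABASE SalesDb SET QUERY_STORE = ON;
    ALTER DATABASE SalesDb SET QUERY_STORE (OPERATION_MODE = READ_WRITE);

    -- Inspect captured queries and their runtime statistics.
    SELECT TOP (20) q.query_id, p.plan_id, rs.avg_duration, rs.last_execution_time
    FROM sys.query_store_query AS q
    JOIN sys.query_store_plan  AS p  ON p.query_id = q.query_id
    JOIN sys.query_store_runtime_stats AS rs ON rs.plan_id = p.plan_id
    ORDER BY rs.avg_duration DESC;

    -- Force a known good plan for a regressed query (identifiers are placeholders).
    EXEC sp_query_store_force_plan @query_id = 42, @plan_id = 73;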