Azure

SQL Server on Azure Virtual Machines

2020-06-04 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Joey D'Antoni, Tim Radney, Randolph West, Anthony Nocentino (Pure Storage), Allan Hirt, John Martin, Louis Davidson

Cloud Computing Linux Microsoft SQL azure-sql-database data data-engineering relational-databases

Would you like to master deploying SQL Server in the cloud using Microsoft's Azure platform? With the hands-on guidance in this book, you'll explore how to set up and configure SQL Server on Azure Virtual Machines effectively. By the end, you'll have the knowledge to optimize, manage, and deploy your solutions.

What this Book will help me do

Understand platform availability for SQL Server in Azure
Explore SQL Server IaaS and optimize its configuration
Master deploying SQL Server on Linux and Windows in Azure
Configure high-performance storage options tailored to SQL Server
Learn disaster recovery strategies for SQL Server in Azure

Author(s)

Joey D'Antoni, Louis Davidson, Allan Hirt, and their co-authors bring years of experience in database management, cloud architecture, and technical writing. They aim to provide clear and actionable advice for working efficiently with SQL Server on Azure. Their insights come from real-world projects.

Who is it for?

This book is for developers, database administrators, and cloud architects who are looking to learn how to deploy SQL Server solutions on Azure Virtual Machines. If you are transitioning workloads to the cloud or need to manage or optimize such environments, this book will equip you with the skills you need. Basic SQL Server knowledge is helpful.

Building A Data Lake For The Database Administrator At Upsolver

2020-06-02 · Data Engineering Podcast Listen

podcast_episode

by Ori Rafael (Upsolver), Tobias Macey, Yoni Iny (Upsolver)

AI/ML Analytics Flink CloudFormation AWS Lambda BigQuery Data Engineering Data Lake Data Management DWH GDPR/CCPA GitHub +7 more

Summary Data lakes offer a great deal of flexibility and the potential for reduced cost for your analytics, but they also introduce a great deal of complexity. What used to be entirely managed by the database engine is now a composition of multiple systems that need to be properly configured to work in concert. In order to bring the DBA into the new era of data management the team at Upsolver added a SQL interface to their data lake platform. In this episode Upsolver CEO Ori Rafael and CTO Yoni Iny describe how they have grown their platform deliberately to allow for layering SQL on top of a robust foundation for creating and operating a data lake, how to bring more people on board to work with the data being collected, and the unique benefits that a data lake provides. This was an interesting look at the impact that the interface to your data can have on who is empowered to work with it.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

You listen to this show because you love working with data and want to keep your skills up to date. Machine learning is finding its way into every aspect of the data landscape. Springboard has partnered with us to help you take the next step in your career by offering a scholarship to their Machine Learning Engineering career track program. In this online, project-based course every student is paired with a Machine Learning expert who provides unlimited 1:1 mentorship support throughout the program via video conferences. You’ll build up your portfolio of machine learning projects and gain hands-on experience in writing machine learning algorithms, deploying models into production, and managing the lifecycle of a deep learning prototype. Springboard offers a job guarantee, meaning that you don’t have to pay for the program until you get a job in the space. The Data Engineering Podcast is exclusively offering listeners 20 scholarships of $500 to eligible applicants. It only takes 10 minutes and there’s no obligation. Go to dataengineeringpodcast.com/springboard and apply today! Make sure to use the code AISPRINGBOARD when you enroll.

Your host is Tobias Macey and today I’m interviewing Ori Rafael and Yoni Iny about building a data lake for the DBA at Upsolver.

Interview

Introduction How did you get involved in the area of data management? Can you start by sharing your definition of what a data lake is and what it is comprised of? We talked last in November of 2018. How has the landscape of data lake technologies and adoption changed in that time?

How has Upsolver changed or evolved since we last spoke?

How has the evolution of the underlying technologies impacted your implementation and overall product strategy?

What are some of the common challenges that accompany a data lake implementation? How do those challenges influence the adoption or viability of a data lake? How does the introduction of a universal SQL layer change the staffing requirements for building and maintaining a data lake?

What are the advantages of a data lake over a data warehouse if everything is being managed via SQL anyway?

What are some of the underlying realities of the data systems that power the lake which will eventually need to be understood by the operators of the platform? How is the SQL layer in Upsolver implemented?

What are the most challenging or complex aspects of managing the underlying technologies to provide automated partitioning, indexing, etc.?

What are the main concepts that you need to educate your customers on? What are some of the pitfalls that users should be aware of? What features of your platform are often overlooked or underutilized which you think should be more widely adopted? What have you found to be the most interesting, unexpected, or challenging lessons learned while building the technical and business elements of Upsolver? What do you have planned for the future?

Contact Info

Ori

Yoni

yoniiny on GitHub LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Upsolver (Podcast Episode)
DBA == Database Administrator
IDF == Israel Defense Forces
Data Lake
Eventual Consistency
Apache Spark
Redshift Spectrum
Azure Synapse Analytics
SnowflakeDB (Podcast Episode)
BigQuery
Presto (Podcast Episode)
Apache Kafka
Cartesian Product
kSQLDB (Podcast Episode)
Eventador (Podcast Episode)
Materialize (Podcast Episode)
Common Table Expressions
Lambda Architecture
Kappa Architecture
Apache Flink (Podcast Episode)
Reinforcement Learning
Cloudformation
GDPR

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


Introducing Microsoft SQL Server 2019

2020-04-27 · O'Reilly SQL Books O'Reilly Amazon

book

by James Rowland-Jones, Mitchell Pearson, Arun Sirpal, Dave Noderer, Dustin Ryan, Kellyn Gorman, Buck Woody, Allan Hirt

Analytics BI Big Data Cloud Computing Data Management Docker Hadoop HDFS Kubernetes Microsoft NoSQL Power BI +4 more

Introducing Microsoft SQL Server 2019 is the must-have guide for database professionals eager to leverage the latest advancements in SQL Server 2019. This book covers the features and capabilities that make SQL Server 2019 a powerful tool for managing and analyzing data both on-premises and in the cloud.

What this Book will help me do

Understand the new features introduced in SQL Server 2019 and their practical applications.
Confidently manage and analyze relational, NoSQL, and big data within SQL Server 2019.
Implement containerization for SQL Server using Docker and Kubernetes.
Migrate and integrate your databases effectively to use Power BI Report Server.
Query data from Hadoop Distributed File System with Azure Data Studio.

Author(s)

The authors of 'Introducing Microsoft SQL Server 2019' are subject matter experts including Kellyn Gorman, Allan Hirt, and others. With years of professional experience in database management and SQL Server, they bring a wealth of practical insight and knowledge to the book. Their experience spans roles as administrators, architects, and educators in the field.

Who is it for?

This book is aimed at database professionals such as DBAs, architects, and big data engineers who are currently using earlier versions of SQL Server or other database platforms. It is particularly well-suited for professionals aiming to understand and implement SQL Server 2019's new features. Readers should have basic familiarity with SQL Server and RDBMS concepts. If you're looking to explore SQL Server 2019 to improve data management and analytics in your organization, this book is for you.

SQL Server 2019 Administration Inside Out

2020-03-06 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by William Assaf, Sven Aelterman, Melody Zacharias, Randolph West, Joseph D'Antoni, Louis Davidson

API Big Data Cloud Computing Linux PowerShell SQL data data-engineering microsoft-sql-server relational-databases

Conquer SQL Server 2019 administration–from the inside out

Dive into SQL Server 2019 administration–and really put your SQL Server DBA expertise to work. This supremely organized reference packs hundreds of timesaving solutions, tips, and workarounds–all you need to plan, implement, manage, and secure SQL Server 2019 in any production environment: on-premises, cloud, or hybrid. Six experts thoroughly tour DBA capabilities available in SQL Server 2019 Database Engine, SQL Server Data Tools, SQL Server Management Studio, PowerShell, and Azure Portal. You’ll find extensive new coverage of Azure SQL, big data clusters, PolyBase, data protection, automation, and more. Discover how experts tackle today’s essential tasks–and challenge yourself to new levels of mastery.

Explore SQL Server 2019’s toolset, including the improved SQL Server Management Studio, Azure Data Studio, and Configuration Manager
Design, implement, manage, and govern on-premises, hybrid, or Azure database infrastructures
Install and configure SQL Server on Windows and Linux
Master modern maintenance and monitoring with extended events, Resource Governor, and the SQL Assessment API
Automate tasks with maintenance plans, PowerShell, Policy-Based Management, and more
Plan and manage data recovery, including hybrid backup/restore, Azure SQL Database recovery, and geo-replication
Use availability groups for high availability and disaster recovery
Protect data with Transparent Data Encryption, Always Encrypted, new Certificate Management capabilities, and other advances
Optimize databases with SQL Server 2019’s advanced performance and indexing features
Provision and operate Azure SQL Database and its managed instances
Move SQL Server workloads to Azure: planning, testing, migration, and post-migration

SAP on Azure Implementation Guide

2020-02-21 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Nick Morgan, Bartosz Jarkowski

Cloud Computing Microsoft SAP data data-engineering

SAP on Azure Implementation Guide is your essential companion for transitioning your SAP infrastructure to Microsoft Azure. The book takes a practical and detailed approach, providing step-by-step guidance to help you leverage Azure for migrating, scaling, and transforming your SAP solutions effectively.

What this Book will help me do

Understand and implement different SAP to Azure migration strategies, such as lift-and-shift and database transformations.
Learn to ensure high availability and scalability for your SAP systems using Azure's capabilities.
Gain insight into securing SAP workloads on Azure for compliance and safety.
Achieve operational excellence by leveraging cloud-native features of Azure for SAP.
Acquire the skills to optimize SAP infrastructure on Azure for enhanced business value.

Author(s)

Nick Morgan and Bartosz Jarkowski are experienced consultants with extensive knowledge of SAP systems and cloud implementations. With backgrounds in designing and deploying SAP on cloud platforms, they have a thorough understanding of transitioning business-critical applications to modern infrastructures. They bring a wealth of practical experience to this comprehensive guide.

Who is it for?

This book is ideal for SAP architects and IT professionals who are looking to migrate their SAP infrastructures to Azure. Whether you are moderately familiar with SAP or an experienced architect evaluating advanced migration strategies, you'll find the information in this guide precise and actionable to help you achieve your objectives.

Jumpstart Snowflake: A Step-by-Step Guide to Modern Cloud Analytics

2019-12-20 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Donna Strok, Dmitry Shirokov, Dmitry Anoshin

Analytics AWS BI Cloud Computing Data Analytics Databricks DWH ETL/ELT GCP Matillion Microsoft Cyber Security +4 more

Explore the modern market of data analytics platforms and the benefits of using Snowflake computing, the data warehouse built for the cloud. With the rise of cloud technologies, organizations prefer to deploy their analytics using cloud providers such as Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform. Cloud vendors are offering modern data platforms for building cloud analytics solutions to collect data and consolidate into single storage solutions that provide insights for business users. The core of any analytics framework is the data warehouse, and previously customers did not have many choices of platform to use. Snowflake was built specifically for the cloud and it is a true game changer for the analytics market.

This book will help onboard you to Snowflake and presents best practices for deploying and using the Snowflake data warehouse. In addition, it covers modern analytics architecture and use cases. It provides use cases of integration with leading analytics software such as Matillion ETL, Tableau, and Databricks. Finally, it covers migration scenarios for on-premise legacy data warehouses.

What You Will Learn

Know the key functionalities of Snowflake
Set up security and access with cluster
Bulk load data into Snowflake using the COPY command
Migrate from a legacy data warehouse to Snowflake
Integrate the Snowflake data platform with modern business intelligence (BI) and data integration tools

Who This Book Is For

Those working with data warehouse and business intelligence (BI) technologies, and existing and potential Snowflake users

PolyBase Revealed: Data Virtualization with SQL Server, Hadoop, Apache Spark, and Beyond

2019-12-20 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Kevin Feasel

Cosmos ETL/ELT Hadoop Microsoft Oracle Spark SQL apache-spark data data-engineering

Harness the power of PolyBase data virtualization software to make data from a variety of sources easily accessible through SQL queries while using the T-SQL skills you already know and have mastered. PolyBase Revealed shows you how to use the PolyBase feature of SQL Server 2019 to integrate SQL Server with Azure Blob Storage, Apache Hadoop, other SQL Server instances, Oracle, Cosmos DB, Apache Spark, and more. You will learn how PolyBase can help you reduce storage and other costs by avoiding the need for ETL processes that duplicate data in order to make it accessible from one source. PolyBase makes SQL Server into that one source, and T-SQL is your golden ticket.

The book also covers PolyBase scale-out clusters, allowing you to distribute PolyBase queries among several SQL Server instances, thus improving performance. With great flexibility comes great complexity, and this book shows you where to look when queries fail, complete with coverage of internals, troubleshooting techniques, and where to find more information on obscure cross-platform errors. Data virtualization is a key target for Microsoft with SQL Server 2019. This book will help you keep your skills current, remain relevant, and build new business and career opportunities around Microsoft’s product direction.

What You Will Learn

Install and configure PolyBase as a stand-alone service, or unlock its capabilities with a scale-out cluster
Understand how PolyBase interacts with outside data sources while presenting their data as regular SQL Server tables
Write queries combining data from SQL Server, Apache Hadoop, Oracle, Cosmos DB, Apache Spark, and more
Troubleshoot PolyBase queries using SQL Server Dynamic Management Views
Tune PolyBase queries using statistics and execution plans
Solve common business problems, including "cold storage" of infrequently accessed data and simplifying ETL jobs

Who This Book Is For

SQL Server developers working in multi-platform environments who want one easy way of communicating with, and collecting data from, all of these sources

Expert Performance Indexing in SQL Server 2019: Toward Faster Results and Lower Maintenance

2019-11-28 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Jason Strate

SQL XML data data-engineering microsoft-sql-server relational-databases

Take a deep dive into perhaps the single most important facet of good performance: indexes, and how to best use them. Recent updates to SQL Server have made it possible to create indexes in situations that in the past would have prevented their use. Other improvements covered in this book include new dynamic management views, the ability to pause and resume index maintenance, and the ability to more easily recover from failures during index creation and maintenance operations. This new edition also brings new content around the indexing of columnstore and in-memory tables, showing how these new types of tables and the queries that execute against them can also benefit from good indexing practices.

The book begins with explanations of the types of indexes and how they are stored in databases. Moving deeper into the topic, and further into the book, you will look at the statistics that are accumulated both by indexes and on indexes. You will better understand what indexes are doing in the database and what can be done to mitigate and improve their effect on performance. You will get a look at the Index Advisor now available in Azure SQL Database, and learn how to review and maintain the health of your indexes. The final chapters present a guided tour through a number of scenarios showing approaches you can take to investigate, mitigate, and improve the performance of your database.

What You Will Learn

Properly index row store, columnstore, and in-memory tables
Review statistics to understand indexing choices made by the optimizer
Apply indexing strategies such as covering indexes, included columns, and index intersections
Recognize and remove unnecessary indexes
Design effective indexes for full-text, spatial, and XML data types
Manage the big picture: Encompass all indexes in a database, and all database instances on a server

Who This Book Is For

Database administrators and developers who are ready to lift the performance of their database environment by thoughtfully building indexes to speed up queries that matter the most and make a difference to the business

T-SQL Window Functions: For data analysis and beyond, 2nd Edition

2019-11-13 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Itzik Ben-Gan

BI Microsoft SQL data data-engineering

Use window functions to write simpler, better, more efficient T-SQL queries

Most T-SQL developers recognize the value of window functions for data analysis calculations. But they can do far more, and recent optimizations make them even more powerful. In T-SQL Window Functions, renowned T-SQL expert Itzik Ben-Gan introduces breakthrough techniques for using them to handle many common T-SQL querying tasks with unprecedented elegance and power. Using extensive code examples, he guides you through window aggregate, ranking, distribution, offset, and ordered set functions. You'll find a detailed section on optimization, plus an extensive collection of business solutions, including novel techniques available in no other book.

Microsoft MVP Itzik Ben-Gan shows how to:

• Use window functions to improve queries you previously built with predicates
• Master essential SQL windowing concepts, and efficiently design window functions
• Effectively utilize partitioning, ordering, and framing
• Gain practical in-depth insight into window aggregate, ranking, offset, and statistical functions
• Understand how the SQL standard supports ordered set functions, and find working solutions for functions not yet available in the language
• Preview advanced Row Pattern Recognition (RPR) data analysis techniques
• Optimize window functions in SQL Server and Azure SQL Database, making the most of indexing, parallelism, and more
• Discover a full library of window function solutions for common business problems

About This Book

• For developers, DBAs, data analysts, data scientists, BI professionals, and power users familiar with T-SQL queries
• Addresses any edition of the SQL Server 2019 database engine or later, as well as Azure SQL Database

Get all code samples at: MicrosoftPressStore.com/TSQLWindowFunctions/downloads
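
For readers unfamiliar with the partitioning, ordering, and framing mentioned above, here is a minimal sketch. It is not taken from the book: it uses Python's built-in `sqlite3` module with SQLite (3.25 or later) rather than SQL Server, and the `orders` table, its columns, and the `running_totals` helper are hypothetical names chosen for illustration. The `OVER` clause shown here is standard SQL that SQL Server also accepts for these two functions.

```python
import sqlite3

# Hypothetical sample data: (customer, order_day, amount)
SAMPLE_ORDERS = [
    ("alice", 1, 10.0),
    ("alice", 2, 20.0),
    ("alice", 3, 5.0),
    ("bob", 1, 7.0),
    ("bob", 2, 3.0),
]


def running_totals(data):
    """Return (customer, order_day, row_number, running_total) per order."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (customer TEXT, order_day INTEGER, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", data)
    query = """
        SELECT customer,
               order_day,
               -- ranking function: numbers the rows within each customer
               ROW_NUMBER() OVER (
                   PARTITION BY customer ORDER BY order_day
               ) AS rn,
               -- window aggregate: the frame runs from the partition's
               -- first row up to and including the current row
               SUM(amount) OVER (
                   PARTITION BY customer
                   ORDER BY order_day
                   ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
               ) AS running_total
        FROM orders
        ORDER BY customer, order_day
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in running_totals(SAMPLE_ORDERS):
        print(row)
```

Each row keeps its own identity while also seeing an aggregate over its window, which is the property that sets window functions apart from `GROUP BY`.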

SQL Server 2019 Revealed: Including Big Data Clusters and Machine Learning

2019-10-18 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Bob Ward (Azure Data)

AI/ML Analytics Big Data Data Analytics Data Lake ETL/ELT Hadoop HDFS Java Kubernetes Linux MongoDB +8 more

Get up to speed on the game-changing developments in SQL Server 2019. No longer just a database engine, SQL Server 2019 is cutting edge with support for machine learning (ML), big data analytics, Linux, containers, Kubernetes, Java, and data virtualization to Azure. This is not a book on traditional database administration for SQL Server. It focuses on all that is new for one of the most successful modernized data platforms in the industry. It is a book for data professionals who already know the fundamentals of SQL Server and want to up their game by building their skills in some of the hottest new areas in technology.

SQL Server 2019 Revealed begins with a look at the project team's goal to integrate the world of big data with SQL Server into a major product release. The book then dives into the details of key new capabilities in SQL Server 2019 using a “learn by example” approach for Intelligent Performance, security, mission-critical availability, and features for the modern developer. Also covered are enhancements to SQL Server 2019 for Linux, along with a comprehensive look at SQL Server using containers and Kubernetes clusters. The book concludes by showing you how to virtualize your data access with PolyBase to Oracle, MongoDB, Hadoop, and Azure, allowing you to reduce the need for expensive extract, transform, and load (ETL) applications. You will then learn how to take your knowledge of containers, Kubernetes, and PolyBase to build a comprehensive solution called Big Data Clusters, which is a marquee feature of 2019. You will also learn how to gain access to Spark, SQL Server, and HDFS to build intelligence over your own data lake and deploy end-to-end machine learning applications.

What You Will Learn

Implement Big Data Clusters with SQL Server, Spark, and HDFS
Create a Data Hub with connections to Oracle, Azure, Hadoop, and other sources
Combine SQL and Spark to build a machine learning platform for AI applications
Boost your performance with no application changes using Intelligent Performance
Increase security of your SQL Server through Secure Enclaves and Data Classification
Maximize database uptime through online indexing and Accelerated Database Recovery
Build new modern applications with Graph, ML Services, and T-SQL Extensibility with Java
Improve your ability to deploy SQL Server on Linux
Gain in-depth knowledge to run SQL Server with containers and Kubernetes
Know all the new database engine features for performance, usability, and diagnostics
Use the latest tools and methods to migrate your database to SQL Server 2019
Apply your knowledge of SQL Server 2019 to Azure

Who This Book Is For

IT professionals and developers who understand the fundamentals of SQL Server and wish to focus on learning about the new, modern capabilities of SQL Server 2019. The book is for those who want to learn about SQL Server 2019 and the new Big Data Clusters and AI feature set, support for machine learning and Java, how to run SQL Server with containers and Kubernetes, and increased capabilities around Intelligent Performance, advanced security, and high availability.

Mastering SQL Server 2017

2019-08-22 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Christian Cote, Milos Radivojevic, William Durkin, Dejan Sarka, Matija Lah

AI/ML BI Docker DWH ETL/ELT JSON Linux Microsoft Python SQL SQL Server SSIS +4 more

Leverage the power of SQL Server 2017 Integration Services to build data integration solutions with ease

Key Features

Work with temporal tables to access information stored in a table at any time
Get familiar with the latest features in SQL Server 2017 Integration Services
Program and extend your packages to enhance their functionality

Book Description

Microsoft SQL Server 2017 uses the power of R and Python for machine learning and containerization-based deployment on Windows and Linux. By learning how to use the features of SQL Server 2017 effectively, you can build scalable apps and easily perform data integration and transformation. You'll start by brushing up on the features of SQL Server 2017. This Learning Path will then demonstrate how you can use Query Store, columnstore indexes, and In-Memory OLTP in your apps. You'll also learn to integrate Python code in SQL Server and graph database implementations for development and testing. Next, you'll get up to speed with designing and building SQL Server Integration Services (SSIS) data warehouse packages using SQL Server Data Tools. Toward the concluding chapters, you'll discover how to develop SSIS packages designed to maintain a data warehouse using the data flow and other control flow tasks. By the end of this Learning Path, you'll be equipped with the skills you need to design efficient, high-performance database applications with confidence.

This Learning Path includes content from the following Packt books:

SQL Server 2017 Developer's Guide by Milos Radivojevic, Dejan Sarka, et al.
SQL Server 2017 Integration Services Cookbook by Christian Cote, Dejan Sarka, et al.

What you will learn

Use columnstore indexes to make storage and performance improvements
Extend database design solutions using temporal tables
Exchange JSON data between applications and SQL Server
Migrate historical data to Microsoft Azure by using Stretch Database
Design the architecture of a modern Extract, Transform, and Load (ETL) solution
Implement ETL solutions using Integration Services for both on-premise and Azure data

Who this book is for

This Learning Path is for database developers and solution architects looking to develop ETL solutions with SSIS, and explore the new features in SSIS 2017. Advanced analysis practitioners, business intelligence developers, and database consultants dealing with performance tuning will also find this book useful. Basic understanding of database concepts and T-SQL is required to get the best out of this Learning Path.

Professional Azure SQL Database Administration - Second Edition

2019-07-19 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Ahmad Osama

Cloud Computing PowerShell Cyber Security SQL azure-sql-database data data-engineering relational-databases

Professional Azure SQL Database Administration serves as your comprehensive guide to mastering the management and optimization of cloud-based Azure SQL Database solutions. With the differences and unique features of Azure SQL Database compared to the on-premise SQL Server, this book offers a clear roadmap to efficiently migrate, secure, scale, and maintain these databases in the cloud.

What this Book will help me do

Understand the differences between Azure SQL Database and on-premise SQL Server and their practical implications.
Learn techniques to migrate existing SQL Server databases to Azure SQL Database seamlessly.
Discover advanced ways to optimize database performance and scalability leveraging cloud capabilities.
Master security strategies for Azure SQL databases, including backup, disaster recovery, and automated tasks.
Develop proficiency in using tools such as PowerShell to automate and manage routine database administration tasks.

Author(s)

Ahmad Osama is an experienced database professional and author specializing in SQL Server and Azure SQL Database administration. With a robust background in database migration, maintenance, and performance tuning, Ahmad expertly bridges the gap between theory and practice. His approachable writing style makes complex database topics accessible to professionals seeking to expand their expertise.

Who is it for?

Professional Azure SQL Database Administration is an essential resource for database administrators, developers, and IT professionals keen on developing their knowledge about Azure SQL Database administration and cloud database solutions. Whether you're transitioning from traditional SQL Server environments or looking to optimize your database strategies in the cloud, this book caters to professionals with intermediate to advanced experience in database management and programming with SQL.

Build Your Data Analytics Like An Engineer With DBT

2019-05-20 · Data Engineering Podcast Listen

podcast_episode

by Drew Banin (Fishtown Analytics), Tobias Macey

AI/ML Airflow Analytics AWS BI Big Data BigQuery CI/CD Data Analytics Data Engineering Data Lake Data Management +17 more

Summary In recent years the traditional approach to building data warehouses has shifted from transforming records before loading, to transforming them afterwards. As a result, the tooling for those transformations needs to be reimagined. The data build tool (dbt) is designed to bring battle tested engineering practices to your analytics pipelines. By providing an opinionated set of best practices it simplifies collaboration and boosts confidence in your data teams. In this episode Drew Banin, creator of dbt, explains how it got started, how it is designed, and how you can start using it today to create reliable and well-tested reports in your favorite data warehouse.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!

Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.

You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.

Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.

To help other people find the show please leave a review on iTunes and tell your friends and co-workers.

Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Your host is Tobias Macey and today I’m interviewing Drew Banin about DBT, the Data Build Tool, a toolkit for building analytics the way that developers build applications.

Interview

Introduction

How did you get involved in the area of data management?

Can you start by explaining what DBT is and your motivation for creating it?

Where does it fit in the overall landscape of data tools and the lifecycle of data in an analytics pipeline?

Can you talk through the workflow for someone using DBT?

One of the useful features of DBT for stability of analytics is the ability to write and execute tests. Can you explain how those are implemented?

The packaging capabilities are beneficial for enabling collaboration. Can you talk through how the packaging system is implemented?

Are these packages driven by Fishtown Analytics or the dbt community?

What are the limitations of modeling everything as a SELECT statement?

Making SQL code reusable is notoriously difficult. How does the Jinja templating of DBT address this issue and what are the shortcomings?

What are your thoughts on higher level approaches to SQL that compile down to the specific statements?

Can you explain how DBT is implemented and how the design has evolved since you first began working on it?

What are some of the features of DBT that are often overlooked which you find particularly useful?

What are some of the most interesting/unexpected/innovative ways that you have seen DBT used?

What are the additional features that the commercial version of DBT provides?

What are some of the most useful or challenging lessons that you have learned in the process of building and maintaining DBT?

When is it the wrong choice?

What do you have planned for the future of DBT?

Contact Info

Email @drebanin on Twitter drebanin on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

DBT Fishtown Analytics 8Tracks Internet Radio Redshift Magento Stitch Data Fivetran Airflow Business Intelligence Jinja template language BigQuery Snowflake Version Control Git Continuous Integration Test Driven Development Snowplow Analytics

Podcast Episode

dbt-utils We Can Do Better Than SQL blog post from EdgeDB EdgeDB Looker LookML

Podcast Interview

Presto DB

Podcast Interview

Spark SQL Hive Azure SQL Data Warehouse Data Warehouse Data Lake Data Council Conference Slowly Changing Dimensions dbt Archival Mode Analytics Periscope BI dbt docs dbt repository

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Learn T-SQL Querying

2019-05-03 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Pam Lahoud , Pedro Lopes

Microsoft SQL data data-engineering

Dive into the world of T-SQL with 'Learn T-SQL Querying,' a book designed to enhance your database querying skills and help you master Microsoft's SQL Server and Azure SQL Database. Through this guide, you'll explore best practices, learn advanced techniques for analyzing execution plans, and create efficient T-SQL queries.

What this Book will help me do

Understand the fundamentals of query optimization to write performant T-SQL queries.
Analyze query execution plans to identify and troubleshoot performance issues effectively.
Utilize dynamic management views and functions to monitor and optimize query performance.
Implement features like Query Store to streamline troubleshooting and maintain performance changes.
Avoid common T-SQL anti-patterns and embrace best practices to ensure scalable query design.

Author(s)

Pedro Lopes and Pam Lahoud bring years of expertise in SQL Server and database systems. Pedro has extensive experience as a database engineer, where he specializes in query processing and optimization. Pam has a deep understanding of T-SQL development, focusing on practical solutions. Together, they provide in-depth insights and actionable advice.

Who is it for?

This book is perfect for database administrators, database developers, and data analysts at any level looking to improve their T-SQL expertise. Beginners will gain foundational skills in T-SQL querying, while experienced professionals will find advanced strategies for optimizing SQL Server performance. Readers aiming to master both practical querying and troubleshooting will benefit the most.

Performing Fast Data Analytics Using Apache Kudu - Episode 64

2019-01-07 · Data Engineering Podcast Listen

podcast_episode

by Brock Noland (PhData) , Jordan Birdsell (PhData) , Tobias Macey

Analytics Data Analytics Data Engineering Data Management ETL/ELT GitHub Hadoop Apache HBase HDFS Hive Iceberg Oracle +5 more

Summary

The Hadoop platform is purpose-built for processing large, slow-moving data in long-running batch jobs. As the ecosystem around it has grown, so has the need for fast data analytics on fast-moving data. To fill this need the Kudu project was created with a column-oriented table format that was tuned for high volumes of writes and rapid query execution across those tables. For a perfect pairing, they made it easy to connect to the Impala SQL engine. In this episode Brock Noland and Jordan Birdsell from PhData explain how Kudu is architected, how it compares to other storage systems in the Hadoop orbit, and how to start integrating it into your analytics pipeline.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.

Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.

To help other people find the show please leave a review on iTunes or Google Play Music, tell your friends and co-workers, and share it on social media.

Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat.

Your host is Tobias Macey and today I’m interviewing Brock Noland and Jordan Birdsell about Apache Kudu and how it is able to provide fast analytics on fast data in the Hadoop ecosystem.

Interview

Introduction

How did you get involved in the area of data management?

Can you start by explaining what Kudu is and the motivation for building it?

How does it fit into the Hadoop ecosystem? How does it compare to the work being done on the Iceberg table format?

What are some of the common application and system design patterns that Kudu supports?

How is Kudu architected and how has it evolved over the life of the project?

There are many projects in and around the Hadoop ecosystem that rely on Zookeeper as a building block for consensus. What was the reasoning for using Raft in Kudu?

How does the storage layer in Kudu differ from what would be found in systems like Hive or HBase?

What are the implementation details in the Kudu storage interface that have had the greatest impact on its overall speed and performance?

A number of the projects built for large scale data processing were not initially built with a focus on operational simplicity. What are the features of Kudu that simplify deployment and management of production infrastructure?

What was the motivation for using C++ as the language target for Kudu?

If you were to start the project over today what would you do differently?

What are some situations where you would advise against using Kudu?

What have you found to be the most interesting/unexpected/challenging lessons learned in the process of building and maintaining Kudu?

What are you most excited about for the future of Kudu?

Contact Info

Brock

LinkedIn @brocknoland on Twitter

Jordan

LinkedIn @jordanbirdsell jbirdsell on GitHub

PhData

Website phdata on GitHub @phdatainc on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Kudu PhData Getting Started with Apache Kudu Thomson Reuters Hadoop Oracle Exadata Slowly Changing Dimensions HDFS S3 Azure Blob Storage State Farm Stanley Black & Decker ETL (Extract, Transform, Load) Parquet

Podcast Episode

ORC HBase Spark

Podcast Episode

Hands-On Data Science with SQL Server 2017

2018-11-29 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Vladimír Mužný , Marek Chmel

Analytics BI Big Data Data Science Power BI Python SQL data data-engineering

In "Hands-On Data Science with SQL Server 2017," you will discover how to implement end-to-end data analysis workflows, leveraging SQL Server's robust capabilities. This book guides you through collecting, cleaning, and transforming data, querying for insights, creating compelling visualizations, and even constructing predictive models for sophisticated analytics.

What this Book will help me do

Grasp the essential data science processes and how SQL Server supports them.
Conduct data analysis and create interactive visualizations using Power BI.
Build, train, and assess predictive models using SQL Server tools.
Integrate SQL Server with R, Python, and Azure for enhanced functionality.
Apply best practices for managing and transforming big data with SQL Server.

Author(s)

Marek Chmel and Vladimír Mužný bring their extensive experience in data science and database management to this book. Marek is a seasoned database specialist with a strong background in SQL, while Vladimír is known for his instructional expertise in analytics and data manipulation. Together, they focus on providing actionable insights and practical examples tailored for data professionals.

Who is it for?

This book is an ideal resource for aspiring and seasoned data scientists, data analysts, and database professionals aiming to deepen their expertise in SQL Server for data science workflows. Beginners with fundamental SQL knowledge will find it a guided entry into data science applications. It is especially suited for those who aim to implement data-driven solutions in their roles while leveraging SQL's capabilities.

Expert SQL Server Transactions and Locking: Concurrency Internals for SQL Server Practitioners

2018-10-08 · O'Reilly SQL Books O'Reilly Amazon

book

by Dmitri Korotkevitch

Microsoft SQL microsoft sql server

Master SQL Server’s Concurrency Model so you can implement high-throughput systems that deliver transactional consistency to your application customers. This book explains how to troubleshoot and address blocking problems and deadlocks, and write code and design database schemas to minimize concurrency issues in the systems you develop. SQL Server’s Concurrency Model is one of the least understood parts of the SQL Server Database Engine. Almost every SQL Server system experiences hard-to-explain concurrency and blocking issues, and it can be extremely confusing to solve those issues without a base of knowledge in the internals of the Engine. While confusing from the outside, the SQL Server Concurrency Model is based on several well-defined principles that are covered in this book. Understanding the internals surrounding SQL Server’s Concurrency Model helps you build high-throughput systems in multi-user environments. This book guides you through the Concurrency Model and elaborates how SQL Server supports transactional consistency in the databases. The book covers all versions of SQL Server, including Microsoft Azure SQL Database, and it includes coverage of new technologies such as In-Memory OLTP and Columnstore Indexes.

What You'll Learn

Know how transaction isolation levels affect locking behavior and concurrency
Troubleshoot and address blocking issues and deadlocks
Provide required data consistency while minimizing concurrency issues
Design efficient transaction strategies that lead to scalable code
Reduce concurrency problems through good schema design
Understand concurrency models for In-Memory OLTP and Columnstore Indexes
Reduce blocking during index maintenance, batch data load, and similar tasks

Who This Book Is For

SQL Server developers, database administrators, and application architects who are developing highly concurrent applications. The book is for anyone interested in the technical aspects of creating and troubleshooting high-throughput systems that respond swiftly to user requests.

Power BI Data Analysis and Visualization

2018-09-10 · O'Reilly Data Visualization Books O'Reilly Amazon

book

by Suren Machiraju , Suraj Gaurav

BI CRM Dashboard DataViz ERP Microsoft Power BI SQL SQL Server Data Streaming business-intelligence data +4 more

Power BI Data Analysis and Visualization provides a roadmap to vendor choices and highlights why Microsoft’s Power BI is a very viable, cost effective option for data visualization. The book covers the fundamentals and most commonly used features of Power BI, but also includes an in-depth discussion of advanced Power BI features such as natural language queries; embedding Power BI dashboards; and live streaming data. It discusses real solutions to extract data from the ERP application, Microsoft Dynamics CRM, and also offers ways to host the Power BI Dashboard as an Azure application, extracting data from popular data sources like Microsoft SQL Server and open-source PostgreSQL. Authored by Microsoft experts, this book uses real-world coding samples and screenshots to spotlight how to create reports, embed them in a webpage, view them across multiple platforms, and more. Business owners, IT professionals, data scientists, and analysts will benefit from this thorough presentation of Power BI and its functions.

SQL Server 2017 Query Performance Tuning: Troubleshoot and Optimize Query Performance

2018-09-03 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Grant Fritchey

Microsoft SQL data data-engineering microsoft-sql-server relational-databases

Identify and fix causes of poor performance. You will learn Query Store, adaptive execution plans, and automated tuning on the Microsoft Azure SQL Database platform. Anyone responsible for writing or creating T-SQL queries will find valuable the insight into bottlenecks, including how to recognize them and eliminate them. This book covers the latest in performance optimization features and techniques and is current with SQL Server 2017. If your queries are not running fast enough and you’re tired of phone calls from frustrated users, then this book is the answer to your performance problems. SQL Server 2017 Query Performance Tuning is about more than quick tips and fixes. You’ll learn to be proactive in establishing performance baselines using tools such as Performance Monitor and Extended Events. You’ll recognize bottlenecks and defuse them before the phone rings. You’ll learn some quick solutions too, but emphasis is on designing for performance and getting it right. The goal is to head off trouble before it occurs.

What You'll Learn

Use Query Store to understand and easily change query performance
Recognize and eliminate bottlenecks leading to slow performance
Deploy quick fixes when needed, following up with long-term solutions
Implement best practices in T-SQL to minimize performance risk
Design in the performance that you need through careful query and index design
Utilize the latest performance optimization features in SQL Server 2017
Protect query performance during upgrades to the newer versions of SQL Server

Who This Book Is For

Developers and database administrators with responsibility for application performance in SQL Server environments. Anyone responsible for writing or creating T-SQL queries will find valuable the insight into bottlenecks, including how to recognize them and eliminate them.

Cosmos DB for MongoDB Developers: Migrating to Azure Cosmos DB and Using the MongoDB API

2018-08-09 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Manish Sharma

API Cosmos MongoDB NoSQL data data-engineering nosql-databases

Learn Azure Cosmos DB and its MongoDB API with hands-on samples and advanced features such as the multi-homing API, geo-replication, custom indexing, TTL, request units (RU), consistency levels, partitioning, and much more. Each chapter explains Azure Cosmos DB’s features and functionalities by comparing it to MongoDB with coding samples. Cosmos DB for MongoDB Developers starts with an overview of NoSQL and Azure Cosmos DB and moves on to demonstrate the difference between geo-replication of Azure Cosmos DB compared to MongoDB. Along the way you’ll cover subjects including indexing, partitioning, consistency, and sizing, all of which will help you understand the concepts of request units and how this calculation is derived from an existing MongoDB deployment’s usage. The next part of the book shows you the process and strategies for migrating to Azure Cosmos DB. You will learn the day-to-day scenarios of using Azure Cosmos DB, its sizing strategies, and optimizing techniques for the MongoDB API. This information will help you when planning to migrate from MongoDB or if you would like to compare MongoDB to the Azure Cosmos DB MongoDB API before considering the switch.

What You Will Learn

Migrate from MongoDB to Azure Cosmos DB and understand the migration strategies
Develop a sample application using MongoDB’s client driver
Make use of sizing best practices and performance optimization scenarios
Optimize MongoDB’s partition mechanism and indexing

Who This Book Is For

MongoDB developers who wish to learn Azure Cosmos DB. It specifically caters to a technical audience working on MongoDB.
