talk-data.com

Topic

Azure

Microsoft Azure

cloud cloud_provider microsoft infrastructure

723 tagged

Activity Trend: peak of 278 per quarter, 2020-Q1 to 2026-Q1

Activities

723 activities · Newest first

Summary The perennial challenge of data engineers is ensuring that information is integrated reliably. While it is straightforward to know whether a synchronization process succeeded, it is not always clear whether every record was copied correctly. To quickly identify whether and how two data systems are out of sync, Gleb Mezhanskiy and Simon Eskildsen partnered to create the open source data-diff utility. In this episode they explain how the utility is implemented to run quickly and how you can start using it in your own data workflows to ensure that your data warehouse isn’t missing any records from your source systems.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!

Random data doesn’t do it, and production data is not safe (or legal) for developers to use. What if you could mimic your entire production database to create a realistic dataset with zero sensitive data? Tonic.ai does exactly that. With Tonic, you can generate fake data that looks, acts, and behaves like production because it’s made from production. Using universal data connectors and a flexible API, Tonic integrates seamlessly into your existing pipelines and allows you to shape and size your data to the scale, realism, and degree of privacy that you need. The platform offers advanced subsetting, secure de-identification, and ML-driven data synthesis to create targeted test data for all of your pre-production environments. Your newly mimicked datasets are safe to share with developers, QA, data scientists, heck, even distributed teams around the world. Shorten development cycles, eliminate the need for cumbersome data pipeline work, and mathematically guarantee the privacy of your data, with Tonic.ai. Data Engineering Podcast listeners can sign up for a free 2-week sandbox account; go to dataengineeringpodcast.com/tonic today to give it a try!

Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.

RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or

Come hang with Airflow practitioners from around the world using Airflow AND other data tools to power their data practice. From Databricks to Glue to Azure Data Factory, smart businesses make the right decision to standardize on Airflow for what it’s best at while using the other systems for what they are best at.

Summary The most complicated part of data engineering is the effort involved in making the raw data fit into the narrative of the business. Master Data Management (MDM) is the process of building consensus around what the information actually means in the context of the business and then shaping the data to match those semantics. In this episode Malcolm Hawker shares his years of experience working in this domain to explore the combination of technical and social skills that are necessary to make an MDM project successful both at the outset and over the long term.

Announcements

RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.

Summary Metadata is the lifeblood of your data platform, providing information about what is happening in your systems. A variety of platforms have been developed to capture and analyze that information to great effect, but they are inherently limited in their utility due to their nature as storage systems. In order to level up their value a new trend of active metadata is being implemented, allowing use cases like keeping BI reports up to date, auto-scaling your warehouses, and automated data governance. In this episode Prukalpa Sankar joins the show to talk about the work she and her team at Atlan are doing to push this capability into the mainstream.

Announcements

Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.

Your host is Tobias Macey and today I’m interviewing Prukalpa Sankar about how data platforms can benefit from the idea of "active metadata" and the work that she and her team at Atlan are doing to make it a reality.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what "active metadata" is and how it differs from the current approaches to metadata systems?
What are some of the use cases that "active metadata" can enable for data producers and consumers?
What are the points of friction that those users encounter in the current formulation of metadata systems?
Central metadata systems/data catalogs came about as a solution to the challenge of integrating every data tool with every other data tool, giving a single place to integrate. What are the lessons that are being learned from the "modern data stack" that can be applied to centralized metadata?
Can you describe the approach that you are taking at Atlan to enable the adoption of "active metadata"?
What are the architectural capabilities that you had to build to power the outbound traffic flows?
How are you addressing the N x M integration problem for pushing metadata into the necessary contexts at Atlan?
What are the interfaces that are necessary for receiving systems to be able to make use of the metadata that is being delivered?
How does the type/category of metadata impact the type of integration that is necessary?
What are some of the automation possibilities that metadata activation offers for data teams?
What are the cases where you still need a human in the loop?
What are the most interesting, innovative, or unexpected ways that you have seen active metadata capabilities used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on activating metadata for your users?
When is an active approach to metadata the wrong choice?
What do you have planned for the future of Atlan and active metadata?

Contact Info

LinkedIn
@prukalpa on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.

Links

Atlan
What is Active Metadata?
Segment
Podcast Episode
Zapier
ArgoCD
Kubernetes
Wix
AWS Lambda
Modern Data Culture Blog Post

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Summary Unstructured data takes many forms in an organization. From a data engineering perspective that often means things like JSON files, audio or video recordings, images, etc. Another category of unstructured data that every business deals with is PDFs, Word documents, workstation backups, and countless other types of information. Aparavi was created to tame the sprawl of information across machines, datacenters, and clouds so that you can reduce the amount of duplicate data and save time and money on managing your data assets. In this episode Rod Christensen shares the story behind Aparavi and how you can use it to cut costs and gain value for the long tail of your unstructured data.

Announcements

This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Signup for the SaaS product at dataengineeringpodcast.com/acryl

Your host is Tobias Macey and today I’m interviewing Rod Christensen about Aparavi, a platform designed to find and unlock the value of data, no matter where it lives.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Aparavi is and the story behind it?
Who are the target customers for Aparavi and how does that inform your product roadmap and messaging?
What are some of th

Summary The best way to make sure that you don’t leak sensitive data is to never have it in the first place. The team at Skyflow decided that the second best way is to build a storage system dedicated to securely managing your sensitive information and making it easy to integrate with your applications and data systems. In this episode Sean Falconer explains the idea of a data privacy vault and how this new architectural element can drastically reduce the potential for making a mistake with how you manage regulated or personally identifiable information.

Announcements

Atlan is the metadata hub for your data ecosystem. Instead of locking all of that information into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how you can take advantage of active metadata and escape the chaos.

Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.

Your host is Tobias Macey and today I’m interviewing Sean Falconer about the idea of a data privacy vault and how the Skyflow team are working to make it turn-key.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Skyflow is and the story behind it?
What is a "data privacy vault" and how does it differ from strategies such as privacy engineering or existing data governance patterns?
What are the primary use cases and capabilities that you are focused on solving for with Skyflow?
Who is the target customer for Skyflow (e.g. how does it enter an organization)?
How is the Skyflow platform architected?
How have the design and goals of the system changed or evolved over time?
Can you describe the process of integrating with Skyflow at the application level?
For organizations that are building analytical capabilities on top of the data managed in their applications, what are the interactions with Skyflow at each of the stages in the data lifecycle?
One of the perennial problems with distributed systems is the challenge of joining data across machine boundaries. How do you mitigate that problem?
On your website there are different "vaults" advertised in the form of healthcare, fintech, and PII. What are the different requirements across each of those problem domains?
What are the commonalities?
As a relatively new company in an emerging product category, what are some of the customer education challenges that you are facing?
What are the most interesting, innovative, or unexpected ways that you have seen Skyflow used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Skyflow?
When is Skyflow the wrong choice?
What do you have planned for the future of Skyflow?

Contact Info

LinkedIn
@seanfalconer on Twitter
Website

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.init, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers.

Links

Skyflow
Privacy Engineering
Data Governance
Homomorphic Encryption
Polymorphic Encryption


Summary A large fraction of data engineering work involves moving data from one storage location to another in order to support different access and query patterns. Singlestore aims to cut down on the number of database engines that you need to run so that you can reduce the amount of copying that is required. By supporting fast, in-memory row-based queries and columnar on-disk representation, it lets your transactional and analytical workloads run in the same database. In this episode SVP of engineering Shireesh Thota describes the impact on your overall system architecture that Singlestore can have and the benefits of using a cloud-native database engine for your next application.

Announcements

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.

So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.

SAP S/4HANA Systems in Hyperscaler Clouds: Deploying SAP S/4HANA in AWS, Google Cloud, and Azure

This book helps SAP architects and SAP Basis administrators deploy and operate SAP S/4HANA systems on the most common public cloud platforms. Market-leading cloud offerings are covered, including Amazon Web Services, Microsoft Azure, and Google Cloud. You will gain an end-to-end understanding of the initial implementation of SAP S/4HANA systems on those platforms. You will learn how to move away from the big monolithic SAP ERP systems and arrive at an environment with a central SAP S/4HANA system as the digital core surrounded by cloud-native services. The book begins by introducing the core concepts of Hyperscaler cloud platforms that are relevant to SAP. You will learn about the architecture of SAP S/4HANA systems on public cloud platforms, with specific content provided for each of the major platforms. The book simplifies the deployment of SAP S/4HANA systems in public clouds by providing step-by-step instructions and helping you deal with the complexity of such a deployment. Content in the book is based on best practices, industry lessons learned, and architectural blueprints, helping you develop deep insights into the operations of SAP S/4HANA systems on public cloud platforms. Reading this book enables you to build and operate your own SAP S/4HANA system in the public cloud with a minimum of effort.

What You Will Learn
Choose the right Hyperscaler platform for your future SAP S/4HANA workloads
Start deploying your first SAP S/4HANA system in the public cloud
Avoid typical pitfalls during your implementation
Apply and leverage cloud-native services for your SAP S/4HANA system
Save costs by choosing the right architecture and build a robust architecture for your most critical SAP systems
Meet your business’ criteria for availability and performance by having the right sizing in place
Identify further use cases when operating SAP S/4HANA in the public cloud

Who This Book Is For
SAP architects looking for an answer on how to move SAP S/4HANA systems from on-premises into the cloud; those planning to deploy to one of the three major platforms from Amazon Web Services, Microsoft Azure, and Google Cloud Platform; and SAP Basis administrators seeking a detailed and realistic description of how to get started on a migration to the cloud and how to drive that cloud implementation to completion.

SQL Server Advanced Troubleshooting and Performance Tuning

This practical book provides a comprehensive overview of troubleshooting and performance tuning best practices for Microsoft SQL Server. Database engineers, including database developers and administrators, will learn how to identify performance issues, troubleshoot the system in a holistic fashion, and properly prioritize tuning efforts to attain the best system performance possible. Author Dmitri Korotkevitch, Microsoft Data Platform MVP and Microsoft Certified Master (MCM), explains the interdependencies between SQL Server database components. You'll learn how to quickly diagnose your system and discover the root cause of any issue. Techniques in this book are compatible with all versions of SQL Server and cover both on-premises and cloud-based SQL Server installations. Discover how performance issues present themselves in SQL Server Learn about SQL Server diagnostic tools, methods, and technologies Perform health checks on SQL Server installations Learn the dependencies between SQL Server components Tune SQL Server to improve performance and reduce bottlenecks Detect poorly optimized queries and inefficiencies in query execution plans Find inefficient indexes and common database design issues Use these techniques with Microsoft Azure SQL databases, Azure SQL Managed Instances, and Amazon RDS for SQL Server

Artificial Intelligence with Power BI

Discover how to enhance your data analysis with 'Artificial Intelligence with Power BI,' a resource designed to teach you how to leverage Power BI's AI capabilities. You will learn practical methods for enriching your analytics with forecasting, anomaly detection, and machine learning, equipping you to create intelligent, insightful BI reports. What this Book will help me do Learn how to apply AI capabilities such as forecasting and anomaly detection to enrich your reports and drive actionable insights. Explore data preparation techniques optimized for AI, ensuring your datasets are structured for advanced analytics. Develop skills to integrate Azure Machine Learning and Cognitive Services into Power BI, expanding your analytical toolset. Understand how to build Q&A interfaces and integrate Natural Language Processing into your BI solutions. Gain expertise in training and deploying your own machine learning models to achieve tailored insights and predictive analytics. Author(s) Diepeveen is an experienced data analyst and Power BI expert with a passion for making advanced analytics accessible to professionals. With years of hands-on experience working in the data analytics field, they deliver insights using intuitive, practical approaches through clear and engaging tutorials. Who is it for? This book is ideal for data analysts and BI developers who aim to expand their analytics capabilities with AI. Readers should already be familiar with Power BI and are looking for a resource to teach them how to incorporate predictive and advanced AI techniques into their reporting workflow. Whether you're seeking to gain a professional edge or enhance your organization's data storytelling and insights, this guide is perfect for you.

Microsoft Power BI Performance Best Practices

"Microsoft Power BI Performance Best Practices" is a thorough guide to operating Power BI solutions efficiently. This book walks you through optimizing every layer of a Power BI project, from data transformations to architecture, equipping you with the ability to create robust and scalable analytics solutions. What this Book will help me do Understand how to set realistic performance goals for Power BI projects and implement ongoing performance monitoring. Apply effective architectural and configuration strategies to improve Power BI solution efficiency. Learn practices for constructing and optimizing data models and implementing Row-Level Security effectively. Utilize tools like DAX Studio and VertiPaq Analyzer to detect and resolve common performance bottlenecks. Gain deep knowledge of Power BI Premium and techniques for handling large-scale data solutions using Azure. Author(s) Bhavik Merchant is a recognized expert in business intelligence and analytics solutions. With extensive experience in designing and implementing Power BI solutions across industries, he brings a pragmatic approach to solving performance issues in Power BI. Bhavik's writing style reflects his passion for teaching, ensuring readers gain practical knowledge they can directly apply to their work. Who is it for? This book is designed for data analysts, BI developers, and data professionals who have foundational knowledge of Power BI and aim to elevate their skills to construct high-performance analytics solutions. It is particularly suited to individuals seeking guidance on best practices and tools for optimizing Power BI applications.

Reproducible Data Science with Pachyderm

Dive into the world of reproducible data science with Pachyderm, a specialized platform designed for version-controlled data pipelines. By following this book, 'Reproducible Data Science with Pachyderm,' you'll gain the skills to implement robust, scalable machine learning workflows with Pachyderm 2.0, covering setup, integration, and advanced use cases. What this Book will help me do Build scalable, version-controlled data pipelines with Pachyderm's unique features. Understand the principles behind reproducible data science and implement them effectively. Deploy Pachyderm on AWS, Google Cloud, and Azure while integrating with popular tools. Create and manage end-to-end machine learning workflows, including hyperparameter tuning. Leverage advanced integrations, such as Pachyderm Notebooks and language clients like Python and Go. Author(s) Svetlana Karslioglu is a seasoned data scientist with extensive experience in constructing scalable machine learning and data processing systems. With years in both practical implementation and educational endeavors, she has a talent for breaking down complex concepts into accessible learning paths. Her approach is hands-on and results-oriented, aimed at empowering professionals to excel in the field of data science. Who is it for? This book is intended for data scientists, machine learning engineers, and data engineers who are keen to ensure reproducibility in their workflows. Ideal readers may have familiarity with data science basics and some exposure to Kubernetes and programming languages like Python. By studying the book, learners will establish confidence in implementing Pachyderm for scalable and reliable data pipelines.

Data Lakehouse in Action

"Data Lakehouse in Action" provides a comprehensive exploration of the Data Lakehouse architecture, a modern solution for scalable and effective large-scale analytics. This book guides you through understanding the principles and components of the architecture, and its implementation using cloud platforms like Azure. Learn the practical techniques for designing robust systems tailored to organizational needs and maturity. What this Book will help me do Understand the evolution and need for modern data architecture patterns like Data Lakehouse. Learn how to design systems for data ingestion, storage, processing, and serving in a Data Lakehouse. Develop best practices for data governance and security in the Data Lakehouse architecture. Discover various analytics workflows enabled by the Data Lakehouse, including real-time and batch approaches. Implement practical Data Lakehouse patterns on a cloud platform, and integrate them with macro-patterns such as Data Mesh. Author(s) Pradeep Menon is a seasoned data architect and engineer with extensive experience implementing data analytics solutions for leading companies. With a penchant for simplifying complex architectures, Pradeep has authored several technical publications and frequently shares his expertise at industry conferences. His hands-on approach and passion for teaching shine through in his practical guides. Who is it for? This book is ideal for data professionals including architects, engineers, and data strategists eager to enhance their knowledge in modern analytics platforms. If you have a basic understanding of data architecture and are curious about implementing systems governed by the Data Lakehouse paradigm, this book is for you. It bridges foundational concepts with advanced practices, making it suitable for learners aiming to contribute effectively to their organization's analytics efforts.

Azure Data Engineer Associate Certification Guide

The "Azure Data Engineer Associate Certification Guide" is a comprehensive resource tailored for professionals preparing for the DP-203 exam. This book not only equips you with the theoretical knowledge needed to pass the certification but also provides hands-on experience with Azure's data engineering services. By the end of the book, you'll feel confident in tackling the certification exam and applying these skills on the job. What this Book will help me do Understand the core concepts of Azure data engineering and their practical applications. Gain proficiency in designing and deploying data storage and processing solutions using Azure services. Develop expertise in securing, monitoring, and optimizing Azure data solutions. Prepare effectively for the DP-203 certification exam with sample questions and practical exercises. Acquire skills to contribute to and excel in real-world Azure Data Engineering projects. Author(s) Alex is a seasoned data engineer and cloud computing expert with years of experience designing, implementing, and optimizing data solutions. They have spent significant time working with Azure's ecosystem and have crafted this guide to share their insights and best practices. With a passion for teaching and mentoring, they aim to make complex technical concepts accessible to learners. Who is it for? This book caters to data engineering professionals aiming to achieve the DP-203 Azure Data Engineer Associate certification and advance their careers. It's ideal for individuals with fundamental knowledge of cloud-based data solutions and databases, seeking specialized expertise in Azure's data engineering tools. Whether you're upskilling or transitioning to a cloud-native environment, this guide serves as the roadmap to success.

Azure Databricks Cookbook

Azure Databricks is a robust analytics platform that leverages Apache Spark and seamlessly integrates with Azure services. In the Azure Databricks Cookbook, you'll find hands-on recipes to ingest data, build modern data pipelines, and perform real-time analytics while learning to optimize and secure your solutions. What this Book will help me do Design advanced data workflows integrating Azure Synapse, Cosmos DB, and streaming sources with Databricks. Gain proficiency in using Delta Tables and Spark for efficient data storage and analysis. Learn to create, deploy, and manage real-time dashboards with Databricks SQL. Master CI/CD pipelines for automating deployments of Databricks solutions. Understand security best practices for restricting access and monitoring Azure Databricks. Author(s) Raj and Jaiswal are experienced professionals in the field of big data and analytics. They are well-versed in implementing Azure Databricks solutions for real-world problems. Their collaborative writing approach ensures clarity and practical focus. Who is it for? This book is tailored for data engineers, scientists, and big data professionals who want to apply Azure Databricks and Apache Spark to their analytics workflows. A basic familiarity with Spark and Azure is recommended to make the best use of the recipes provided. If you're looking to scale and optimize your analytics pipelines, this book is for you.

Data Engineering on Azure

Build a data platform to the industry-leading standards set by Microsoft’s own infrastructure. In Data Engineering on Azure you will learn how to: Pick the right Azure services for different data scenarios Manage data inventory Implement production quality data modeling, analytics, and machine learning workloads Handle data governance Use DevOps to increase reliability Ingest, store, and distribute data Apply best practices for compliance and access control Data Engineering on Azure reveals the data management patterns and techniques that support Microsoft’s own massive data infrastructure. Author Vlad Riscutia, a data engineer at Microsoft, teaches you to bring engineering rigor to your data platform and ensure that your data prototypes function just as well under the pressures of production. You'll implement common data modeling patterns, stand up cloud-native data platforms on Azure, and get to grips with DevOps for both analytics and machine learning. About the Technology Build secure, stable data platforms that can scale to loads of any size. When a project moves from the lab into production, you need confidence that it can stand up to real-world challenges. This book teaches you to design and implement cloud-based data infrastructure that you can easily monitor, scale, and modify. About the Book In Data Engineering on Azure you’ll learn the skills you need to build and maintain big data platforms in massive enterprises. This invaluable guide includes clear, practical guidance for setting up infrastructure, orchestration, workloads, and governance. As you go, you’ll set up efficient machine learning pipelines, and then master time-saving automation and DevOps solutions. The Azure-based examples are easy to reproduce on other cloud platforms.
What's Inside Data inventory and data governance Assure data quality, compliance, and distribution Build automated pipelines to increase reliability Ingest, store, and distribute data Production-quality data modeling, analytics, and machine learning About the Reader For data engineers familiar with cloud computing and DevOps. About the Author Vlad Riscutia is a software architect at Microsoft. Quotes A definitive and complete guide on data engineering, with clear and easy-to-reproduce examples. - Kelum Prabath Senanayake, Echoworx An all-in-one Azure book, covering all a solutions architect or engineer needs to think about. - Albert Nogués, Danone A meaningful journey through the Azure ecosystem. You’ll be building pipelines and joining components quickly! - Todd Cook, Appen A gateway into the world of Azure for machine learning and DevOps engineers. - Krzysztof Kamyczek, Luxoft

The Definitive Guide to Azure Data Engineering: Modern ELT, DevOps, and Analytics on the Azure Cloud Platform

Build efficient and scalable batch and real-time data ingestion pipelines, DevOps continuous integration and deployment pipelines, and advanced analytics solutions on the Azure Data Platform. This book teaches you to design and implement robust data engineering solutions using Data Factory, Databricks, Synapse Analytics, Snowflake, Azure SQL database, Stream Analytics, Cosmos database, and Data Lake Storage Gen2. You will learn how to engineer your use of these Azure Data Platform components for optimal performance and scalability. You will also learn to design self-service capabilities to maintain and drive the pipelines and your workloads. The approach in this book is to guide you through a hands-on, scenario-based learning process that will empower you to promote digital innovation best practices while you work through your organization’s projects, challenges, and needs. The clear examples enable you to use this book as a reference and guide for building data engineering solutions in Azure. After reading this book, you will have a far stronger skill set and confidence level in getting hands on with the Azure Data Platform. 
What You Will Learn Build dynamic, parameterized ELT data ingestion orchestration pipelines in Azure Data Factory Create data ingestion pipelines that integrate control tables for self-service ELT Implement a reusable logging framework that can be applied to multiple pipelines Integrate Azure Data Factory pipelines with a variety of Azure data sources and tools Transform data with Mapping Data Flows in Azure Data Factory Apply Azure DevOps continuous integration and deployment practices to your Azure Data Factory pipelines and development SQL databases Design and implement real-time streaming and advanced analytics solutions using Databricks, Stream Analytics, and Synapse Analytics Get started with a variety of Azure data services through hands-on examples Who This Book Is For Data engineers and data architects who are interested in learning architectural and engineering best practices around ELT and ETL on the Azure Data Platform, those who are creating complex Azure data engineering projects and are searching for patterns of success, and aspiring cloud and data professionals involved in data engineering, data governance, continuous integration and deployment of DevOps practices, and advanced analytics who want a full understanding of the many different tools and technologies that Azure Data Platform provides

Data Modeling for Azure Data Services

Data Modeling for Azure Data Services is an essential guide that delves into the intricacies of designing, provisioning, and implementing robust data solutions within the Azure ecosystem. Through practical examples and hands-on exercises, this book equips you with the knowledge to create scalable, performant, and adaptable database designs tailored to your business needs. What this Book will help me do Understand and apply normalization, dimensional modeling, and data vault modeling for relational databases. Learn to provision and implement scalable solutions like Azure SQL DB and Azure Synapse SQL Pool. Master how to design and model a Data Lake using Azure Storage efficiently. Gain expertise in NoSQL database modeling and implementing solutions using Azure Cosmos DB. Develop ETL/ELT processes effectively using Azure Data Factory to support data integration workflows. Author(s) Braake brings a wealth of expertise as a data architect and cloud solutions builder specializing in Azure's data services. With hands-on experience in projects requiring sophisticated data modeling and optimization, Braake crafts detailed learning material to help professionals level up their database design and Azure deployment skills. Dedicated to explaining complex topics with clarity and approachable language, Braake ensures that learners gain not just knowledge but applied competence. Who is it for? This book is a valuable resource for business intelligence developers, data architects, and consultants aiming to refine their skills in data modeling within modern cloud ecosystems, particularly Microsoft Azure. Whether you're a beginner with some foundational cloud data management knowledge or an experienced professional seeking to deepen your Azure data services proficiency, this book caters to your learning needs.

SQL Server on Kubernetes: Designing and Building a Modern Data Platform

Build a modern data platform by deploying SQL Server in Kubernetes. Modern application deployment needs to be fast and consistent to keep up with business objectives, and Kubernetes is quickly becoming the standard for deploying container-based applications, fast. This book introduces Kubernetes and its core concepts. Then it shows you how to build and interact with a Kubernetes cluster. Next, it goes deep into deploying and operationalizing SQL Server in Kubernetes, both on premises and in cloud environments such as the Azure Cloud. You will begin with container-based application fundamentals and then go into an architectural overview of a Kubernetes container and how it manages application state. Then you will learn the hands-on skill of building a production-ready cluster. With your cluster up and running, you will learn how to interact with your cluster and perform common administrative tasks. Once you can administer the cluster, you will learn how to deploy applications and SQL Server in Kubernetes. You will learn about high-availability options, and about using Azure Arc-enabled Data Services. By the end of this book, you will know how to set up a Kubernetes cluster, manage a cluster, deploy applications and databases, and keep everything up and running. What You Will Learn Understand Kubernetes architecture and cluster components Deploy your applications into Kubernetes clusters Manage your containers programmatically through API objects and controllers Deploy and operationalize SQL Server in Kubernetes Implement high-availability SQL Server scenarios on Kubernetes using Azure Arc-enabled Data Services Make use of Kubernetes deployments for Big Data Clusters Who This Book Is For DBAs and IT architects who are ready to begin planning their next-generation data platform and want to understand what it takes to run SQL Server in a container in Kubernetes.
SQL Server on Kubernetes is an excellent choice for those who want to understand the big picture of why Kubernetes is the next-generation deployment method for SQL Server but also want to understand the internals, or the how, of deploying SQL Server in Kubernetes. When finished with this book, you will have the vision and skills to successfully architect, build and maintain a modern data platform deploying SQL Server on Kubernetes.

Advanced Analytics with Transact-SQL: Exploring Hidden Patterns and Rules in Your Data

Learn about business intelligence (BI) features in T-SQL and how they can help you with data science and analytics efforts without the need to bring in other languages such as R and Python. This book shows you how to compute statistical measures using your existing skills in T-SQL. You will learn how to calculate descriptive statistics, including centers, spreads, skewness, and kurtosis of distributions. You will also learn to find associations between pairs of variables, including calculating linear regression formulas and confidence levels with definite integration. No analysis is good without data quality. Advanced Analytics with Transact-SQL introduces data quality issues and shows you how to check for completeness and accuracy, and measure improvements in data quality over time. The book also explains how to optimize queries involving temporal data, such as when you search for overlapping intervals. More advanced time-oriented information in the book includes hazard and survival analysis. Forecasting with exponential moving averages and autoregression is covered as well. Every web/retail shop wants to know the products customers tend to buy together. Trying to predict the target discrete or continuous variable with few input variables is important for practically every type of business. This book helps you understand data science and the advanced algorithms used to analyze data, and terms such as data mining, machine learning, and text mining. Key to many of the solutions in this book are T-SQL window functions. Author Dejan Sarka demonstrates efficient statistical queries that are based on window functions and optimized through algorithms built using mathematical knowledge and creativity. The formulas and usage of those statistical procedures are explained so you can understand and modify the techniques presented. T-SQL is supported in SQL Server, Azure SQL Database, and Azure Synapse Analytics.
There are so many BI features in T-SQL that it might become your primary analytic database language. If you want to learn how to get information from your data with the T-SQL language that you already are familiar with, then this is the book for you. What You Will Learn Describe distribution of variables with statistical measures Find associations between pairs of variables Evaluate the quality of the data you are analyzing Perform time-series analysis on your data Forecast values of a continuous variable Perform market-basket analysis to predict customer purchasing patterns Predict target variable outcomes from one or more input variables Categorize passages of text by extracting and analyzing keywords Who This Book Is For Database developers and database administrators who want to translate their T-SQL skills into the world of business intelligence (BI) and data science. For readers who want to analyze large amounts of data efficiently by using their existing knowledge of T-SQL and Microsoft’s various database platforms such as SQL Server and Azure SQL Database. Also for readers who want to improve their querying by learning new and original optimization techniques.