SQL

Best practices to maximize the availability of your Cloud SQL databases

2024-04-09 · Google Cloud Next '24

session

by Patrick Kirby (Workday) , Rahul Deshmukh (Google Cloud)

Cloud Computing GCP

Customers use Cloud SQL to run their business-critical applications. In this session, we will give you a comprehensive understanding of the various capabilities of Cloud SQL and best practices to maximize business continuity for the applications. The session will deep dive into Enterprise Plus edition features, how Cloud SQL achieves near-zero downtime maintenance, behaviors that affect availability and mitigations, all of which will prepare you to be an expert in configuring and monitoring Cloud SQL for maximum availability.

Core Infrastructure l

2024-04-09 · Google Cloud Next '24

session

by Eoin Carroll (Google Cloud)

BigQuery Cloud Computing GCP Virtual Machine

In this game you will create and manage permissions for Google Cloud resources, run structured queries on BigQuery and Cloud SQL, create several VPC networks and VM instances and test connectivity across networks, and monitor a Google Compute Engine VM instance with Cloud Monitoring.

Data Analytics & Visualization All-in-One For Dummies

2024-04-09 · O'Reilly Data Visualization Books O'Reilly Amazon

book

by Alan R. Simon , Lillian Pierson , John Paul Mueller , Jack A. Hyman , Jonathan Reichental , Luca Massaron , Joseph Schmuller , Allen G. Taylor , Paul McFedries

Analytics BI Data Analytics Microsoft Power BI Python Tableau data data-science data-science-tasks data-visualization

Install data analytics into your brain with this comprehensive introduction.

Data Analytics & Visualization All-in-One For Dummies collects the essential information on mining, organizing, and communicating data, all in one place. Clocking in at around 850 pages, this tome of a reference delivers eight books in one, so you can build a solid foundation of knowledge in data wrangling. Data analytics professionals are highly sought after these days, and this book will put you on the path to becoming one. You’ll learn all about sources of data like data lakes, and you’ll discover how to extract data using tools like Microsoft Power BI, organize the data in Microsoft Excel, and visually present the data in a way that makes sense using Tableau. You’ll even get an intro to the Python, R, and SQL coding needed to take your data skills to a new level. With this Dummies guide, you’ll be well on your way to becoming a priceless data jockey.

Mine data from data sources

Organize and analyze data

Use data to tell a story with Tableau

Expand your know-how with Python and R

New and novice data analysts will love this All-in-One reference on how to make sense of data. Get ready to watch as your career in data takes off.

Establish A Single Source Of Truth For Your Data Consumers With A Semantic Layer

2024-04-07 · Data Engineering Podcast Listen

podcast_episode

by Artyom Keydunov (Cube Dev) , Tobias Macey

AI/ML Analytics BI Cloud Computing Dagster Data Engineering Data Lake Data Lakehouse Data Management Data Quality Datafold dbt +5 more

Summary

Maintaining a single source of truth for your data is the biggest challenge in data engineering. Different roles and tasks in the business need their own ways to access and analyze the data in the organization. In order to enable this use case, while maintaining a single point of access, the semantic layer has evolved as a technological solution to the problem. In this episode Artyom Keydunov, creator of Cube, discusses the evolution and applications of the semantic layer as a component of your data platform, and how Cube provides speed and cost optimization for your data consumers.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

This episode is brought to you by Datafold – a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication. Leverage Datafold's fast cross-database data diffing and Monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale, and receive alerts about any discrepancies. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold.

Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!

Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

Your host is Tobias Macey and today I'm interviewing Artyom Keydunov about the role of the semantic layer in your data platform.

Interview

Introduction

How did you get involved in the area of data management?

Can you start by outlining the technical elements of what it means to have a "semantic layer"?

In the past couple of years there was a rapid hype cycle around the "metrics layer" and "headless BI", which has largely faded. Can you give your assessment of the current state of the industry around the adoption/implementation of these concepts?

What are the benefits of having a discrete service that offers the business metrics/semantic mappings as opposed to implementing those concepts as part of a more general system? (e.g. dbt, BI, warehouse marts, etc.)

  At what point does it become necessary/beneficial for a team to adopt such a service?

What are the challenges involved in retrofitting a semantic layer into a production data system?

  evolution of requirements/usage patterns

  technical complexities/performance and cost optimization

What are the most interesting, innovative, or unexpected ways that you have seen Cube used?

What are the most interesting, unexpec

Adding Anomaly Detection And Observability To Your dbt Projects Is Elementary

2024-03-31 · Data Engineering Podcast Listen

podcast_episode

by Maayan Salom (Elementary) , Tobias Macey

AI/ML Analytics Cloud Computing Dagster Data Engineering Data Lake Data Lakehouse Data Management Data Quality Datafold dbt Delta +4 more

Summary

Working with data is a complicated process, with numerous chances for something to go wrong. Identifying and accounting for those errors is a critical piece of building trust in the organization that your data is accurate and up to date. While there are numerous products available to provide that visibility, they all have different technologies and workflows that they focus on. To bring observability to dbt projects the team at Elementary embedded themselves into the workflow. In this episode Maayan Salom explores the approach that she has taken to bring observability, enhanced testing capabilities, and anomaly detection into every step of the dbt developer experience.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Your host is Tobias Macey and today I'm interviewing Maayan Salom about how to incorporate observability into a dbt-oriented workflow and how Elementary can help.

Interview

Introduction

How did you get involved in the area of data management?

Can you start by outlining what elements of observability are most relevant for dbt projects?

What are some of the common ad-hoc/DIY methods that teams develop to acquire those insights?

  What are the challenges/shortcomings associated with those approaches?

Over the past ~3 years there were numerous data observability systems/products created. What are some of the ways that the specifics of dbt workflows are not covered by those generalized tools?

  What are the insights that can be more easily generated by embedding into the dbt toolchain and development cycle?

Can you describe what Elementary is and how it is designed to enhance the development and maintenance work in dbt projects?

How is Elementary designed/implemented?

  How have the scope and goals of the project changed since you started working on it?

  What are the engineering ch

#42 Unraveling the Fabric of Data: Microsoft's Ecosystem and Beyond

2024-03-25 · DataTopics: All Things Data, AI & Tech Listen

podcast_episode

by Sam Debruyn (Microsoft)

AI/ML Azure BI Databricks dbt LLM Microsoft Fabric Power BI Synapse

Welcome to the cozy corner of the tech world where ones and zeros mingle with casual chit-chat. Datatopics Unplugged is your go-to spot for relaxed discussions around tech, news, data, and society. Dive into conversations that should flow as smoothly as your morning coffee (but don't), where industry insights meet laid-back banter. Whether you're a data aficionado or just someone curious about the digital age, pull up a chair, relax, and let's get into the heart of data, unplugged style! In this episode #42, titled "Unraveling the Fabric of Data: Microsoft's Ecosystem and Beyond," we're joined once again by the tech maestro and newly minted Microsoft MVP, Sam Debruyn. Sam brings to the table a bevy of updates from his recent accolades to the intricacies of Microsoft's data platforms and the world of SQL.

Biz Buzz: From Reddit's IPO to the performance versus utility debate in database selection, we dissect the big moves shaking up the business side of tech. Read about Reddit's IPO.

Microsoft's Fabric Unraveled: Get the lowdown on Microsoft's Fabric, the one-stop AI platform, as Sam Debruyn gives us a deep dive into its capabilities and integration with Azure Databricks and Power BI. Discover more about Fabric and dive into Sam's blog.

dbt Developments: Sam talks dbt and the exciting new SQL tool for data pipeline building with upcoming unit testing capabilities.

Polaris Project: Delving into Microsoft's internal storage projects, including insights on Polaris and its integration with Synapse SQL. Read the paper here.

AI Advances: From the release of Grok-1 and Apple's MM1 AI model to GPT-4's trillion parameters, we discuss the leaps in artificial intelligence.

Stability in Motion: After OpenAI's Sora, we look at Stability AI's new venture into motion with Stable Video. Check out Stable Video.

Benchmarking Debate: A critical look at performance benchmarks in database selection and the ongoing search for the 'best' database. Contemplate benchmarking perspectives.

Versioning Philosophy: Hot takes on semantic versioning and what stability really means in software development. Dive into Semantic Versioning.

Ship Smarter Not Harder With Declarative And Collaborative Data Orchestration On Dagster+

2024-03-24 · Data Engineering Podcast Listen

podcast_episode

by Pete Hunt (Dagster Labs) , Tobias Macey

AI/ML Analytics Cloud Computing Dagster Data Engineering Data Lake Data Lakehouse Data Management Delta Hudi Iceberg Modern Data Stack +3 more

Summary

A core differentiator of Dagster in the ecosystem of data orchestration is its focus on software-defined assets as a means of building declarative workflows. With the launch of Dagster+ as the redesigned commercial companion to the open source project, the team is investing in that capability with a suite of new features. In this episode Pete Hunt, CEO of Dagster Labs, outlines these new capabilities, how they reduce the burden on data teams, and the increased collaboration that they enable across teams and business units.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Your host is Tobias Macey and today I'm interviewing Pete Hunt about how the launch of Dagster+ will level up your data platform and orchestrate across language platforms.

Interview

Introduction

How did you get involved in the area of data management?

Can you describe what the focus of Dagster+ is and the story behind it?

  What problems are you trying to solve with Dagster+?

What are the notable enhancements beyond the Dagster Core project that this updated platform provides?

  How is it different from the current Dagster Cloud product?

In the launch announcement you tease new capabilities that would be great to explore in turn:

  Make data a team sport, enabling data teams across the organization

  Deliver reliable, high quality data the organization can trust

  Observe and manage data platform costs

  Master the heterogeneous collection of technologies, both traditional and Modern Data Stack

What are the business/product goals that you are focused on improving with the launch of Dagster+?

What are the most interesting, innovative, or unexpected ways that you have seen Dagster used?

What are the most interesting, unexpected, or challenging lessons that you have learned while working on the design and launch of Dagster+?

When is Dagster+ the wrong choice?

What do you have planned for the future of Dagster/Dagster Cloud/Dagster+?

Contact Info

Twitter LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If y

Azure Data Factory by Example: Practical Implementation for Data Engineers

2024-03-22 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Richard Swinbank

Analytics Azure ADF Cloud Computing DWH ETL/ELT Microsoft Synapse data data-engineering data-lake storage-repositories

Data engineers who need to hit the ground running will use this book to build skills in Azure Data Factory v2 (ADF). The tutorial-first approach to ADF taken in this book gets you working from the first chapter, explaining key ideas naturally as you encounter them. From creating your first data factory to building complex, metadata-driven nested pipelines, the book guides you through essential concepts in Microsoft’s cloud-based ETL/ELT platform. It introduces components indispensable for the movement and transformation of data in the cloud. Then it demonstrates the tools necessary to orchestrate, monitor, and manage those components.

This edition, updated for 2024, includes the latest developments to the Azure Data Factory service:

Enhancements to existing pipeline activities such as Execute Pipeline, along with the introduction of new activities such as Script, and activities designed specifically to interact with Azure Synapse Analytics.

Improvements to flow control provided by activity deactivation and the Fail activity.

The introduction of reusable data flow components such as user-defined functions and flowlets.

Extensions to integration runtime capabilities including Managed VNet support.

The ability to trigger pipelines in response to custom events.

Tools for implementing boilerplate processes such as change data capture and metadata-driven data copying.

What You Will Learn

Create pipelines, activities, datasets, and linked services

Build reusable components using variables, parameters, and expressions

Move data into and around Azure services automatically

Transform data natively using ADF data flows and Power Query data wrangling

Master flow-of-control and triggers for tightly orchestrated pipeline execution

Publish and monitor pipelines easily and with confidence

Who This Book Is For

Data engineers and ETL developers taking their first steps in Azure Data Factory, SQL Server Integration Services users making the transition toward doing ETL in Microsoft’s Azure cloud, and SQL Server database administrators involved in data warehousing and ETL operations

102: Exposing How Alex The Analyst Became a Data Analyst (And The Most Popular Data YouTuber)

2024-03-21 · Data Career Podcast: Helping You Land a Data Analyst Job FAST Listen

podcast_episode

by Avery Smith , Alex Freberg

AI/ML Analytics Data Analytics Python

Hear the story of Alex The Analyst like you've never heard it before. In this episode, Avery Smith sits down with Alex Freberg, more commonly known as Alex the Analyst, to discuss his journey from no technical background to data analyst superstar.

They talk about Alex's journey from a recreational therapy degree to learning what data analytics is. They also cover what matters most when getting hired as a data analyst. Is it technical skills like SQL and Python? Or is it something much simpler?

Connect with Alex the Analyst:

🤝 Follow on Linkedin

▶️ Subscribe on Youtube

🎒 Learn About Analyst Builder

✉️ Discover what we wish we knew about landing the dream job

🤖 Data Analytics Answers At Your Finger Tips

🤝 Ace your data analyst interview with the interview simulator

📩 Get my weekly email with helpful data career tips

📊 Come to my next free “How to Land Your First Data Job” training

🏫 Check out my 10-week data analytics bootcamp

Timestamps:

(6:01) Alex's Data Career Journey

(11:50) Alex's First Portfolio

(17:53) Alex's Advice on Getting Hired & Interviews

(27:10) How to Become an Analyst in 7 Days

Connect with Avery:

📺 Subscribe on YouTube

🎙Listen to My Podcast

👔 Connect with me on LinkedIn

📸 Instagram

🎵 TikTok

Reconciling The Data In Your Databases With Datafold

2024-03-17 · Data Engineering Podcast Listen

podcast_episode

by Gleb Mezhanskiy (Datafold) , Tobias Macey

AI/ML Analytics Cloud Computing Dagster Data Engineering Data Lake Data Lakehouse Data Management Data Science Datafold Delta Hudi +3 more

Summary

A significant portion of data workflows involve storing and processing information in database engines. Validating that the information is stored and processed correctly can be complex and time-consuming, especially when the source and destination speak different dialects of SQL. In this episode Gleb Mezhanskiy, founder and CEO of Datafold, discusses the different error conditions and solutions that you need to know about to ensure the accuracy of your data.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Join us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit dataengineeringpodcast.com/data-council and use code dataengpod20 to register today!

Your host is Tobias Macey and today I'm welcoming back Gleb Mezhanskiy to talk about how to reconcile data in database environments.

Interview

Introduction

How did you get involved in the area of data management?

Can you start by outlining some of the situations where reconciling data between databases is needed?

What are examples of the error conditions that you are likely to run into when duplicating information between database engines?

  When these errors do occur, what are some of the problems that they can cause?

When teams are replicating data between database engines, what are some of the common patterns for managing those flows?

  How does that change between continual and one-time replication?

What are some of the steps involved in verifying the integrity of data replication between database engines?

If the source or destination isn't a traditional database engine (e.g. data lakehouse) how does that change the work involved in verifying the success of the replication?

What are the challenges of validating and reconciling data?

  Sheer scale and cost of pulling data out, have to do in-place

  Performance. Pushing databases to the limit,

Version Your Data Lakehouse Like Your Software With Nessie

2024-03-10 · Data Engineering Podcast Listen

podcast_episode

by Alex Merced (Dremio) , Tobias Macey

AI/ML Analytics Cloud Computing Dagster Data Engineering Data Lake Data Lakehouse Data Management Data Science Delta Dremio Git +4 more

Summary

Data lakehouse architectures are gaining popularity due to the flexibility and cost effectiveness that they offer. The link that bridges the gap between data lake and warehouse capabilities is the catalog. The primary purpose of the catalog is to inform the query engine of what data exists and where, but the Nessie project aims to go beyond that simple utility. In this episode Alex Merced explains how the branching and merging functionality in Nessie allows you to use the same versioning semantics for your data lakehouse that you are used to from Git.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Your host is Tobias Macey and today I'm interviewing Alex Merced, developer advocate at Dremio and co-author of the upcoming book from O'Reilly, "Apache Iceberg: The Definitive Guide", about Nessie, a git-like versioned catalog for data lakes using Apache Iceberg.

Interview

Introduction

How did you get involved in the area of data management?

Can you describe what Nessie is and the story behind it?

What are the core problems/complexities that Nessie is designed to solve?

The closest analogue to Nessie that I've seen in the ecosystem is LakeFS. What are the features that would lead someone to choose one or the other for a given use case?

Why would someone choose Nessie over native table-level branching in the Apache Iceberg spec?

How do the versioning capabilities compare to/augment the data versioning in Iceberg?

What are some of the sources of, and challenges in resolving, merge conflicts between table branches?

Can you describe the architecture of Nessie?

How have the design and goals of the project changed since it was first created?

What is involved

AI's Impact in the World of Structured Data Analytics (w/ Juan Sequeda, data.world)

2024-03-10 · The Analytics Engineering Podcast Listen

podcast_episode

by Juan Sequeda (data.world)

AI/ML Analytics Analytics Engineering Data Analytics dbt

Juan Sequeda is a principal data scientist and head of the AI Lab at data.world, and is also the co-host of the fantastic data podcast Catalog and Cocktails. This episode tackles semantics, semantic web, Juan's research in how raw text-to-SQL performs versus text-to-semantic layer, and where we both believe AI will make an impact in the world of structured data analytics. For full show notes and to read 6+ years of back issues of the podcast's companion newsletter, head to https://roundup.getdbt.com. The Analytics Engineering Podcast is sponsored by dbt Labs.

When And How To Conduct An AI Program

2024-03-03 · Data Engineering Podcast Listen

podcast_episode

by Colleen Tartow (Starburst Data) , Tobias Macey

AI/ML Analytics Cloud Computing Dagster Data Engineering Data Lake Data Lakehouse Data Management Data Science Delta Hudi Iceberg +2 more

Summary

Artificial intelligence technologies promise to revolutionize business and produce new sources of value. In order to make those promises a reality there is a substantial amount of strategy and investment required. Colleen Tartow has worked across all stages of the data lifecycle, and in this episode she shares her hard-earned wisdom about how to conduct an AI program for your organization.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!

Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

Join us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI.
Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit dataengineeringpodcast.com/data-council and use code dataengpod20 to register today! Your host is Tobias Macey and today I'm interviewing Colleen Tartow about the questions to answer before and during the development of an AI program

Interview

Introduction How did you get involved in the area of data management? When you say "AI Program", what are the organizational, technical, and strategic elements that it encompasses?

How does the idea of an "AI Program" differ from an "AI Product"? What are some of the signals to watch for that indicate an objective for which AI is not a reasonable solution?

Who needs to be involved in the process of defining and developing that program?

What are the skills and systems that need to be in place to effectively execute on an AI program?

"AI" has grown to be an even more overloaded term than it already was. What are some of the useful clarifying/scoping questions to address when deciding the path to deployment for different definitions of "AI"? Organizations can easily fall into the trap of green-lighting an AI project before they have done the work of ensuring they have the necessary data and the ability to process it. What are the steps to take to build confidence in the availability of the data?

Even if you are sure that you can get the data, what are t

Cracking the Data Science Interview

2024-02-29 · O'Reilly Data Science Books O'Reilly Amazon

book

by Aaren Stubberfield , Leondra R. Gonzalez

AI/ML Bash Data Science Git Python data data-science

"Cracking the Data Science Interview" is your ultimate resource for preparing for roles in the competitive field of data science. With this book, you'll explore essential topics such as Python, SQL, statistics, and machine learning, as well as learn practical skills for building portfolios and acing interviews. Follow its guidance and you'll be equipped to stand out in any data science interview.

What this Book will help me do

Confidently explain complex statistical and machine learning concepts. Develop models and deploy them while ensuring version control and efficiency. Learn and apply scripting skills in shell and Bash for productivity. Master Git workflows to handle collaborative coding in projects. Perfectly tailor portfolios and resumes to land data science opportunities.

Author(s)

Leondra R. Gonzalez, with years of data science and mentorship experience, co-authors this book with Aaren Stubberfield, a seasoned expert in technology and machine learning. Together, they integrate their expertise to provide practical advice for navigating the data science job market.

Who is it for?

If you're preparing for data science interviews, this book is for you. It's ideal for candidates with a foundational knowledge of Python, SQL, and statistics looking to refine and expand their technical and professional skills. Professionals transitioning into data science will also find it invaluable for building confidence and succeeding in this rewarding field.

Learn Microsoft Fabric

2024-02-29 · O'Reilly Data Science Books O'Reilly Amazon

book

by Arshad Ali , Bradley Schacht

AI/ML Analytics Data Analytics Data Science Microsoft Fabric Cyber Security Spark analytics-platforms data data-science microsoft-fabric

Dive into the wonders of Microsoft Fabric, the ultimate solution for mastering data analytics in the AI era. Through engaging real-world examples and hands-on scenarios, this book will equip you with all the tools to design, build, and maintain analytics systems for various use cases like lakehouses, data warehouses, real-time analytics, and data science.

What this Book will help me do

Understand and utilize the key components of Microsoft Fabric for modern analytics. Build scalable and efficient data analytics solutions with medallion architecture. Implement real-time analytics and machine learning models to derive actionable insights. Monitor and administer your analytics platform for high performance and security. Leverage AI-powered assistant Copilot to boost analytics productivity.

Author(s)

Arshad Ali and Bradley Schacht bring years of expertise in data analytics and system architecture to this book. Arshad is a seasoned professional specialized in AI-integrated analytics platforms, while Bradley Schacht has a proven track record in deploying enterprise data solutions. Together, they provide deep insights and practical knowledge with a structured and approachable teaching method.

Who is it for?

Ideal for data professionals such as data analysts, engineers, scientists, and AI/ML experts aiming to enhance their data analytics skills and master Microsoft Fabric. It's also suited for students and new entrants to the field looking to establish a firm foundation in analytics systems. Requires a basic understanding of SQL and Spark.

Learn T-SQL Querying - Second Edition

2024-02-29 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Pam Lahoud , Pedro Lopes

Azure Microsoft SQL Server data data-engineering

Troubleshoot query performance issues, identify anti-patterns in your code, and write efficient T-SQL queries with this guide for T-SQL developers.

Key Features

A definitive guide to mastering the techniques of writing efficient T-SQL code
Learn query optimization fundamentals, query analysis, and how query structure impacts performance
Discover insightful solutions to detect, analyze, and tune query performance issues
Purchase of the print or Kindle book includes a free PDF eBook

Book Description

Data professionals seeking to excel in Transact-SQL for Microsoft SQL Server and Azure SQL Database often lack comprehensive resources. Learn T-SQL Querying second edition focuses on indexing queries and crafting elegant T-SQL code, enabling data professionals to gain mastery in modern SQL Server versions (2022) and Azure SQL Database. The book covers new topics like logical statement processing flow, data access using indexes, and best practices for tuning T-SQL queries. Starting with query processing fundamentals, the book lays a foundation for writing performant T-SQL queries. You’ll explore the mechanics of the Query Optimizer and Query Execution Plans, learning to analyze execution plans for insights into current performance and scalability. Using dynamic management views (DMVs) and dynamic management functions (DMFs), you’ll build diagnostic queries. The book covers indexing and delves into SQL Server’s built-in tools to expedite resolution of T-SQL query performance and scalability issues. Hands-on examples will guide you to avoid UDF pitfalls and understand features like predicate SARGability, Query Store, and Query Tuning Assistant. By the end of this book, you’ll have developed the ability to identify query performance bottlenecks, recognize anti-patterns, and avoid pitfalls.

What you will learn

Identify opportunities to write well-formed T-SQL statements
Familiarize yourself with the Cardinality Estimator for query optimization
Create efficient indexes for your existing workloads
Implement best practices for T-SQL querying
Explore Query Execution Dynamic Management Views
Utilize the latest performance optimization features in SQL Server 2017, 2019, and 2022
Safeguard query performance during upgrades to newer versions of SQL Server

Who this book is for

This book is for database administrators, database developers, data analysts, data scientists and T-SQL practitioners who want to master the art of writing efficient T-SQL code and troubleshooting query performance issues through practical examples. A basic understanding of T-SQL syntax, writing queries in SQL Server, and using the SQL Server Management Studio tool will be helpful to get started.

Graph Algorithms for Data Science

2024-02-26 · O'Reilly Data Science Books O'Reilly Amazon

book

by Tomaz Bratanic

AI/ML CSV Data Science NLP data data-science

Practical methods for analyzing your data with graphs, revealing hidden connections and new insights. Graphs are the natural way to represent and understand connected data. This book explores the most important algorithms and techniques for graphs in data science, with concrete advice on implementation and deployment. You don’t need any graph experience to start benefiting from this insightful guide. These powerful graph algorithms are explained in clear, jargon-free text and illustrations that make them easy to apply to your own projects.

In Graph Algorithms for Data Science you will learn:

Labeled-property graph modeling
Constructing a graph from structured data such as CSV or SQL
NLP techniques to construct a graph from unstructured data
Cypher query language syntax to manipulate data and extract insights
Social network analysis algorithms like PageRank and community detection
How to translate graph structure to a ML model input with node embedding models
Using graph features in node classification and link prediction workflows

Graph Algorithms for Data Science is a hands-on guide to working with graph-based data in applications like machine learning, fraud detection, and business data analysis. It’s filled with fascinating and fun projects, demonstrating the ins-and-outs of graphs. You’ll gain practical skills by analyzing Twitter, building graphs with NLP techniques, and much more.

About the Technology

A graph, put simply, is a network of connected data. Graphs are an efficient way to identify and explore the significant relationships naturally occurring within a dataset. This book presents the most important algorithms for graph data science with examples from machine learning, business applications, natural language processing, and more.

About the Book

Graph Algorithms for Data Science shows you how to construct and analyze graphs from structured and unstructured data. In it, you’ll learn to apply graph algorithms like PageRank, community detection/clustering, and knowledge graph models by putting each new algorithm to work in a hands-on data project. This cutting-edge book also demonstrates how you can create graphs that optimize input for AI models using node embedding.

What's Inside

Creating knowledge graphs
Node classification and link prediction workflows
NLP techniques for graph construction

About the Reader

For data scientists who know machine learning basics. Examples use the Cypher query language, which is explained in the book.

About the Author

Tomaž Bratanič works at the intersection of graphs and machine learning. Arturo Geigel was the technical editor for this book.

Quotes

Undoubtedly the quickest route to grasping the practical applications of graph algorithms. Enjoyable and informative, with real-world business context and practical problem-solving. - Roger Yu, Feedzai

Brilliantly eases you into graph-based applications. - Sumit Pal, Independent Consultant

I highly recommend this book to anyone involved in analyzing large network databases. - Ivan Herreros, talentsconnect

Insightful and comprehensive. The author’s expertise is evident. Be prepared for a rewarding journey. - Michal Štefaňák, Volke

Find Out About The Technology Behind The Latest PFAD In Analytical Database Development

2024-02-25 · Data Engineering Podcast Listen

podcast_episode

by Paul Dix (InfluxData) , Tobias Macey

AI/ML Analytics Arrow Cloud Computing Dagster Data Engineering Data Lake Data Lakehouse Data Management Data Science Delta Hudi +4 more

Summary

Building a database engine requires a substantial amount of engineering effort and time investment. Over the decades of research and development into building these software systems there are a number of common components that are shared across implementations. When Paul Dix decided to re-write the InfluxDB engine he found the Apache Arrow ecosystem ready and waiting with useful building blocks to accelerate the process. In this episode he explains how he used the combination of Apache Arrow, Flight, Datafusion, and Parquet to lay the foundation of the newest version of his time-series database.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey and today I'm interviewing Paul Dix about his investment in the Apache Arrow ecosystem and how it led him to create the latest PFAD in database design.

Interview

Introduction How did you get involved in the area of data management? Can you start by describing the FDAP stack and how the components combine to provide a foundational architecture for database engines?

This was the core of your recent re-write of the InfluxDB engine. What were the design goals and constraints that led you to this architecture?

Each of the architectural components are well engineered for their particular scope. What is the engineering work that is involved in building a cohesive platform from those components? One of the major benefits of using open source components is the network effect of ecosystem integrations. That can also be a risk when the community vision for the project doesn't align with your own goals. How have you worked to mitigate that risk in your specific platform? Can you describe the

#38 Open Source AI, SQL Dialects, and New Terminals

2024-02-23 · DataTopics: All Things Data, AI & Tech Listen

podcast_episode

AI/ML Rust

Welcome to the cozy corner of the tech world where ones and zeros mingle with casual chit-chat. Datatopics Unplugged is your go-to spot for relaxed discussions around tech, news, data, and society. Dive into conversations that should flow as smoothly as your morning coffee (but don't), where industry insights meet laid-back banter. Whether you're a data aficionado or just someone curious about the digital age, pull up a chair, relax, and let's get into the heart of data, unplugged style!

In episode #38, "Open Source AI, SQL Dialects, and New Terminals," we've taken a slight detour from our usual live format to bring you an exceptionally pre-recorded session, packed with the same insightful discussions and a touch of geeky humor.

Tested Midjourney Alpha UI: A first look at the new user interface.
When should you give up on a project that doesn't work?: Exploring the fine line between persistence and practicality.
The Open Source AI Definition – draft v. 0.0.5: Navigating the evolving landscape of open-source AI.
Ghostty: Delving into the latest development logs.
SQL standards are like toothbrushes: Discussing the universal challenge of SQL dialects adoption.
Mojo vs. Rust: Comparing performance, Pythonic syntax, and the learning curve.

Remember, this episode was not broadcast live but was exceptionally pre-recorded to maintain our tradition of bringing you the most engaging and relevant tech discussions. Intro music courtesy of fesliyanstudios.com

98: Day in the Life of a Data Analyst With Non-Technical Background w/ Paul Ahlstrom

2024-02-21 · Data Career Podcast: Helping You Land a Data Analyst Job FAST Listen

podcast_episode

by Avery Smith , Paul Ahlstrom

AI/ML Analytics Data Analytics

In this episode of the Data Career Podcast, Avery talks with his childhood friend, Paul Ahlstrom, about his journey into data analytics from a non-technical background.

Paul emphasises the importance of networking, understanding the business, and getting the requirements right at the start.

They also explore the day-to-day life of a data analyst, how to make yourself useful to the business, as well as how to manage senior stakeholders.

Connect with Paul Ahlstrom:

🤝 Connect on LinkedIn

✉️ Discover what we wish we knew about landing the dream job

🤖 Data Analytics Answers At Your Fingertips

🤝 Ace your data analyst interview with the interview simulator

📩 Get my weekly email with helpful data career tips

📊 Come to my next free “How to Land Your First Data Job” training

🏫 Check out my 10-week data analytics bootcamp

Timestamps:

(14:31) - The Importance of Networking in Job Hunting
(22:44) - Understanding User Behavior through Data
(23:12) - The Role of SQL in Data Analysis
(23:25) - Business Use Cases for Data Analysis
(27:55) - The Art of Reporting in Data Analysis
(29:14) - The Importance of Asking the Right Questions
(31:17) - The Role of Communication in Data Analysis
(31:46) - The Power of Iterative Analytics
(39:47) - Understanding the Business Context in Data Analysis

Connect with Avery:

📺 Subscribe on YouTube

🎙Listen to My Podcast

👔 Connect with me on LinkedIn

📸 Instagram

🎵 TikTok

Mentioned in this episode: Join the last cohort of 2025! The LAST cohort of The Data Analytics Accelerator for 2025 kicks off on Monday, December 8th and enrollment is officially open!

To celebrate the end of the year, we’re running a special End-of-Year Sale, where you’ll get: ✅ A discount on your enrollment 🎁 6 bonus gifts, including job listings, interview prep, AI tools + more

If your goal is to land a data job in 2026, this is your chance to get ahead of the competition and start strong.

👉 Join the December Cohort & Claim Your Bonuses: https://www.datacareerjumpstart.com/daa

