talk-data.com

Topic

Cloud Computing

Tags: infrastructure, saas, iaas

4055 tagged activities

Activity Trend: 471 peak/qtr, 2020-Q1 to 2026-Q1

Activities

4055 activities · Newest first

SnowPro™ Core Certification Companion: Hands-on Preparation and Practice

This study companion helps you prepare for the SnowPro Core Certification exam. The author guides your studies so you will not have to tackle the exam by yourself. To help you track your progress, chapters in this book correspond to the exam domains as described on Snowflake’s website. Upon studying the material in this book, you will have solid knowledge that should give you the best shot possible at passing the exam and earning the certification you deserve. Each chapter provides explanations, instructions, guidance, tips, and other information at the level of detail that you need to prepare for the exam. You will not waste your time on unneeded detail or advanced content that is out of scope for the exam. The focus is kept on reviewing the materials and helping you become familiar with the exam content that Snowflake recommends.

This Book Helps You: Review the domains that Snowflake specifically recommends you study in preparation for Exam COF-C02. Identify gaps in your knowledge that you can study and fill in to increase your chances of passing Exam COF-C02. Level up your knowledge even if not taking the exam, so you know the same material as someone who has taken it. Learn how to set up a Snowflake account and configure access according to recommended security best practices. Become capable of loading structured and unstructured data into Snowflake as well as unloading data from Snowflake. Understand how to apply Snowflake data protection features such as cloning, time travel, and fail-safe. Review Snowflake’s data sharing capabilities, including the data marketplace and data exchange.

Who This Book Is For: Those who are planning to take the SnowPro Core Certification COF-C02 exam, and anyone who wishes to gain core expertise in implementing and migrating to the Snowflake Data Cloud.
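Because the exam expects hands-on familiarity, a minimal sketch of two of the data protection features mentioned above (zero-copy cloning and Time Travel) using the snowflake-connector-python client may help; the account, credentials, and table names are placeholders, not values from the book.

```python
import snowflake.connector  # pip install snowflake-connector-python

# Placeholder credentials -- substitute your own account details.
conn = snowflake.connector.connect(
    account="myorg-myaccount",
    user="STUDY_USER",
    password="********",
    warehouse="COMPUTE_WH",
    database="DEMO_DB",
    schema="PUBLIC",
)
cur = conn.cursor()

# Zero-copy clone: creates a writable copy without duplicating storage.
cur.execute("CREATE OR REPLACE TABLE orders_clone CLONE orders")

# Time Travel: query the table as it looked one hour ago.
cur.execute("SELECT COUNT(*) FROM orders AT(OFFSET => -3600)")
print(cur.fetchone())

cur.close()
conn.close()
```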


Let's talk data transformation and what makes a great seller with Co-founder and CEO of Coalesce, Armon Petrossian. As more companies move to the cloud, the weakness in data transformation is becoming profound.

01:15 Meet Armon Petrossian
02:21 The problem in being data driven
04:28 Introducing Coalesce
07:53 Automation defined
09:56 The ETL to ELT shift
11:54 Typical customer use case
12:55 The Coalesce differentiator
16:18 Proof of value
24:39 Column metadata
25:40 Data transformation inhibitors
28:17 The future
31:52 What makes a great seller

LinkedIn: https://www.linkedin.com/in/armonpetrossian/
Website: https://coalesce.io/

Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

Expert Performance Indexing in Azure SQL and SQL Server 2022: Toward Faster Results and Lower Maintenance Both on Premises and in the Cloud

Take a deep dive into perhaps the single most important facet of query performance—indexes—and how to best use them. Newly updated for SQL Server 2022 and Azure SQL, this fourth edition includes new guidance and features related to columnstore indexes, improved and consolidated content on Query Store, deeper content around Intelligent Query Processing, and other updates to help you optimize query execution and make performance improvements to even the most challenging workloads. The book begins with explanations of the types of indexes and how they are stored in a database. Moving further into the book, you will learn how statistics are critical for optimal index usage and how the Index Advisor can assist in reviewing and optimizing index health. This book helps you build a clear understanding of how indexes work, how to implement and use them, and the many options available to tame even the largest and most complex workloads.

What You Will Learn: Properly index row store, columnstore, and memory-optimized tables. Make use of Intelligent Query Processing for faster query results. Review statistics to understand indexing choices made by the optimizer. Apply indexing strategies such as covering indexes, included columns, and index intersections. Recognize and remove unnecessary indexes. Design effective indexes for full-text, spatial, and XML data types.

Who This Book Is For: Azure SQL and SQL Server administrators and developers who are ready to improve the performance of their database environment by thoughtfully building indexes to speed up the queries that matter most and make a difference to the business.
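As one concrete instance of the covering-index strategy listed above, here is a minimal sketch that creates a nonclustered index with included columns from Python via pyodbc; the connection string, table, and column names are illustrative assumptions, not examples from the book.

```python
import pyodbc  # pip install pyodbc

# Placeholder connection string -- adjust server, database, and auth.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=localhost;DATABASE=SalesDB;Trusted_Connection=yes;"
)
cur = conn.cursor()

# A covering index: the key column supports the seek, and the INCLUDE
# columns let the query be answered from the index alone, avoiding
# key lookups back into the clustered index.
cur.execute("""
    CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
    ON dbo.Orders (CustomerID)
    INCLUDE (OrderDate, TotalDue);
""")
conn.commit()
conn.close()
```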

Summary

Cloud data warehouses have unlocked a massive amount of innovation and investment in data applications, but they are still inherently limiting. Because they take complete ownership of your data, they constrain what data you can store and how it can be used. Projects like Apache Iceberg offer a viable alternative in the form of data lakehouses, which combine the scalability and flexibility of data lakes with the ease of use and performance of data warehouses. Ryan Blue helped create the Iceberg project, and in this episode he rejoins the show to discuss how it has evolved and what he is doing in his new business, Tabular, to make it even easier to implement and maintain.
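For a sense of what working with Iceberg tables looks like in practice, here is a minimal sketch using the PyIceberg client; the catalog endpoint, warehouse path, and table name are placeholder assumptions, and real deployments would point at their own REST, Glue, or Hive catalog.

```python
from pyiceberg.catalog import load_catalog  # pip install "pyiceberg[s3fs]"

# Placeholder catalog config -- a REST catalog endpoint and warehouse path.
catalog = load_catalog(
    "default",
    **{
        "uri": "http://localhost:8181",
        "warehouse": "s3://my-bucket/warehouse",
    },
)

# Load an existing table by namespace and name (placeholders).
table = catalog.load_table("analytics.events")

# The table metadata carries schema, partitioning, and snapshot history.
print(table.schema())
print(table.current_snapshot())

# Scan a slice of the table into an Arrow table for local analysis.
arrow_table = table.scan(limit=100).to_arrow()
print(arrow_table.num_rows)
```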

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Hey there podcast listener, are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to timextender.com/dataengineering where you can do two things: watch us build a data estate in 15 minutes and start for free today.

Your host is Tobias Macey and today I'm interviewing Ryan Blue about the evolution and applications of the Iceberg table format and how he is making it more accessible at Tabular.

Interview

Introduction How did you get involved in the area of data management? Can you describe what Iceberg is and its position in the data lake/lakehouse ecosystem?

Since it is fundamentally a specification, how do you manage compatibility and consistency across implementations?

What are the notable changes in the Iceberg project and its role in the ecosystem since our last conversation in October of 2018? Around the time that Iceberg was first created at Netflix, a number of alternative table formats were also being developed. What are the characteristics of Iceberg that lead teams to adopt it for their lakehouse projects?

Given the constant evolution of the various table formats, it can be difficult to determine an up-to-date comparison of their features, particularly earlier in their development. What are the aspects of this problem space that make it so challenging to establish unbiased and comprehensive comparisons?

For someone who wants to manage their data in Iceberg tables, what does the implementation look like?

How does that change based on the type of query/processing engine being used?

Once a table has been created, what are the capabilities of Iceberg that help to support ongoing use and maintenance? What are the most interesting, innovative, or unexpected ways that you have seen Iceberg used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Iceberg/Tabular? When is Iceberg/Tabular the wrong choice? What do you have planned for the future of Iceberg/Tabular?

Contact Info

LinkedIn rdblue on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Summary

Data is a team sport, but it's often difficult for everyone on the team to participate. For a long time the mantra of data tools has been "by developers, for developers", which automatically excludes a large portion of the business members who play a crucial role in the success of any data project. Quilt Data was created as an answer, making it easier for everyone to contribute to the data being used by an organization and to collaborate on its application. In this episode Aneesh Karve shares the journey that Quilt has taken to provide an approachable interface for working with versioned data in S3 that empowers everyone to collaborate.
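As a flavor of the versioned-data workflow discussed in the episode, here is a minimal sketch using the quilt3 Python client; the bucket and package names are placeholders.

```python
import quilt3  # pip install quilt3

# Build a package: logical keys in the package map to local files.
pkg = quilt3.Package()
pkg.set("data/measurements.csv", "measurements.csv")
pkg.set_meta({"project": "demo", "owner": "data-team"})

# Push a new revision to an S3 registry (placeholder bucket and name).
pkg.push(
    "examples/measurements",
    registry="s3://my-quilt-bucket",
    message="Initial versioned snapshot",
)

# Anyone on the team can browse that exact revision later.
restored = quilt3.Package.browse(
    "examples/measurements", registry="s3://my-quilt-bucket"
)
print(list(restored.keys()))
```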

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring across all functions!

Your host is Tobias Macey and today I'm interviewing Aneesh Karve about how Quilt Data helps you bring order to your chaotic data in S3 with transactional versioning and data discovery built in.

Interview

Introduction How did you get involved in the area of data management? Can you describe what Quilt is and the story behind it?

How have the goals and features of the Quilt platform changed since I spoke with Kevin in June of 2018?

What are the main problems that users are trying to solve when they find Quilt?

What are some of the alternative approaches/products that they are coming from?

How does Quilt compare with options such as LakeFS, Unstruk, Pachyderm, etc.? Can you describe how Quilt is implemented? What are the types of tools and systems that Quilt gets integrated with?

How do you manage the tension between supporting the lowest common denominator, while providing options for more advanced capabilities?

What is a typical workflow for a team that is using Quilt to manage their data? What are the most interesting, innovative, or unexpected ways that you have seen Quilt used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Quilt? When is Quilt the wrong choice? What do you have planned for the future of Quilt?

Contact Info

LinkedIn @akarve on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

Quilt Data

Podcast Episode

UW Madison
Docker Swarm
Kaggle
open.quiltdata.com
FinOS Perspective
LakeFS

Podcast Episode

Pachyderm

Podcast Episode

Unstruk

Podcast Episode

Parquet
Avro
ORC
CloudFormation
Troposphere
CDK (Cloud Development Kit)
Shadow IT

Podcast Episode

Delta Lake

Podcast Episode

Apache Iceberg

Podcast Episode

Datasette
Frictionless
DVC

Podcast.init Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

RISE with SAP towards a Sustainable Enterprise

Kickstart your journey towards becoming a sustainable and value-driven enterprise with "RISE with SAP" as your guide. This book explains how to optimize your business processes and implement S/4HANA effectively using RISE with SAP, preparing decision-makers and architects with actionable insights and strategic guidance.

What this Book will help me do: Understand the challenges organizations face when adopting market trends and how to address them effectively. Learn to build a robust business case for transitioning to SAP S/4HANA using RISE with SAP as the foundational framework. Gain insights into process discovery, data migration, and the best practices for the fit-to-standard approach. Develop skills to design optimized enterprise landscapes effectively on the RISE with SAP platform. Master strategies to leverage SAP tools, services, and cloud ecosystems for industry-specific transformation.

Author(s): Adil Zafar, Dharma Alturi, Sanket Taur, and Mihir R. Gor bring together years of combined expertise in enterprise architecture and SAP ecosystems. They leverage their hands-on experience to provide readers with practical advice and cutting-edge insights. Their collaborative work aims to demystify complexities and guide professionals toward sustainable practices.

Who is it for? This book is ideal for CXOs, enterprise architects, and solution architects operating in SAP ecosystems who seek practical guidance for transitioning to SAP S/4HANA via RISE with SAP. It caters to readers who wish to build business cases effectively and ensure sustainable and optimized implementation. Prior experience with SAP or ERP systems will enhance the learning experience.

Summary

This podcast started almost exactly six years ago, and the technology landscape was much different than it is now. In that time there have been a number of generational shifts in how data engineering is done. In this episode I reflect on some of the major themes and take a brief look forward at some of the upcoming changes.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Your host is Tobias Macey and today I'm reflecting on the major trends in data engineering over the past 6 years.

Interview

Introduction 6 years of running the Data Engineering Podcast Around the first time that data engineering was discussed as a role

Followed on from hype about "data science"

Hadoop era Streaming Lambda and Kappa architectures

Not really referenced anymore

"Big Data" era of capture everything has shifted to focusing on data that presents value

Regulatory environment increases risk, better tools introduce more capability to understand what data is useful

Data catalogs

Amundsen and Alation

Orchestration engine

Oozie, etc. -> Airflow and Luigi -> Dagster, Prefect, Flyte, etc. Orchestration is now a part of most vertical tools

Cloud data warehouses

Data lakes

DataOps and MLOps

Data quality to data observability

Metadata for everything

Data catalog -> data discovery -> active metadata

Business intelligence

Read only reports to metric/semantic layers

Embedded analytics and data APIs

Rise of ELT

dbt

Corresponding introduction of reverse ETL

What are the most interesting, unexpected, or challenging lessons that you have learned while working on running the podcast? What do you have planned for the future of the podcast?

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By: Materialize

Looking for the simplest way to get the freshest data possible to your teams? Because let's face it: if real-time were easy, everyone would be using it. Look no further than Materialize, the streaming database you already know how to use.

Materialize’s PostgreSQL-compatible interface lets users leverage the tools they already use, with unsurpassed simplicity enabled by full ANSI SQL support. Delivered as a single platform with the separation of storage and compute, strict-serializability, active replication, horizontal scalability and workload isolation — Materialize is now the fastest way to build products with streaming data, drastically reducing the time, expertise, cost and maintenance traditionally associated with implementation of real-time features.

Sign up now for early access to Materialize and get started with the power of streaming data with the same simplicity and low implementation cost as batch cloud data warehouses.

Go to materialize.com

Support Data Engineering Podcast
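To make the PostgreSQL-compatible interface described above concrete, here is a minimal sketch that connects to a Materialize instance with psycopg2 and defines an incrementally maintained view; the connection details and the orders table are hypothetical.

```python
import psycopg2  # pip install psycopg2-binary

# Placeholder connection details; 6875 is Materialize's default port.
conn = psycopg2.connect(
    host="localhost",
    port=6875,
    user="materialize",
    dbname="materialize",
)
conn.autocommit = True
cur = conn.cursor()

# A materialized view over a hypothetical orders table: Materialize
# keeps the aggregate incrementally up to date as new data streams in.
cur.execute("""
    CREATE MATERIALIZED VIEW order_totals AS
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
""")

# Reads are plain ANSI SQL against always-fresh results.
cur.execute("SELECT * FROM order_totals LIMIT 10")
print(cur.fetchall())
```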

We talked about

Chris’s background
Switching careers multiple times
Freedom at companies
Chris’s role as an internal consultant
Chris’s sabbatical
ChatGPT
How being a generalist helped Chris in his career
The cons of being a generalist and the importance of T-shaped expertise
The importance of learning things you’re interested in
Tips to enjoy learning new things
Recruiting generalists
The job market for generalists vs for specialists
Narrowing down your interests
Chris’s book recommendations

Links:

Lex Fridman: science, philosophy, media, AI (especially earlier episodes): https://www.youtube.com/lexfridman
Andrej Karpathy, former Senior Director of AI at Tesla, who's now focused on teaching and sharing his knowledge: https://www.youtube.com/@AndrejKarpathy
Beautifully done videos on engineering of things in the real world: https://www.youtube.com/@RealEngineering
Chris' website: https://szafranek.net/
Zalando Tech Radar: https://opensource.zalando.com/tech-radar/
Modal Labs, new way of deploying code to the cloud, also useful for testing ML code on GPUs: https://modal.com
Excellent Twitter account to follow to learn more about prompt engineering for ChatGPT: https://twitter.com/goodside
Image prompts for Midjourney: https://twitter.com/GuyP
Machine Learning Workflows in Production - Krzysztof Szafranek: https://www.youtube.com/watch?v=CO4Gqd95j6k
From Data Science to DataOps: https://datatalks.club/podcast/s11e03-from-data-science-to-dataops.html

Free data engineering course: https://github.com/DataTalksClub/data-engineering-zoomcamp

Join DataTalks.Club: https://datatalks.club/slack.html

Our events: https://datatalks.club/events.html

Summary

Business intelligence has gone through many generational shifts, but each generation has largely maintained the same workflow. Data analysts create reports that the business uses to understand and direct its operations, but the process is very labor and time intensive. The team at Omni have taken a new approach by automatically building models based on the queries that are executed. In this episode Chris Merrick shares how they manage integration and automation around the modeling layer and how it improves the organizational experience of business intelligence.
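The idea of deriving models from executed queries can be illustrated generically: the sketch below uses the sqlglot parser to tally which tables and columns a query log touches, the kind of signal an automatic modeling layer could aggregate. It is an illustration of the concept, not Omni's implementation.

```python
from collections import Counter

import sqlglot  # pip install sqlglot
from sqlglot import exp

# A hypothetical query log, as an automatic modeling layer might see it.
query_log = [
    "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id",
    "SELECT customer_id, status FROM orders WHERE status = 'open'",
    "SELECT region, COUNT(*) FROM customers GROUP BY region",
]

table_usage: Counter = Counter()
column_usage: Counter = Counter()

for sql in query_log:
    tree = sqlglot.parse_one(sql)
    for table in tree.find_all(exp.Table):
        table_usage[table.name] += 1
    for column in tree.find_all(exp.Column):
        column_usage[column.name] += 1

# Frequently co-used tables and columns are candidates for shared models.
print(table_usage.most_common())
print(column_usage.most_common())
```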

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring across all functions!

Your host is Tobias Macey and today I'm interviewing Chris Merrick about the Omni Analytics platform and how they are adding automatic data modeling to your business intelligence.

Interview

Introduction How did you get involved in the area of data management? Can you describe what Omni Analytics is and the story behind it?

What are the core goals that you are trying to achieve with building Omni?

Business intelligence has gone through many evolutions. What are the unique capabilities that Omni Analytics offers over other players in the market?

What are the technical and organizational anti-patterns that typically grow up around BI systems?

What are the elements that contribute to BI being such a difficult product to use effectively in an organization?

Can you describe how you have implemented the Omni platform?

How have the design/scope/goals of the product changed since you first started working on it?

What does the workflow for a team using Omni look like?

What are some of the developments in the broader ecosystem that have made your work possible?

What are some of the positive and negative inspirations that you have drawn from the experience that you and your team-mates have gained in previous businesses?

What are the most interesting, innovative, or unexpected ways that you have seen Omni used?

What are the most interesting, unexpected, or challenging lessons that you have learned while working on Omni?

When is Omni the wrong choice?

What do you have planned for the future of Omni?

Contact Info

LinkedIn @cmerrick on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

Omni Analytics
Stitch
RJ Metrics
Looker

Podcast Episode

Singer dbt

Podcast Episode

Teradata
Fivetran
Apache Arrow

Podcast Episode

DuckDB

Podcast Episode

BigQuery Snowflake

Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By: Materialize

Looking for the simplest way to get the freshest data possible to your teams? Because let's face it: if real-time were easy, everyone would be using it. Look no further than Materialize, the streaming database you already know how to use.

Materialize’s PostgreSQL-compatible interface lets users leverage the tools they already use, with unsurpassed simplicity enabled by full ANSI SQL support. Delivered as a single platform with the separation of storage and compute, strict-serializability, active replication, horizontal scalability and workload isolation — Materialize is now the fastest way to build products with streaming data, drastically reducing the time, expertise, cost and maintenance traditionally associated with implementation of real-time features.

Sign up now for early access to Materialize and get started with the power of streaming data with the same simplicity and low implementation cost as batch cloud data warehouses.

Go to materialize.com

Support Data Engineering Podcast

Summary

The most interesting and challenging bugs always happen in production, but recreating them is a constant challenge due to differences in the data that you are working with. Building your own scripts to replicate data from production is time consuming and error-prone. Tonic is a platform designed to solve the problem of having reliable, production-like data available for developing and testing your software, analytics, and machine learning projects. In this episode Adam Kamor explores the factors that make this such a complex problem to solve, the approach that he and his team have taken to turn it into a reliable product, and how you can start using it to replace your own collection of scripts.
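As a generic illustration of the problem space rather than Tonic's product or API, here is a minimal sketch that masks personally identifiable fields with the Faker library, representative of the homegrown scripts the episode says teams often maintain.

```python
from faker import Faker  # pip install faker

fake = Faker()
Faker.seed(42)  # deterministic output for repeatable test fixtures

# Hypothetical production rows containing PII.
production_rows = [
    {"id": 1, "name": "Jane Doe", "email": "jane@example.com", "plan": "pro"},
    {"id": 2, "name": "John Roe", "email": "john@example.com", "plan": "free"},
]

def mask_row(row: dict) -> dict:
    """Replace PII with realistic fakes, preserving non-sensitive fields."""
    return {
        **row,
        "name": fake.name(),
        "email": fake.email(),
    }

safe_rows = [mask_row(r) for r in production_rows]
print(safe_rows)
```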

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring across all functions!

Data and analytics leaders, 2023 is your year to sharpen your leadership skills, refine your strategies and lead with purpose. Join your peers at Gartner Data & Analytics Summit, March 20 – 22 in Orlando, FL for 3 days of expert guidance, peer networking and collaboration. Listeners can save $375 off standard rates with code GARTNERDA. Go to dataengineeringpodcast.com/gartnerda today to find out more.

Your host is Tobias Macey and today I'm interviewing Adam Kamor about Tonic, a service for generating data sets that are safe for development, analytics, and machine learning.

Interview

Introduction How did you get involved in the area of data management? Can you describe what Tonic is and the story behind it? What are the core problems that you are trying to solve? What are some of the ways that fake or obfuscated data is used in development and analytics workflows? challenges of reliably subsetting data

impact of ORMs and bad habits developers get into with database modeling

Can you describe how Tonic is implemented?

What are the units of composition that you are building to allow for evolution and expansion of your product? How have the design and goals of the platform evolved since you started working on it?

Can you describe some of the different workflows that customers build on top of your various tools? What are the most interesting, innovative, or unexpected ways that you have seen Tonic used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Tonic? When is Tonic the wrong choice? What do you have planned for the future of Tonic?

Contact Info

LinkedIn @AdamKamor on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

Tonic

Djinn

Django

We enter 2023 in a haze of uncertainty. Enterprises must rationalize analytics projects, shift to lower-risk use cases, and control cloud costs. They also must measure the ROI of analytics projects and use data governance to reduce business risk. Published at: https://www.eckerson.com/articles/analyzing-a-downturn-five-principles-for-data-analytics-in-2023

Earnings are next week - so I thought I'd bring forward the IBM strategy again for a refresher.

Let's talk tech strategy: IBM strategy with Roger Premo, General Manager, Strategy and Corporate Development at IBM. The state of the industry, hybrid cloud, containers, competition. We hit it all.

Show Notes
02:00 Roger Premo's path to strategy
08:37 First impressions of IBM
10:42 Hybrid cloud
15:33 Facts to support hybrid
18:57 Cloud's future
21:40 Why containers, why Red Hat?
25:55 Addressing lock-in
30:08 The IBM bar pitch
34:04 Start with outcomes
36:10 The most exciting technology
38:50 Continuous learning
43:09 The CHIPS Act

LinkedIn: https://www.linkedin.com/in/ropremo/
Website: https://www.ibm.com/

Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

Summary

The modern data stack has made it more economical to use enterprise grade technologies to power analytics at organizations of every scale. Unfortunately it has also introduced new overhead to manage the full experience as a single workflow. At the Modern Data Company they created the DataOS platform as a means of driving your full analytics lifecycle through code, while providing automatic knowledge graphs and data discovery. In this episode Srujan Akula explains how the system is implemented and how you can start using it today with your existing data systems.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring across all functions!

Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.

Data and analytics leaders, 2023 is your year to sharpen your leadership skills, refine your strategies and lead with purpose. Join your peers at Gartner Data & Analytics Summit, March 20 – 22 in Orlando, FL for 3 days of expert guidance, peer networking and collaboration. Listeners can save $375 off standard rates with code GARTNERDA. Go to dataengineeringpodcast.com/gartnerda today to find out more.

Your host is Tobias Macey and today I'm interviewing Srujan Akula about DataOS, a pre-integrated and managed data platform built by The Modern Data Company.

Interview

Introduction How did you get involved in the area of data management? Can you describe what your mission at The Modern Data Company is and the story behind it? Your flagship (only?) product is a platform that you're calling DataOS. What is the scope and goal of that platform?

Who is the target audience?

On your site you refer to the idea of "data as software". What are the principles and ways of thinking that are encompassed by that concept?

What are the platform capabilities that are required to make it possible?

There are 11 "Key Features" listed on your site for the DataOS. What was your process for identifying the "must have" vs "nice to have" features for launching the platform? Can you describe the technical architecture that powers your DataOS product?

What are the core principles that you are optimizing for in the design of your platform? How have the design and goals of the system changed or evolved since you started working on DataOS?

Can you describe the workflow for the different practitioners and stakeholders working on an installation of DataOS? What are the interfaces and escape hatches that are available for integrating with and ext

IBM FlashSystem 9500 Product Guide

This IBM® Redpaper® Product Guide describes the IBM FlashSystem® 9500 solution, which is a next-generation IBM FlashSystem control enclosure. It combines the performance of flash and a Non-Volatile Memory Express (NVMe)-optimized architecture with the reliability and innovation of IBM FlashCore® technology and the rich feature set and high availability (HA) of IBM Spectrum® Virtualize. Often, applications exist that are foundational to the operations and success of an enterprise. These applications might function as prime revenue generators, guide or control important tasks, or provide crucial business intelligence, among many other jobs. Whatever their purpose, they are mission critical to the organization. They demand the highest levels of performance, functionality, security, and availability. They also must be protected against the modern scourge, cyberattacks. To support such mission-critical applications, enterprises of all types and sizes turn to the IBM FlashSystem 9500.

IBM FlashSystem 9500 provides a rich set of software-defined storage (SDS) features that are delivered by IBM Spectrum Virtualize, including the following examples:

Data reduction and deduplication
Dynamic tiering
Thin-provisioning
Snapshots
Cloning
Replication and data copy services
Cyber resilience
Transparent Cloud Tiering
IBM HyperSwap®, including 3-site replication for HA
Scale-out and scale-up configurations that further enhance capacity and throughput for better availability

With the release of IBM Spectrum Virtualize V8.5, extra functions and features are available, including support for new third-generation IBM FlashCore Modules NVMe-type drives within the control enclosure, and 100 Gbps Ethernet adapters that provide NVMe Remote Direct Memory Access (RDMA) options. New software features include GUI enhancements and security enhancements, including multifactor authentication (MFA) and single sign-on (SSO), and Fibre Channel (FC) portsets.

Today I’m chatting with Bruno Aziza, Head of Data & Analytics at Google Cloud. Bruno leads a team of outbound product managers in charge of BigQuery, Dataproc, Dataflow and Looker and we dive deep on what Bruno looks for in terms of skills for these leaders. Bruno describes the three patterns of operational alignment he’s observed in data product management, as well as why he feels ownership and customer obsession are two of the most important qualities a good product manager can have. Bruno and I also dive into how to effectively abstract the core problem you’re solving, as well as how to determine whether a problem might be solved in a better way. 

Highlights / Skip to:

Bruno introduces himself and explains how he created his “CarCast” podcast (00:45)
Bruno describes his role at Google, the product managers he leads, and the specific Google Cloud products in his portfolio (02:36)
What Bruno feels are the most important attributes to look for in a good data product manager (03:59)
Bruno details how a good product manager focuses on not only the core problem, but how the problem is currently solved and whether or not that’s acceptable (07:20)
What effective abstracting the problem looks like in Bruno’s view and why he positions product management as a way to help users move forward in their career (12:38)
Why Bruno sees extracting value from data as the number one pain point for data teams and their respective companies (17:55)
Bruno gives his definition of a data product (21:42)
The three patterns Bruno has observed of operational alignment when it comes to data product management (27:57)
Bruno explains the best practices he’s seen for cross-team goal setting and problem-framing (35:30)

Quotes from Today’s Episode  

“What’s happening in the industry is really interesting. For people that are running data teams today and listening to us, the makeup of their teams is starting to look more like what we do [in] product management.” — Bruno Aziza (04:29)

“The problem is the problem, so focus on the problem, decompose the problem, look at the frictions that are acceptable, look at the frictions that are not acceptable, and look at how by assembling a solution, you can make it most seamless for the individual to go out and get the job done.” – Bruno Aziza (11:28)

“As a product manager, yes, we’re in the business of software, but in fact, I think you’re in the career management business. Your job is to make sure that whatever your customer’s job is that you’re making it so much easier that they, in fact, get so much more done, and by doing so they will get promoted, get the next job.” – Bruno Aziza (15:41)

“I think that is the task of any technology company, of any product manager that’s helping these technology companies: don’t be building a product that’s looking for a problem. Just start with the problem back and solution from that. Just make sure you understand the problem very well.” (19:52)

“If you’re a data product manager today, you look at your data estate and you ask yourself, ‘What am I building to save money? When am I building to make money?’ If you can do both, that’s absolutely awesome. And so, the data product is an asset that has been built repeatedly by a team and generates value out of data.” – Bruno Aziza (23:12)

“[Machine learning is] hard because multiple teams have to work together, right? You got your business analyst over here, you’ve got your data scientists over there, they’re not even the same team. And so, sometimes you’re struggling with just the human aspect of it.” (30:30)

“As a data leader, an IT leader, you got to think about those soft ways to accomplish the stuff that’s binary, that’s the hard [stuff], right? I always joke, the hard stuff is the soft stuff for people like us because we think about data, we think about logic, we think, ‘Okay if it makes sense, it will be implemented.’ For most of us, getting stuff done is through people. And people are emotional, how can you express the feeling of achieving that goal in emotional value?” – Bruno Aziza (37:36)

Links

As referenced by Bruno, “Good Product Manager/Bad Product Manager”: https://a16z.com/2012/06/15/good-product-managerbad-product-manager/
LinkedIn: https://www.linkedin.com/in/brunoaziza/
Bruno’s Medium Article on Competing Against Luck by Clayton M. Christensen: https://brunoaziza.medium.com/competing-against-luck-3daeee1c45d4
The Data CarCast on YouTube: https://www.youtube.com/playlist?list=PLRXGFo1urN648lrm8NOKXfrCHzvIHeYyw

Summary

Managing end-to-end data flows becomes complex and unwieldy as the scale of data and its variety of applications in an organization grows. Part of this complexity is due to the transformation and orchestration of data living in disparate systems. The team at Upsolver is taking aim at this problem with the latest iteration of their platform in the form of SQLake. In this episode Ori Rafael explains how they are automating the creation and scheduling of orchestration flows and their related transformations in a unified SQL interface.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data and analytics leaders, 2023 is your year to sharpen your leadership skills, refine your strategies and lead with purpose. Join your peers at Gartner Data & Analytics Summit, March 20 – 22 in Orlando, FL for 3 days of expert guidance, peer networking and collaboration. Listeners can save $375 off standard rates with code GARTNERDA. Go to dataengineeringpodcast.com/gartnerda today to find out more.

Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring across all functions!

Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.

Your host is Tobias Macey and today I'm interviewing Ori Rafael about the SQLake feature for the Upsolver platform that automatically generates pipelines from your queries.

Interview

Introduction How did you get involved in the area of data management? Can you describe what the SQLake product is and the story behind it?

What is the core problem that you are trying to solve?

What are some of the anti-patterns that you have seen teams adopt when designing and implementing DAGs in a tool such as Airflow? (A sketch of the kind of hand-written DAG under discussion appears after this question list.) What are the benefits of merging the logic for transformation and orchestration into the same interface and dialect (SQL)? Can you describe the technical implementation of the SQLake feature? What does the workflow look like for designing and deploying pipelines in SQLake? What are the opportunities for using utilities such as dbt for managing logical complexity as the number of pipelines scales?

SQL has traditionally been challenging to compose. How did that factor into your design process for how to structure the dialect extensions for job scheduling?

What are some of the complexities that you have had to address in your orchestration system to be able to manage timeliness of operations as volume and complexity of the data scales? What are some of the edge cases that you have had to provide escape hatches for? What are the most interesting, innovative, or unexpected ways that you have seen SQLake used?
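To ground the Airflow anti-patterns question above, here is a minimal sketch of the kind of hand-written, imperative DAG that SQLake's declarative SQL approach is positioned against; the task names and callables are hypothetical and not from the episode.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_orders():
    """Placeholder: pull raw events from object storage."""

def transform_orders():
    """Placeholder: clean and aggregate the extracted data."""

# Each step, dependency, and schedule is wired by hand -- the kind of
# imperative orchestration that a declarative SQL pipeline replaces.
with DAG(
    dag_id="orders_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_orders)
    transform = PythonOperator(task_id="transform", python_callable=transform_orders)
    extract >> transform
```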

Data Modeling with Tableau

"Data Modeling with Tableau" provides a comprehensive guide to effectively utilizing Tableau Prep and Tableau Desktop for building elegant data models that drive organizational insights. You'll explore robust data modeling strategies and governance practices tailored to Tableau's diverse toolset, empowering you to make faster and more informed decisions based on data. What this Book will help me do Understand the fundamentals of data modeling in Tableau using Prep Builder and Desktop. Learn to optimize data sources for performance and better query capabilities. Implement secure and scalable governance strategies with Tableau Server and Cloud. Use advanced Tableau features like Ask Data and Explain Data to enable powerful analytics. Apply best practices for sharing and extending data models within your organization. Author(s) Kirk Munroe is an experienced data professional with a deep understanding of Tableau-driven analytics. With years of in-field expertise, Kirk now dedicates his career to helping businesses unlock their data's potential through effective Tableau solutions. His hands-on approach ensures this book is practical and approachable. Who is it for? This book is ideal for data analysts and business analysts aiming to enhance their skills in data modeling. It is also valuable for professionals such as data stewards, looking to implement secure and performant data strategies. If you seek to make enterprise data more accessible and actionable, this book is for you.

Summary

With all of the messaging about treating data as a product, it is becoming difficult to know what that even means. Vishal Singh is the head of products at Starburst, which means that he has to spend all of his time thinking and talking about the details of product thinking and its application to data. In this episode he shares his thoughts on the strategic and tactical elements of moving your work as a data professional from being task-oriented to being product-oriented, and the long-term improvements in your productivity that it provides.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!

Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.

RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder

Build Data Pipelines. Not DAGs. That’s the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG-based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up to the minute fresh data. SQLake supports a broad set of transformations, including high-cardinality joins, aggregations, upserts and window operations. Output data can be streamed into a data lake for query engines like Presto, Trino or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose. Pricing for SQLake is simple. You pay $99 per terabyte ingested into your data lake using SQLake, and run unlimited transformation pipelines for free. That way data engineers and data users can process to their heart’s content without worrying about their cloud bill. For data engineering podcast listeners, we’re offering a 30 day trial with unlimited data, so go to dataengineeringpodcast.com/upsolver today and see for yourself how to avoid DAG hell.

Your host is Tobias Macey and today I'm interviewing Vishal Singh about his experience