As much as we love Airflow, local development has been a bit of a white whale for much of its history. Until recently, Airflow's local development experience has been hindered by the need to spin up a scheduler and webserver. In this talk, we will explore the latest innovation in Airflow local development, namely the "dag.test()" functionality introduced in Airflow 2.5. We will delve into practical applications of "dag.test()", which empowers users to locally run and debug Airflow DAGs in a single Python process. This new functionality significantly improves the development experience, enabling faster iteration and deployment. In this presentation, we will discuss: how to leverage IDE support for code completion, linting, and debugging; techniques for inspecting and debugging DAG output; and best practices for unit testing DAGs and their underlying functions. Accessible to Airflow users of all levels, join us as we explore the future of Airflow local development!
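As a minimal sketch of this workflow (using a hypothetical toy pipeline, not the speakers' example), a DAG defined with the TaskFlow API can be executed in a single process by calling dag.test() under a __main__ guard, which lets an IDE debugger attach to the tasks directly:

```python
import pendulum
from airflow.decorators import dag, task

@dag(schedule=None, start_date=pendulum.datetime(2023, 1, 1), catchup=False)
def example_pipeline():
    @task
    def extract():
        # Pretend this pulls rows from an upstream system
        return [1, 2, 3]

    @task
    def load(values):
        # Pretend this writes the aggregate somewhere useful
        print(f"total = {sum(values)}")

    load(extract())

dag_object = example_pipeline()

if __name__ == "__main__":
    # Runs the whole DAG in this one Python process (no scheduler or webserver needed),
    # so breakpoints, linting, and ordinary print/log inspection all work as usual
    dag_object.test()
```

Running `python example_pipeline.py` executes the tasks sequentially in-process, so you can step through operator code exactly as you would any other Python script.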
Time series data is at the heart of many applications, from finance and system monitoring to weather forecasting and medical data analysis. "Time Series Indexing" offers a hands-on guide to implementing and leveraging the iSAX indexing technique in Python to efficiently manage, search, and analyze time series data. What this Book will help me do Gain the know-how to implement algorithms like SAX and iSAX with illustrative Python examples. Learn to construct robust time series indexes tailored to real-world data sets. Understand the theoretical underpinnings of time series processing and indexing techniques. Explore and employ visualization techniques to interpret time series structures and insights. Gain the skills to adapt iSAX methodologies to other programming environments and practices. Author(s) Mihalis Tsoukalos is an accomplished developer and author specializing in Python programming and data processing techniques. With years of experience translating complex academic research into practical applications, Mihalis excels at bridging the gap between theory and practice. His writing approach ensures readers grasp both the foundational principles and the hands-on methods needed to succeed. Who is it for? This book best suits researchers, analysts, and developers who work with time series data and seek to elevate their proficiency in indexing and managing such data. It is perfect for professionals with a foundational knowledge of Python and programming concepts. This material also supports learners eager to derive actionable insights from theory-heavy academic research.
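For readers who want a feel for the territory the book covers, here is a deliberately simplified sketch of the SAX idea (z-normalize, reduce with piecewise aggregate approximation, then discretize against Gaussian breakpoints). It is an illustration only, not code from the book, and assumes the series length divides evenly into the number of segments:

```python
import numpy as np
from scipy.stats import norm

def sax_word(series: np.ndarray, segments: int = 8, alphabet_size: int = 4) -> str:
    """Toy SAX transform: returns a short symbolic 'word' for a numeric time series."""
    x = (series - series.mean()) / (series.std() + 1e-12)   # z-normalization
    paa = x.reshape(segments, -1).mean(axis=1)               # piecewise aggregate approximation
    # Breakpoints that split the standard normal into equiprobable regions
    breakpoints = norm.ppf(np.linspace(0, 1, alphabet_size + 1)[1:-1])
    symbols = np.searchsorted(breakpoints, paa)
    return "".join(chr(ord("a") + s) for s in symbols)

series = np.sin(np.linspace(0, 4 * np.pi, 64))
print(sax_word(series))   # an 8-letter word such as "ddaaddaa"
```

iSAX builds on this representation by allowing symbols of varying cardinality, which is what makes the words usable as keys in an index tree.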
A 90-minute hands-on workshop led by Jacob Marks, PhD (Voxel51), introducing the FiftyOne open-source computer vision toolset. Part 1 covers FiftyOne basics (terms, architecture, installation, and general usage), useful workflows to explore, understand, and curate data, and how FiftyOne represents and semantically slices unstructured data. Part 2 is a hands-on introduction to FiftyOne, including loading datasets from the FiftyOne Dataset Zoo, navigating the FiftyOne App, programmatically inspecting attributes, adding new samples and custom attributes, generating and evaluating model predictions, and saving insightful views.
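As a quick taste of the hands-on portion, the following sketch (using the standard FiftyOne quickstart dataset, not necessarily the one used in the workshop) loads a Dataset Zoo dataset, inspects it programmatically, and opens the App:

```python
import fiftyone as fo
import fiftyone.zoo as foz

# Load a small sample dataset from the FiftyOne Dataset Zoo
dataset = foz.load_zoo_dataset("quickstart")

# Programmatically inspect dataset attributes and an individual sample
print(dataset)
print(dataset.first())

# Launch the FiftyOne App to browse and semantically slice the samples
session = fo.launch_app(dataset)
```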
This presentation explores the salary landscape in the German job market, focusing on the challenges of data collection and the approaches used to analyse it. We will discuss the importance of selecting just the right features and how to balance the amount of data used. We will also examine the pipeline from experimentation to production models, the importance of keeping track of metrics, and how we can automate the process. Lastly, we will delve into the challenges of gender bias, data representation, and monotonicity, looking at how these factors impact our predictions, as well as prospects for future work.
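On the monotonicity point: one common way to enforce it (shown here purely as an illustration; the talk does not specify the speakers' exact approach) is the monotone_constraints option offered by gradient-boosting libraries such as XGBoost, which can force predicted salary to be non-decreasing in a chosen feature:

```python
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 2))   # column 0: years of experience, column 1: an unrelated feature
y = 40_000 + 30_000 * X[:, 0] + rng.normal(scale=2_000, size=1000)   # synthetic salaries

# "(1,0)" means: non-decreasing in the first feature, unconstrained in the second
model = XGBRegressor(n_estimators=200, monotone_constraints="(1,0)")
model.fit(X, y)

grid = np.column_stack([np.linspace(0, 1, 5), np.full(5, 0.5)])
print(model.predict(grid))   # predictions rise monotonically with experience
```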
A two-hour workshop on using Python to build sales and inventory forecasts. The format consists of a one-hour lesson, followed by access to Learn (an e-learning platform) to continue practicing.
Summary
Data transformation is a key activity for all of the organizational roles that interact with data. Because of its importance and outsized impact on what is possible for downstream data consumers, it is critical that everyone is able to collaborate seamlessly. SQLMesh was designed as a unifying tool that is simple to work with but powerful enough for large-scale transformations and complex projects. In this episode Toby Mao explains how it works, the importance of automatic column-level lineage tracking, and how you can start using it today.
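For context, SQLMesh is built on SQLGlot (linked below), a SQL parser and transpiler. A tiny, illustrative example of the kind of query analysis that underpins column-level lineage might look like the following; this is SQLGlot's API, not SQLMesh's model syntax:

```python
import sqlglot
from sqlglot import exp

# Parse a query into an AST and list the columns it references,
# the raw material for column-level lineage
tree = sqlglot.parse_one("SELECT o.id, o.amount * 1.1 AS amount_with_tax FROM orders AS o")
print([col.sql() for col in tree.find_all(exp.Column)])   # e.g. ['o.id', 'o.amount']

# The same AST machinery lets you transpile between SQL dialects
print(sqlglot.transpile("SELECT IFNULL(amount, 0) FROM orders", read="mysql", write="postgres")[0])
```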
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management. RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack. Your host is Tobias Macey and today I'm interviewing Toby Mao about SQLMesh, an open source DataOps framework designed to scale data transformations with ease of collaboration and validation built in.
Interview
Introduction How did you get involved in the area of data management? Can you describe what SQLMesh is and the story behind it?
DataOps is a term that has been co-opted and overloaded. What are the concepts that you are trying to convey with that term in the context of SQLMesh?
What are the rough edges in existing toolchains/workflows that you are trying to address with SQLMesh?
How do those rough edges impact the productivity and effectiveness of the teams using those tools?
Can you describe how SQLMesh is implemented?
How have the design and goals evolved since you first started working on it?
What are the lessons that you have learned from dbt which have informed the design and functionality of SQLMesh?
For teams who have already invested in dbt, what is the migration path from or integration with dbt?
You have some built-in integration with/awareness of orchestrators (currently Airflow). What are the benefits of making the transformation tool aware of the orchestrator?
What do you see as the potential benefits of integration with e.g. data-diff?
What are the second-order benefits of using a tool such as SQLMesh that addresses the more mechanical aspects of managing transformation workflows and the associated dependency chains?
What are the most interesting, innovative, or unexpected ways that you have seen SQLMesh used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on SQLMesh?
When is SQLMesh the wrong choice?
What do you have planned for the future of SQLMesh?
Contact Info
tobymao on GitHub @captaintobs on Twitter Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.
Links
SQLMesh Tobiko Data SAS AirBnB Minerva SQLGlot Cron AST == Abstract Syntax Tree Pandas Terraform dbt
Podcast Episode
SQLFluff
Podcast.init Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Summary
Architectural decisions are all based on certain constraints and a desire to optimize for different outcomes. In data systems, one of the core architectural exercises is data modeling, which can have significant impacts on what is and is not possible for downstream use cases. Incorporating column-level lineage in the data modeling process encourages a more robust and well-informed design. In this episode Satish Jayanthi explores the benefits of incorporating column-aware tooling in the data modeling process.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management. RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack. Your host is Tobias Macey and today I'm interviewing Satish Jayanthi about the practice and promise of building a column-aware data architecture through intentional modeling.
Interview
Introduction How did you get involved in the area of data management? How has the move to the cloud for data warehousing/data platforms influenced the practice of data modeling?
There are ongoing conversations about the continued merits of dimensional modeling techniques in modern warehouses. What are the modeling practices that you have found to be most useful in large and complex data environments?
Can you describe what you mean by the term column-aware in the context of data modeling/data architecture?
What are the capabilities that need to be built into a tool for it to be effectively column-aware?
What are some of the ways that tools like dbt miss the mark in managing large/complex transformation workloads?
Column-awareness is obviously critical in the context of the warehouse. What are some of the ways that that information can be fed into other contexts? (e.g. ML, reverse ETL, etc.)
What is the importance of embedding column-level lineage awareness into the transformation tool vs. layering it on top with dedicated lineage/metadata tooling?
What are the most interesting, innovative, or unexpected ways that you have seen column-aware data modeling used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on building column-aware tooling?
When is column-aware modeling the wrong choice?
What are some additional resources that you recommend for individuals/teams who want to learn more about data modeling/column-aware principles?
Contact Info
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.
Links
Coalesce
Podcast Episode
Star Schema Conformed Dimensions Data Vault
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
Rudderstack: 
RudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines.
Summary
Data engineering is all about building workflows, pipelines, systems, and interfaces to provide stable and reliable data. Your data can be stable and wrong, but then it isn't reliable. Confidence in your data is achieved through constant validation and testing. Datafold has invested a lot of time into integrating with the workflow of dbt projects to add early verification that the changes you are making are correct. In this episode Gleb Mezhanskiy shares some valuable advice and insights into how you can build reliable and well-tested data assets with dbt and data-diff.
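As a rough sketch of the kind of check discussed in this episode, data-diff's documented Python API can compare the same table across two databases by primary key. The connection strings, table name, and key column below are placeholders, and exact argument names may differ between releases:

```python
from data_diff import connect_to_table, diff_tables

# Placeholder connection strings and table/key names, for illustration only
table1 = connect_to_table("postgresql://user:pass@prod-host/db", "orders", "id")
table2 = connect_to_table("postgresql://user:pass@staging-host/db", "orders", "id")

# Yields ("+", row) / ("-", row) tuples for rows that differ between the two tables
for sign, row in diff_tables(table1, table2):
    print(sign, row)
```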
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management. RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack. Your host is Tobias Macey and today I'm interviewing Gleb Mezhanskiy about how to test your dbt projects with Datafold.
Interview
Introduction How did you get involved in the area of data management? Can you describe what Datafold is and what's new since we last spoke? (July 2021 and July 2022 about data-diff) What are the roadblocks to data testing/validation that you see teams run into most often?
How does the tooling used contribute to/help address those roadblocks?
What are some of the error conditions/failure modes that data-diff can help identify in a dbt project?
What are some examples of tests that need to be implemented by the engineer?
In your experience working with data teams, what typically constitutes the "staging area" for a dbt project? (e.g. separate warehouse, namespaced tables, Snowflake data copies, lakeFS, etc.)
Given a dbt project that is well tested and has data-diff as part of the validation suite, what are the challenges that teams face in managing the feedback cycle of running those tests?
In application development there is the idea of the "testing pyramid", consisting of unit tests, integration tests, system tests, etc. What are the parallels to that in data projects?
What are the limitations of the data ecosystem that make testing a bigger challenge than it might otherwise be?
Beyond test execution, what are the other aspects of data health that need to be included in the development and deployment workflow of dbt projects? (e.g. freshness, time to delivery, etc.)
What are the most interesting, innovative, or unexpected ways that you have seen Datafold and/or data-diff used for testing dbt projects?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on dbt testing internally or with your customers?
When is Datafold/data-diff the wrong choice for dbt projects?
What do you have planned for the future of Datafold?
Contact Info
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Datafold
Podcast Episode
data-diff
Podcast Episode
dbt
We talked about:
Antonis' background
The pros and cons of working for a startup
Useful skills for working at a startup and the Lean way to work
How Antonis joined the DataTalks.Club community
Suggestions for students joining the MLOps course
Antonis contributing to Evidently AI
How Antonis started freelancing
Getting your first clients on Upwork
Pricing your work as a freelancer
The process after getting approved by a client
Wearing many hats as a freelancer and while working at a startup
Other suggestions for getting clients as a freelancer
Antonis' thoughts on the Data Engineering course
Antonis' resource recommendations
Links:
Lean Startup by Eric Ries: https://theleanstartup.com/ Lean Analytics: https://leananalyticsbook.com/ Designing Machine Learning Systems by Chip Huyen: https://www.oreilly.com/library/view/designing-machine-learning/9781098107956/ Kafka streaming with Python tutorial video by Khris Jenkins: https://youtu.be/jItIQ-UvFI4
Free MLOps course: https://github.com/DataTalksClub/mlops-zoomcamp Join DataTalks.Club: https://datatalks.club/slack.html Our events: https://datatalks.club/events.html
Microsoft Power BI Machine Learning and OpenAI offers a comprehensive exploration into advanced data analytics and artificial intelligence using Microsoft Power BI. Through hands-on, workshop-style examples, readers will discover the integration of machine learning models and OpenAI features to enhance business intelligence. This book provides practical examples, real-world scenarios, and step-by-step guidance. What this Book will help me do Learn to apply machine learning capabilities within Power BI to create predictive analytics Understand how to integrate OpenAI services to build enhanced analytics workflows Gain hands-on experience in using R and Python for advanced data visualization in Power BI Master the skills needed to build and deploy SaaS auto ML models within Power BI Leverage Power BI's AI visuals and features to elevate data storytelling Author(s) Greg Beaumont, an expert in data science and business intelligence, brings years of experience in Power BI and analytics to this book. With a focus on practical applications, Greg empowers readers to harness the power of AI and machine learning to elevate their data solutions. As a consultant and trainer, he shares his deep knowledge to help readers unlock the full potential of their tools. Who is it for? This book is ideal for data analysts, BI professionals, and data scientists who aim to integrate machine learning and OpenAI into their workflows. If you're familiar with Power BI's fundamentals and are eager to explore its advanced capabilities, this guide is tailored for you. Perfect for professionals looking to elevate their analytics to a new level, combining data science concepts with Power BI's features.
Summary
Building a data team is hard in any circumstance, but at a startup it can be even more challenging. The requirements are fluid, you probably don't have a lot of existing data talent to manage the hiring and onboarding, and there is a need to move fast. Ghalib Suleiman has been on both sides of this equation and joins the show to share his hard-won wisdom about how to start and grow a data team in the early days of company growth.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management. RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack. Your host is Tobias Macey and today I'm interviewing Ghalib Suleiman about challenges and strategies for building data teams in a startup.
Interview
Introduction How did you get involved in the area of data management? Can you start by sharing your conception of the responsibilities of a data team? What are some of the common fallacies that organizations fall prey to in their first efforts at building data capabilities?
Have you found it more practical to hire outside talent to build out the first data systems, or grow that talent internally? What are some of the resources you have found most helpful in training/educating the early creators and consumers of data assets?
When there is no internal data talent to assist with hiring, what are some of the problems that manifest in the hiring process?
What are the concepts that the new hire needs to know? How much does the hiring manager/interviewer need to know about those concepts to evaluate skill?
What are the most critical skills for a first hire to have to start generating valuable output? As a solo data person, what are the uphill battles that they need to be prepared for in the organization?
What are the rabbit holes that they should beware of?
What are some of the tactical...
What are the most interesting, innovative, or unexpected ways that you have seen initial data hires tackle startup challenges?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on starting and growing data teams?
When is it more practical to outsource the data work?
Contact Info
LinkedIn @ghalib on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.
Links
Polytomic
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
Rudderstack: 
RudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines.
RudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team.
MySQL Crash Course is a fast-paced, no-nonsense introduction to relational database development. It’s filled with practical examples and expert advice that will have you up and running quickly. You’ll learn the basics of SQL, how to create a database, craft SQL queries to extract data, and work with events, procedures, and functions. You’ll see how to add constraints to tables to enforce rules about permitted data and use indexes to accelerate data retrieval. You’ll even explore how to call MySQL from PHP, Python, and Java. Three final projects will show you how to build a weather database from scratch, use triggers to prevent errors in an election database, and use views to protect sensitive data in a salary database. You’ll also learn how to:
• Query database tables for specific information, order the results, comment SQL code, and deal with null values
• Define table columns to hold strings, integers, and dates, and determine what data types to use
• Join multiple database tables as well as use temporary tables, common table expressions, derived tables, and subqueries
• Add, change, and remove data from tables, create views based on specific queries, write reusable stored routines, and automate and schedule events
The perfect quick-start resource for database developers, MySQL Crash Course will arm you with the tools you need to build and manage fast, powerful, and secure MySQL-based data storage systems.
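As a small illustration of the "call MySQL from Python" topic, the mysql-connector-python driver follows the standard DB-API pattern. The table, columns, and credentials below are invented for the example (loosely inspired by the book's weather-database project), not taken from the book:

```python
import mysql.connector  # pip install mysql-connector-python

# Hypothetical connection details and schema, for illustration only
conn = mysql.connector.connect(host="localhost", user="app", password="secret", database="weather")
cur = conn.cursor()

# Parameterized query: the driver handles quoting/escaping of %s placeholders
cur.execute("SELECT station, recorded_at, temp_c FROM readings WHERE temp_c > %s", (30,))
for station, recorded_at, temp_c in cur.fetchall():
    print(station, recorded_at, temp_c)

cur.close()
conn.close()
```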
Summary
Batch vs. streaming is a long-running debate in the world of data integration and transformation. Proponents of the streaming paradigm argue that stream processing engines can easily handle batched workloads, but the reverse isn't true. The batch world has been the default for years because of the complexities of running a reliable streaming system at scale. In order to remove that barrier, the team at Estuary have built the Gazette and Flow systems from the ground up to resolve the pain points of other streaming engines, while providing an intuitive interface for data and application engineers to build their streaming workflows. In this episode, David Yaffe and Johnny Graettinger share the story behind the business and technology and how you can start using it today to build a real-time data lake without all of the headache.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management. RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack. Your host is Tobias Macey and today I'm interviewing David Yaffe and Johnny Graettinger about using streaming data to build a real-time data lake and how Estuary gives you a single path to integrating and transforming your various sources.
Interview
Introduction How did you get involved in the area of data management? Can you describe what Estuary is and the story behind it? Stream processing technologies have been around for roughly a decade. How would you characterize the current state of the ecosystem?
What was missing in the ecosystem of streaming engines that motivated you to create a new one from scratch?
With the growth in tools that are focused on batch-oriented data integration and transformation, what are the reasons that an organization should still invest in streaming?
What is the comparative level of difficulty and support for these disparate paradigms?
What is the impact of continuous data flows on DAGs/orchestration of transforms? What role do modern table formats play in the viability of real-time data lakes? Can you describe the architecture of your Flow platform?
What are the core capabilities that you are optimizing for in its design?
What is involved in getting Flow/Estuary deployed and integrated with an organization's data systems? What does the workflow look like for a team using Estuary?
How does it impact the overall system architecture for a data platform as compared to other prevalent paradigms?
How do you manage the translation of poll vs. push availability and best practices for API and other non-CDC sources?
What are the most interesting, innovative, or unexpected ways that you have seen Estuary used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Estuary?
When is Estuary the wrong choice?
What do you have planned for the future of Estuary?
Contact Info
Dave Y Johnny G
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.
Summary
All of the advancements in our technology are based on the principle of abstraction. Abstractions are valuable until they break down, which is an inevitable occurrence. In this episode the host Tobias Macey shares his reflections on recent experiences where the abstractions leaked, and some observations on how to deal with that situation in a data platform architecture.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management. RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack. Your host is Tobias Macey and today I'm sharing some thoughts and observations about abstractions and impedance mismatches from my experience building a data lakehouse with an ELT workflow.
Interview
Introduction
Impact of community tech debt
Hive metastore: new work being done but not widely adopted
Tensions between automation and correctness
Data type mapping: integer types, complex types, naming things (keys/column names from APIs to databases)
Disaggregated databases: pros and cons
Flexibility and cost control; not as much tooling invested vs. Snowflake/BigQuery/Redshift
Data modeling: dimensional modeling vs. answering today's questions
What are the most interesting, unexpected, or challenging lessons that you have learned while working on your data platform? When is ELT the wrong choice? What do you have planned for the future of your data platform?
Contact Info
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.
Links
dbt Airbyte
Podcast Episode
Dagster
Podcast Episode
Trino
Podcast Episode
ELT Data Lakehouse Snowflake BigQuery Redshift Technical Debt Hive Metastore AWS Glue
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
Rudderstack: 
RudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines.
RudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team.
RudderStack also supports real-time use cases. You can implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again.
Visit dataengineeringpodcast.com/rudderstack to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener. Support Data Engineering Podcast
ABOUT THE TALK: In this talk, we discuss cleanlab open-source (github.com/cleanlab/cleanlab) and Cleanlab Studio (https://cleanlab.ai/studio). Cleanlab open-source is a fast-growing Python framework for data-centric AI that automatically detects issues in ML datasets. Cleanlab Studio is a no-code web interface used by universities and Fortune 500 companies for dataset issue detection and fixing. Cleanlab algorithms have theoretical support for improved accuracy on real-world, messy data.
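As a minimal sketch of the open-source workflow (toy data and a standard scikit-learn model, not the speaker's exact example), cleanlab's find_label_issues takes out-of-sample predicted probabilities and flags the samples most likely to be mislabeled:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

# Toy dataset with a handful of labels flipped on purpose
X, y = make_classification(n_samples=500, n_classes=3, n_informative=5, random_state=0)
y_noisy = y.copy()
y_noisy[:10] = (y_noisy[:10] + 1) % 3

# cleanlab needs out-of-sample predicted probabilities, e.g. from cross-validation
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y_noisy, cv=5, method="predict_proba"
)

# Indices of the samples most likely to be mislabeled, worst first
issues = find_label_issues(labels=y_noisy, pred_probs=pred_probs,
                           return_indices_ranked_by="self_confidence")
print(issues[:10])
```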
ABOUT THE SPEAKER: Curtis Northcutt is an American computer scientist and entrepreneur focusing on machine learning and AI to empower people. He is the CEO and co-founder of Cleanlab, an AI software company that improves machine learning model performance by automatically fixing data and label issues in real-world, messy datasets. Curtis completed his PhD at MIT where he invented Cleanlab’s algorithms for automatically finding and fixing label issues in any dataset.
ABOUT DATA COUNCIL: Data Council (https://www.datacouncil.ai/) is a community and conference series that provides data professionals with the learning and networking opportunities they need to grow their careers.
Make sure to subscribe to our channel for the most up-to-date talks from technical professionals on data-related topics including data infrastructure, data engineering, ML systems, analytics and AI from top startups and tech companies.
FOLLOW DATA COUNCIL: Twitter: https://twitter.com/DataCouncilAI LinkedIn: https://www.linkedin.com/company/datacouncil-ai/
A vast number of time series datasets are organized into structures with different levels or hierarchies of aggregation.
In this talk, we introduce the open-source Hierarchical Forecast library, which contains different reconciliation algorithms, preprocessed datasets, evaluation metrics, and a compiled set of statistical baseline models. This Python-based framework aims to bridge the gap between statistical modeling and Machine Learning in the time series field.
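To make the reconciliation idea concrete, here is a library-free, minimal sketch of bottom-up reconciliation with a summing matrix; it illustrates the concept the library automates rather than the library's own API:

```python
import numpy as np

# Two bottom-level series (say, product A and product B) plus their total.
# The summing matrix S maps bottom-level forecasts to every node in the hierarchy.
S = np.array([
    [1, 1],   # total = A + B
    [1, 0],   # A
    [0, 1],   # B
])

# Independent "base" forecasts for each node; note they are not coherent (60 + 40 != 105)
y_hat = np.array([105.0, 60.0, 40.0])   # total, A, B

# Bottom-up reconciliation: keep the bottom-level forecasts and re-derive every aggregate
bottom = y_hat[1:]
y_reconciled = S @ bottom
print(y_reconciled)   # [100.  60.  40.], now coherent across the hierarchy
```

Other reconciliation methods (MinTrace, top-down, and so on) differ in how they combine the base forecasts, but all produce forecasts that respect the hierarchy in the same sense.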
ABOUT THE SPEAKER: Max Mergenthaler is the CEO and Co-Founder of Nixtla, a time-series research and deployment startup. He is also a seasoned entrepreneur with a proven track record as the founder of multiple technology startups. With a decade of experience in the ML industry, he has extensive expertise in building and leading international data teams. Max has also made notable contributions to the Data Science field through his co-authorship of papers on forecasting algorithms and decision theory.
👉 Sign up for our “No BS” Newsletter to get the latest technical data & AI content: https://datacouncil.ai/newsletter
ABOUT THE TALK: Hugging Face Transformers is a popular open-source project for cutting-edge machine learning (ML). Still, meeting the computational requirements of the advanced models it provides often requires scaling beyond a single machine. This session explores the integration between Hugging Face and Ray AI Runtime (AIR), allowing users to scale their model training and data loading seamlessly. We will dive deep into the implementation and API and explore how we can use Ray AIR to create an end-to-end Hugging Face workflow, from data ingest through fine-tuning and HPO to inference and serving.
ABOUT THE SPEAKERS: Jules S. Damji is a lead developer advocate at Anyscale Inc, an MLflow contributor, and co-author of Learning Spark, 2nd Edition. He is a hands-on developer with over 25 years of experience and has worked at leading companies, such as Sun Microsystems, Netscape, @Home, Opsware/LoudCloud, VeriSign, ProQuest, Hortonworks, and Databricks, building large-scale distributed systems.
Antoni Baum is a software engineer at Anyscale, working on Ray Tune, XGBoost-Ray, Ray AIR, and other ML libraries. In his spare time, he contributes to various open source projects, trying to make machine learning more accessible and approachable.
ABOUT THE TALK: There has been an increasing interest in machine learning model interpretability and explainability. Researchers and ML practitioners have designed many explanation techniques such as explainable boosting machines, visual analytics, distillation, prototypes, saliency maps, counterfactuals, feature visualization, LIME, SHAP, InterpretML, and TCAV. In this talk, Sophia Yang provides a high-level overview of the popular model explanation techniques.
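As one concrete example of the techniques surveyed (illustrative only, not the speaker's demo), SHAP's unified Explainer API can summarize a tree model's feature attributions in a few lines:

```python
import shap
import xgboost
from sklearn.datasets import fetch_california_housing

# Train a small model on a standard regression dataset
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
model = xgboost.XGBRegressor(n_estimators=100).fit(X, y)

# Compute SHAP values with the unified Explainer API and plot a global summary
explainer = shap.Explainer(model, X)
shap_values = explainer(X[:500])
shap.plots.beeswarm(shap_values)
```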
ABOUT THE SPEAKER: Sophia Yang is a Senior Data Scientist and a Developer Advocate at Anaconda. She is passionate about the data science community and the Python open-source community. She is the author of multiple Python open-source libraries such as condastats, cranlogs, PyPowerUp, intake-stripe, and intake-salesforce.
ABOUT THE TALK: Quarto is a multi-language, open-source toolkit for creating data-driven websites, reports, presentations, and scientific articles, built on Jupyter.
This talk teaches you how to use Quarto to publish Jupyter notebooks as production-quality websites, books, blogs, presentations, PDFs, Office documents, and more. It covers how to publish notebooks within existing content management systems like Hugo, Docusaurus, and Confluence, and also explores how Quarto works under the hood, along with how the system can be extended to accommodate unique requirements and workflows.
ABOUT THE SPEAKER: J.J. Allaire is the founder of RStudio and the creator of the RStudio IDE. He is an author of several packages in the R Markdown publishing ecosystem and has also worked extensively on the R interfaces to Python and TensorFlow. J.J. is now leading the Quarto project, which is a new Jupyter-based scientific and technical publishing system.