Python

Exploring The Nuances Of Building An Intentional Data Culture

2023-03-06 · Data Engineering Podcast Listen

podcast_episode

by Pete Soderling (Data Council) , Maggie Hays (DataHub) , Tobias Macey

AI/ML Data Engineering Data Management dbt Modern Data Stack

Summary

The ecosystem for data professionals has matured to the point that there are a large and growing number of distinct roles. With the scope and importance of data steadily increasing it is important for organizations to ensure that everyone is aligned and operating in a positive environment. To help facilitate the nascent conversation about what constitutes an effective and productive data culture, the team at Data Council have dedicated an entire conference track to the subject. In this episode Pete Soderling and Maggie Hays join the show to explore this topic and their experience preparing for the upcoming conference.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Hey there podcast listener, are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to dataengineeringpodcast.com/timextender where you can do two things: watch us build a data estate in 15 minutes and start for free today. Your host is Tobias Macey and today I'm interviewing Pete Soderling and Maggie Hays about the growing importance of establishing and investing in an organization's data culture and their experience forming an entire conference track around this topic

Interview

Introduction How did you get involved in the area of data management? Can you describe what your working definition of "Data Culture" is?

In what ways is a data culture distinct from an organization's corporate culture? How are they interdependent? What are the elements that are most impactful in forming the data culture of an organization?

What are some of the motivations that teams/companies might have in fighting against the creation and support of an explicit data culture?

Are there any strategies that you have found helpful in counteracting those tendencies?

In terms of the conference, what are the factors that you consider when deciding how to group the different presentations into tracks or themes?

What are the experiences that you have had personally and in community interactions that led you to elevate data culture to be it's own track?

What are the broad challenges that practitioners are facing as they develop their own understanding of what constitutes a healthy and productive data culture? What are some of the risks that you considered when forming this track and evaluating proposals? What are your criteria for determining whether this track is successful? What are the most interesting, innovative, or unexpected aspects of data culture that you have encountered through developing this track? What are the most interesting, unexpected, or challenging lessons that you have learned while working on selecting presentations for this year's event? What do you have planned for the future of this topic at Data Council events?

Contact Info

Pete

@petesoder on Twitter LinkedIn

Maggie

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

Data Council

Podcast Episode

Data Community Fund DataHub

Podcast Episode

Database Design For Mere Mortals by Michael J. Hernandez (affiliate link) SOAP REST Econometrics DBA == Database Administrator Conway's Law dbt

Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Sponsored By: TimeXtender: TimeXtender Logo TimeXtender is a holistic, metadata-driven solution for data integration, optimized for agility. TimeXtender provides all the features you need to build a future-proof infrastructure for ingesting, transforming, modelling, and delivering clean, reliable data in the fastest, most efficient way possible.

You can't optimize for everything all at once. That's why we take a holistic approach to data integration that optimises for agility instead of fragmentation. By unifying each layer of the data stack, TimeXtender empowers you to build data solutions 10x faster while reducing costs by 70%-80%. We do this for one simple reason: because time matters.

Go to dataengineeringpodcast.com/timextender today to get started for free!Support Data Engineering Podcast

Applied Geospatial Data Science with Python

2023-02-28 · O'Reilly Data Science Books O'Reilly Amazon

book

by David S. Jordan

AI/ML Analytics Data Science GIS programming-languages software-development

"Applied Geospatial Data Science with Python" introduces readers to the power of integrating geospatial data into data science workflows. This book equips you with practical methods for processing, analyzing, and visualizing spatial data to solve real-world problems. Through hands-on examples and clear, actionable advice, you will master the art of spatial data analysis using Python. What this Book will help me do Learn to process, analyze, and visualize geospatial data using Python libraries. Develop a foundational understanding of GIS and geospatial data science principles. Gain skills in building geospatial AI and machine learning models for specific use cases. Apply geospatial data workflows to practical scenarios like optimization and clustering. Create a portfolio of geospatial data science projects relevant across different industries. Author(s) David S. Jordan is an experienced data scientist with years of expertise in GIS and geospatial analytics. With a passion for making complex topics accessible, David leverages his deep technical knowledge to provide practical, hands-on instruction. His approach emphasizes real-world applications and encourages learners to develop confidence as they work with geospatial data. Who is it for? This book is perfect for data scientists looking to integrate geospatial data analysis into their existing workflows, and GIS professionals seeking to expand into data science. If you already have a basic knowledge of Python for data analysis or data science and want to explore how to work effectively with geospatial data to drive impactful solutions, this is the book for you.

Building A Data Mesh Platform At PayPal

2023-02-27 · Data Engineering Podcast Listen

podcast_episode

by Jean-Georges Perrin (Actian) , Tobias Macey

AI/ML Data Contracts Data Engineering Data Management Data Quality Modern Data Stack

Summary

There has been a lot of discussion about the practical application of data mesh and how to implement it in an organization. Jean-Georges Perrin was tasked with designing a new data platform implementation at PayPal and wound up building a data mesh. In this episode he shares that journey and the combination of technical and organizational challenges that he encountered in the process.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to dataengineeringpodcast.com/timextender where you can do two things: watch us build a data estate in 15 minutes and start for free today. Your host is Tobias Macey and today I'm interviewing Jean-Georges Perrin about his work at PayPal to implement a data mesh and the role of data contracts in making it work

Interview

Introduction How did you get involved in the area of data management? Can you start by describing the goals and scope of your work at PayPal to implement a data mesh?

What are the core problems that you were addressing with this project? Is a data mesh ever "done"?

What was your experience engaging at the organizational level to identify the granularity and ownership of the data products that were needed in the initial iteration? What was the impact of leading multiple teams on the design of how to implement communication/contracts throughout the mesh? What are the technical systems that you are relying on to power the different data domains?

What is your philosophy on enforcing uniformity in technical systems vs. relying on interface definitions as the unit of consistency?

What are the biggest challenges (technical and procedural) that you have encountered during your implementation? How are you managing visibility/auditability across the different data domains? (e.g. observability, data quality, etc.) What are the most interesting, innovative, or unexpected ways that you have seen PayPal's data mesh used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on data mesh? When is a data mesh the wrong choice? What do you have planned for the future of your data mesh at PayPal?

Contact Info

LinkedIn Blog

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

Data Mesh

O'Reilly Book (affiliate link)

The next generation of Data Platforms is the Data Mesh PayPal Conway's Law Data Mesh For All Ages - US, Data Mesh For All Ages - UK Data Mesh Radio Data Mesh Community Data Mesh In Action Great Expectations

Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Sponsored By: TimeXtender: TimeXtender Logo TimeXtender is a holistic, metadata-driven solution for data integration, optimized for agility. TimeXtender provides all the features you need to build a future-proof infrastructure for ingesting, transforming, modelling, and delivering clean, reliable data in the fastest, most efficient way possible.

You can't optimize for everything all at once. That's why we take a holistic approach to data integration that optimises for agility instead of fragmentation. By unifying each layer of the data stack, TimeXtender empowers you to build data solutions 10x faster while reducing costs by 70%-80%. We do this for one simple reason: because time matters.

Go to dataengineeringpodcast.com/timextender today to get started for free!Support Data Engineering Podcast

Experimentation for Engineers

2023-02-23 · O'Reilly Data Science Books O'Reilly Amazon

book

by David Sweet

AI/ML Data Science NumPy bayesian-statistics data data-science data-science-tasks statistics

Optimize the performance of your systems with practical experiments used by engineers in the world’s most competitive industries. In Experimentation for Engineers: From A/B testing to Bayesian optimization you will learn how to: Design, run, and analyze an A/B test Break the "feedback loops" caused by periodic retraining of ML models Increase experimentation rate with multi-armed bandits Tune multiple parameters experimentally with Bayesian optimization Clearly define business metrics used for decision-making Identify and avoid the common pitfalls of experimentation Experimentation for Engineers: From A/B testing to Bayesian optimization is a toolbox of techniques for evaluating new features and fine-tuning parameters. You’ll start with a deep dive into methods like A/B testing, and then graduate to advanced techniques used to measure performance in industries such as finance and social media. Learn how to evaluate the changes you make to your system and ensure that your testing doesn’t undermine revenue or other business metrics. By the time you’re done, you’ll be able to seamlessly deploy experiments in production while avoiding common pitfalls. About the Technology Does my software really work? Did my changes make things better or worse? Should I trade features for performance? Experimentation is the only way to answer questions like these. This unique book reveals sophisticated experimentation practices developed and proven in the world’s most competitive industries that will help you enhance machine learning systems, software applications, and quantitative trading solutions. About the Book Experimentation for Engineers: From A/B testing to Bayesian optimization delivers a toolbox of processes for optimizing software systems. You’ll start by learning the limits of A/B testing, and then graduate to advanced experimentation strategies that take advantage of machine learning and probabilistic methods. The skills you’ll master in this practical guide will help you minimize the costs of experimentation and quickly reveal which approaches and features deliver the best business results. What's Inside Design, run, and analyze an A/B test Break the “feedback loops” caused by periodic retraining of ML models Increase experimentation rate with multi-armed bandits Tune multiple parameters experimentally with Bayesian optimization About the Reader For ML and software engineers looking to extract the most value from their systems. Examples in Python and NumPy. About the Author David Sweet has worked as a quantitative trader at GETCO and a machine learning engineer at Instagram. He teaches in the AI and Data Science master's programs at Yeshiva University. Quotes Putting an ‘improved’ version of a system into production can be really risky. This book focuses you on what is important! - Simone Sguazza, University of Applied Sciences and Arts of Southern Switzerland A must-have for anyone setting up experiments, from A/B tests to contextual bandits and Bayesian optimization. - Maxim Volgin, KLM Shows a non-mathematical programmer exactly what they need to write powerful mathematically-based testing algorithms. - Patrick Goetz, The University of Texas at Austin Gives you the tools you need to get the most out of your experiments. - Marc-Anthony Taylor, Raiffeisen Bank International

The View Below The Waterline Of Apache Iceberg And How It Fits In Your Data Lakehouse

2023-02-19 · Data Engineering Podcast Listen

podcast_episode

by Ryan Blue (Tabular) , Tobias Macey

AI/ML Cloud Computing Data Engineering Data Lake Data Lakehouse Data Management GitHub Iceberg Modern Data Stack

Summary

Cloud data warehouses have unlocked a massive amount of innovation and investment in data applications, but they are still inherently limiting. Because of their complete ownership of your data they constrain the possibilities of what data you can store and how it can be used. Projects like Apache Iceberg provide a viable alternative in the form of data lakehouses that provide the scalability and flexibility of data lakes, combined with the ease of use and performance of data warehouses. Ryan Blue helped create the Iceberg project, and in this episode he rejoins the show to discuss how it has evolved and what he is doing in his new business Tabular to make it even easier to implement and maintain.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Hey there podcast listener, are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to timextender.com/dataengineering where you can do two things: watch us build a data estate in 15 minutes and start for free today. Your host is Tobias Macey and today I'm interviewing Ryan Blue about the evolution and applications of the Iceberg table format and how he is making it more accessible at Tabular

Interview

Introduction How did you get involved in the area of data management? Can you describe what Iceberg is and its position in the data lake/lakehouse ecosystem?

Since it is a fundamentally a specification, how do you manage compatibility and consistency across implementations?

What are the notable changes in the Iceberg project and its role in the ecosystem since our last conversation October of 2018? Around the time that Iceberg was first created at Netflix a number of alternative table formats were also being developed. What are the characteristics of Iceberg that lead teams to adopt it for their lakehouse projects?

Given the constant evolution of the various table formats it can be difficult to determine an up-to-date comparison of their features, particularly earlier in their development. What are the aspects of this problem space that make it so challenging to establish unbiased and comprehensive comparisons?

For someone who wants to manage their data in Iceberg tables, what does the implementation look like?

How does that change based on the type of query/processing engine being used?

Once a table has been created, what are the capabilities of Iceberg that help to support ongoing use and maintenance? What are the most interesting, innovative, or unexpected ways that you have seen Iceberg used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Iceberg/Tabular? When is Iceberg/Tabular the wrong choice? What do you have planned for the future of Iceberg/Tabular?

Contact Info

LinkedIn rdblue on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the

Data Mining and Predictive Analytics for Business Decisions

2023-02-13 · O'Reilly Data Science Books O'Reilly Amazon

book

by Andres Fortino

AI/ML Analytics Data Science data data-science data-science-tasks exploratory-data-analysis

With many recent advances in data science, we have many more tools and techniques available for data analysts to extract information from data sets. This book will assist data analysts to move up from simple tools such as Excel for descriptive analytics to answer more sophisticated questions using machine learning. Most of the exercises use R and Python, but rather than focus on coding algorithms, the book employs interactive interfaces to these tools to perform the analysis. Using the CRISP-DM data mining standard, the early chapters cover conducting the preparatory steps in data mining: translating business information needs into framed analytical questions and data preparation. The Jamovi and the JASP interfaces are used with R and the Orange3 data mining interface with Python. Where appropriate, Voyant and other open-source programs are used for text analytics. The techniques covered in this book range from basic descriptive statistics, such as summarization and tabulation, to more sophisticated predictive techniques, such as linear and logistic regression, clustering, classification, and text analytics. Includes companion files with case study files, solution spreadsheets, data sets and charts, etc. from the book. Features: Covers basic descriptive statistics, such as summarization and tabulation, to more sophisticated predictive techniques, such as linear and logistic regression, clustering, classification, and text analytics Uses R, Python, Jamovi and JASP interfaces, and the Orange3 data mining interface Includes companion files with the case study files from the book, solution spreadsheets, data sets, etc.

Let The Whole Team Participate In Data With The Quilt Versioned Data Hub

2023-02-11 · Data Engineering Podcast Listen

podcast_episode

by Aneesh Karve (Quilt Data) , Tobias Macey

AI/ML Avro CloudFormation Cloud Computing Data Engineering Data Management Delta Docker Iceberg ORC Parquet S3 +3 more

Summary

Data is a team sport, but it's often difficult for everyone on the team to participate. For a long time the mantra of data tools has been "by developers, for developers", which automatically excludes a large portion of the business members who play a crucial role in the success of any data project. Quilt Data was created as an answer to make it easier for everyone to contribute to the data being used by an organization and collaborate on its application. In this episode Aneesh Karve shares the journey that Quilt has taken to provide an approachable interface for working with versioned data in S3 that empowers everyone to collaborate.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring across all functions! Your host is Tobias Macey and today I'm interviewing Aneesh Karve about how Quilt Data helps you bring order to your chaotic data in S3 with transactional versioning and data discovery built in

Interview

Introduction How did you get involved in the area of data management? Can you describe what Quilt is and the story behind it?

How have the goals and features of the Quilt platform changed since I spoke with Kevin in June of 2018?

What are the main problems that users are trying to solve when they find Quilt?

What are some of the alternative approaches/products that they are coming from?

How does Quilt compare with options such as LakeFS, Unstruk, Pachyderm, etc.? Can you describe how Quilt is implemented? What are the types of tools and systems that Quilt gets integrated with?

How do you manage the tension between supporting the lowest common denominator, while providing options for more advanced capabilities?

What is a typical workflow for a team that is using Quilt to manage their data? What are the most interesting, innovative, or unexpected ways that you have seen Quilt used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Quilt? When is Quilt the wrong choice? What do you have planned for the future of Quilt?

Contact Info

LinkedIn @akarve on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

Quilt Data

Podcast Episode

UW Madison Docker Swarm Kaggle open.quiltdata.com FinOS Perspective LakeFS

Podcast Episode

Pachyderm

Podcast Episode

Unstruk

Podcast Episode

Parquet Avro ORC Cloudformation Troposphere CDK == Cloud Development Kit Shadow IT

Podcast Episode

Delta Lake

Podcast Episode

Apache Iceberg

Podcast Episode

Datasette Frictionless DVC

Podcast.init Episode

The in

45: 3-Step Guide To Building Your First Data Science Project

2023-02-09 · Data Career Podcast: Helping You Land a Data Analyst Job FAST Listen

podcast_episode

by Avery Smith

AI/ML Analytics Data Analytics Data Science SQL Tableau

You just learned SQL or Python, or Tableau. But you don’t know how to build your data science project? In this episode, Avery shares a 3-step guide to building your first data science project.

🌟 Join the data project club!

“25OFF” to get 25% off (first 50 members).

📊 Come to my next free “How to Land Your First Data Job” training

🏫 Check out my 10-week data analytics bootcamp

Timestamps:

(1:28) - Art is theft, and so is the data science project

(4:02) - Find ideas on Towards Data Science Medium

(5:32) - Read a few articles to get inspiration

(6:05) - Avery’s strategy is doing 30 projects in 30 days

(9:08) - How academia finds inspiration to write

(11:01) - Take Avery’s project, replicate and do it

Mentioned Links:

Building 30 Data Science Projects in 30 days: https://youtu.be/kKmA9ihIg20

30 Data Science Projects Resources: https://www.datacareerjumpstart.com/30projectsresourcesignup

I Used Data Science to UNCOVER McDonald’s Healthiest Meal: https://youtu.be/3bbFc1225-4

Connect with Avery:

📺 Subscribe on YouTube: https://www.youtube.com/c/AverySmithDataCareerJumpstart/videos 🎙Listen to My Podcast: https://podcasts.apple.com/us/podcast/data-career-podcast/id1547386535 👔 Connect with me on LinkedIn: https://www.linkedin.com/in/averyjsmith/ 📸 Instagram: https://www.instagram.com/datacareerjumpstart/ 🎵 TikTok: [https://www.tiktok.com/@verydata?]

Mentioned in this episode: Join the last cohort of 2025! The LAST cohort of The Data Analytics Accelerator for 2025 kicks off on Monday, December 8th and enrollment is officially open!

To celebrate the end of the year, we’re running a special End-of-Year Sale, where you’ll get: ✅ A discount on your enrollment 🎁 6 bonus gifts, including job listings, interview prep, AI tools + more

If your goal is to land a data job in 2026, this is your chance to get ahead of the competition and start strong.

👉 Join the December Cohort & Claim Your Bonuses: https://DataCareerJumpstart.com/daa https://www.datacareerjumpstart.com/daa

Reflecting On The Past 6 Years Of Data Engineering

2023-02-06 · Data Engineering Podcast Listen

podcast_episode

by Tobias Macey

AI/ML Airflow Alation Analytics API AWS Lambda BI Big Data Cloud Computing Dagster Data Engineering Data Management +12 more

Summary

This podcast started almost exactly six years ago, and the technology landscape was much different than it is now. In that time there have been a number of generational shifts in how data engineering is done. In this episode I reflect on some of the major themes and take a brief look forward at some of the upcoming changes.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Your host is Tobias Macey and today I'm reflecting on the major trends in data engineering over the past 6 years

Interview

Introduction 6 years of running the Data Engineering Podcast Around the first time that data engineering was discussed as a role

Followed on from hype about "data science"

Hadoop era Streaming Lambda and Kappa architectures

Not really referenced anymore

"Big Data" era of capture everything has shifted to focusing on data that presents value

Regulatory environment increases risk, better tools introduce more capability to understand what data is useful

Data catalogs

Amundsen and Alation

Orchestration engine

Oozie, etc. -> Airflow and Luigi -> Dagster, Prefect, Lyft, etc. Orchestration is now a part of most vertical tools

Cloud data warehouses Data lakes DataOps and MLOps Data quality to data observability Metadata for everything

Data catalog -> data discovery -> active metadata

Business intelligence

Read only reports to metric/semantic layers Embedded analytics and data APIs

Rise of ELT

dbt Corresponding introduction of reverse ETL

What are the most interesting, unexpected, or challenging lessons that you have learned while working on running the podcast? What do you have planned for the future of the podcast?

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Sponsored By: Materialize:

Looking for the simplest way to get the freshest data possible to your teams? Because let's face it: if real-time were easy, everyone would be using it. Look no further than Materialize, the streaming database you already know how to use.

Materialize’s PostgreSQL-compatible interface lets users leverage the tools they already use, with unsurpassed simplicity enabled by full ANSI SQL support. Delivered as a single platform with the separation of storage and compute, strict-serializability, active replication, horizontal scalability and workload isolation — Materialize is now the fastest way to build products with streaming data, drastically reducing the time, expertise, cost and maintenance traditionally associated with implementation of real-time features.

Sign up now for early access to Materialize and get started with the power of streaming data with the same simplicity and low implementation cost as batch cloud data warehouses.

Go to materialize.comSupport Data Engineering Podcast

Graph Data Science with Neo4j

2023-01-31 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Estelle Scifo (Neo4j)

AI/ML Data Science Neo4j data data-engineering graph-databases

"Graph Data Science with Neo4j" teaches you how to utilize Neo4j 5 and its Graph Data Science Library 2.0 for analyzing and making predictions with graph data. By integrating graph algorithms into actionable machine learning pipelines using Python, you'll harness the power of graph-based data models. What this Book will help me do Query and manipulate graph data using Cypher in Neo4j. Design and implement graph datasets using your data and public sources. Utilize graph-specific algorithms for tasks such as link prediction. Integrate graph data science pipelines into machine learning projects. Understand and apply predictive modeling using the GDS Library. Author(s) None Scifo, the author of "Graph Data Science with Neo4j," is an experienced data scientist with expertise in graph databases and advanced machine learning techniques. Their technical approach combines practical implementation with clear, step-by-step guidance to provide readers the skills they need to excel. Who is it for? This book is ideal for data scientists and analysts familiar with basic Neo4j concepts and Python-based data science workflows who wish to deepen their skills in graph algorithms and machine learning integration. It is particularly suited for professionals aiming to advance their expertise in graph data science for practical applications.

Let Your Business Intelligence Platform Build The Models Automatically With Omni Analytics

2023-01-30 · Data Engineering Podcast Listen

podcast_episode

by Chris Merrick (Omni Analytics) , Tobias Macey

AI/ML Analytics Arrow BI BigQuery Cloud Computing Data Engineering Data Management Data Modelling dbt DuckDB Fivetran +9 more

Summary

Business intelligence has gone through many generational shifts, but each generation has largely maintained the same workflow. Data analysts create reports that are used by the business to understand and direct the business, but the process is very labor and time intensive. The team at Omni have taken a new approach by automatically building models based on the queries that are executed. In this episode Chris Merrick shares how they manage integration and automation around the modeling layer and how it improves the organizational experience of business intelligence.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring across all functions! Your host is Tobias Macey and today I'm interviewing Chris Merrick about the Omni Analytics platform and how they are adding automatic data modeling to your business intelligence

Interview

Introduction How did you get involved in the area of data management? Can you describe what Omni Analytics is and the story behind it?

What are the core goals that you are trying to achieve with building Omni?

Business intelligence has gone through many evolutions. What are the unique capabilities that Omni Analytics offers over other players in the market?

What are the technical and organizational anti-patterns that typically grow up around BI systems?

What are the elements that contribute to BI being such a difficult product to use effectively in an organization?

Can you describe how you have implemented the Omni platform?

How have the design/scope/goals of the product changed since you first started working on it?

What does the workflow for a team using Omni look like?

What are some of the developments in the broader ecosystem that have made your work possible?

What are some of the positive and negative inspirations that you have drawn from the experience that you and your team-mates have gained in previous businesses?

What are the most interesting, innovative, or unexpected ways that you have seen Omni used?

What are the most interesting, unexpected, or challenging lessons that you have learned while working on Omni?

When is Omni the wrong choice?

What do you have planned for the future of Omni?

Contact Info

LinkedIn @cmerrick on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

Omni Analytics Stitch RJ Metrics Looker

Podcast Episode

Singer dbt

Podcast Episode

Teradata Fivetran Apache Arrow

Podcast Episode

DuckDB

Podcast Episode

BigQuery Snowflake

Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Sponsored By: Materialize:

Looking for the simplest way to get the freshest data possible to your teams? Because let's face it: if real-time were easy, everyone would be using it. Look no further than Materialize, the streaming database you already know how to use.

Materialize’s PostgreSQL-compatible interface lets users leverage the tools they already use, with unsurpassed simplicity enabled by full ANSI SQL support. Delivered as a single platform with the separation of storage and compute, strict-serializability, active replication, horizontal scalability and workload isolation — Materialize is now the fastest way to build products with streaming data, drastically reducing the time, expertise, cost and maintenance traditionally associated with implementation of real-time features.

Sign up now for early access to Materialize and get started with the power of streaming data with the same simplicity and low implementation cost as batch cloud data warehouses.

Go to materialize.comSupport Data Engineering Podcast

Safely Test Your Applications And Analytics With Production Quality Data Using Tonic AI

2023-01-22 · Data Engineering Podcast Listen

podcast_episode

by Adam Kamor , Tobias Macey

AI/ML Analytics Cloud Computing Data Engineering Data Management SQL Data Streaming postgresql

Summary

The most interesting and challenging bugs always happen in production, but recreating them is a constant challenge due to differences in the data that you are working with. Building your own scripts to replicate data from production is time consuming and error-prone. Tonic is a platform designed to solve the problem of having reliable, production-like data available for developing and testing your software, analytics, and machine learning projects. In this episode Adam Kamor explores the factors that make this such a complex problem to solve, the approach that he and his team have taken to turn it into a reliable product, and how you can start using it to replace your own collection of scripts.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring across all functions! Data and analytics leaders, 2023 is your year to sharpen your leadership skills, refine your strategies and lead with purpose. Join your peers at Gartner Data & Analytics Summit, March 20 – 22 in Orlando, FL for 3 days of expert guidance, peer networking and collaboration. Listeners can save $375 off standard rates with code GARTNERDA. Go to dataengineeringpodcast.com/gartnerda today to find out more. Your host is Tobias Macey and today I'm interviewing Adam Kamor about Tonic, a service for generating data sets that are safe for development, analytics, and machine learning

Interview

Introduction How did you get involved in the area of data management? Can you describe what Tonic is and the story behind it? What are the core problems that you are trying to solve? What are some of the ways that fake or obfuscated data is used in development and analytics workflows? challenges of reliably subsetting data

impact of ORMs and bad habits developers get into with database modeling

Can you describe how Tonic is implemented?

What are the units of composition that you are building to allow for evolution and expansion of your product? How have the design and goals of the platform evolved since you started working on it?

Can you describe some of the different workflows that customers build on top of your various tools What are the most interesting, innovative, or unexpected ways that you have seen Tonic used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Tonic? When is Tonic the wrong choice? What do you have planned for the future of Tonic?

Contact Info

LinkedIn @AdamKamor on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

Tonic

Djinn

Django

Episode 112: 2022 Retro & Running!

2023-01-13 · ADSP: Algorithms + Data Structures = Programs Listen

podcast_episode

by Conor Hoekstra , Bryce Adelstein Lelbach (NVIDIA)

C++ Rust

In this episode, Conor and Bryce conclude their 2022 retro and talk about running! Link to Episode 112 on Website

Twitter ADSP: The PodcastConor HoekstraBryce Adelstein LelbachShow Notes Date Recorded: 2023-01-04 Date Released: 2023-01-13 NVIDIA/stdexec - Senders - A Standard Model for Asynchronous Execution in C++Rust Programming LanguageLanguishTalk Python To MeLightning Talk: Runner’s Guide to C++ Conferences - Timur Doumler - CppNorth 2022Optic FlowWorld Marathon Majorscode::dive ConferenceLamdaDays ConferenceStrange Loop ConferenceIntro Song Info Miss You by Sarah Jansen https://soundcloud.com/sarahjansenmusic Creative Commons — Attribution 3.0 Unported — CC BY 3.0 Free Download / Stream: http://bit.ly/l-miss-you Music promoted by Audio Library https://youtu.be/iYYxnasvfx8

Essential Math for AI

2023-01-05 · O'Reilly AI & ML Books O'Reilly Amazon

book

by Hala Nelson

AI/ML NLP ai-ml data machine-learning

Companies are scrambling to integrate AI into their systems and operations. But to build truly successful solutions, you need a firm grasp of the underlying mathematics. This accessible guide walks you through the math necessary to thrive in the AI field such as focusing on real-world applications rather than dense academic theory. Engineers, data scientists, and students alike will examine mathematical topics critical for AI--including regression, neural networks, optimization, backpropagation, convolution, Markov chains, and more--through popular applications such as computer vision, natural language processing, and automated systems. And supplementary Jupyter notebooks shed light on examples with Python code and visualizations. Whether you're just beginning your career or have years of experience, this book gives you the foundation necessary to dive deeper in the field. Understand the underlying mathematics powering AI systems, including generative adversarial networks, random graphs, large random matrices, mathematical logic, optimal control, and more Learn how to adapt mathematical methods to different applications from completely different fields Gain the mathematical fluency to interpret and explain how AI systems arrive at their decisions

Episode CX: Compiler Diagnostics

2022-12-30 · ADSP: Algorithms + Data Structures = Programs Listen

podcast_episode

by Conor Hoekstra , Bryce Adelstein Lelbach (NVIDIA)

In this episode, Conor and Bryce talk about compiler diagnostics and how we can improve them. Link to Episode 110 on Website

Twitter ADSP: The PodcastConor HoekstraBryce Adelstein LelbachShow Notes Date Recorded: 2022-12-22 Date Released: 2022-12-30 What is a Compiler DiagnosticClang’s Expressive DiagnosticsThe Elm Programming LanguageCompiler Driven DevelopmentRust mut keywordC++ const keywordVSCode Error Lens ExtensionC++17 [[nodiscard]]Python f-stringsPythong f-string =The art of printf() debuggingJT on TwitterIntro Song Info Miss You by Sarah Jansen https://soundcloud.com/sarahjansenmusic Creative Commons — Attribution 3.0 Unported — CC BY 3.0 Free Download / Stream: http://bit.ly/l-miss-you Music promoted by Audio Library https://youtu.be/iYYxnasvfx8

Pandas for Everyone: Python Data Analysis, 2nd Edition

2022-12-22 · O'Reilly Data Science Books O'Reilly Amazon

book

by Daniel Y. Chen

AI/ML Data Science DataViz Matplotlib Pandas Scikit-learn Seaborn data data-science data-science-tools

Manage and Automate Data Analysis with Pandas in Python Today, analysts must manage data characterized by extraordinary variety, velocity, and volume. Using the open source Pandas library, you can use Python to rapidly automate and perform virtually any data analysis task, no matter how large or complex. Pandas can help you ensure the veracity of your data, visualize it for effective decision-making, and reliably reproduce analyses across multiple data sets. Pandas for Everyone, 2nd Edition, brings together practical knowledge and insight for solving real problems with Pandas, even if youre new to Python data analysis. Daniel Y. Chen introduces key concepts through simple but practical examples, incrementally building on them to solve more difficult, real-world data science problems such as using regularization to prevent data overfitting, or when to use unsupervised machine learning methods to find the underlying structure in a data set. New features to the second edition include: Extended coverage of plotting and the seaborn data visualization library Expanded examples and resources Updated Python 3.9 code and packages coverage, including statsmodels and scikit-learn libraries Online bonus material on geopandas, Dask, and creating interactive graphics with Altair Chen gives you a jumpstart on using Pandas with a realistic data set and covers combining data sets, handling missing data, and structuring data sets for easier analysis and visualization. He demonstrates powerful data cleaning techniques, from basic string manipulation to applying functions simultaneously across dataframes. Once your data is ready, Chen guides you through fitting models for prediction, clustering, inference, and exploration. He provides tips on performance and scalability and introduces you to the wider Python data analysis ecosystem. Work with DataFrames and Series, and import or export data Create plots with matplotlib, seaborn, and pandas Combine data sets and handle missing data Reshape, tidy, and clean data sets so theyre easier to work with Convert data types and manipulate text strings Apply functions to scale data manipulations Aggregate, transform, and filter large data sets with groupby Leverage Pandas advanced date and time capabilities Fit linear models using statsmodels and scikit-learn libraries Use generalized linear modeling to fit models with different response variables Compare multiple models to select the best one Regularize to overcome overfitting and improve performance Use clustering in unsupervised machine learning ...

Data Visualization with Python and JavaScript, 2nd Edition

2022-12-14 · O'Reilly Data Visualization Books O'Reilly Amazon

book

by Kyran Dale

API DataViz HTML JavaScript Matplotlib NumPy Pandas Plotly Seaborn data data-science data-science-tasks +1 more

How do you turn raw, unprocessed, or malformed data into dynamic, interactive web visualizations? In this practical book, author Kyran Dale shows data scientists and analysts--as well as Python and JavaScript developers--how to create the ideal toolchain for the job. By providing engaging examples and stressing hard-earned best practices, this guide teaches you how to leverage the power of best-of-breed Python and JavaScript libraries. Python provides accessible, powerful, and mature libraries for scraping, cleaning, and processing data. And while JavaScript is the best language when it comes to programming web visualizations, its data processing abilities can't compare with Python's. Together, these two languages are a perfect complement for creating a modern web-visualization toolchain. This book gets you started. You'll learn how to: Obtain data you need programmatically, using scraping tools or web APIs: Requests, Scrapy, Beautiful Soup Clean and process data using Python's heavyweight data processing libraries within the NumPy ecosystem: Jupyter notebooks with pandas+Matplotlib+Seaborn Deliver the data to a browser with static files or by using Flask, the lightweight Python server, and a RESTful API Pick up enough web development skills (HTML, CSS, JS) to get your visualized data on the web Use the data you've mined and refined to create web charts and visualizations with Plotly, D3, Leaflet, and other libraries

Convert Your Unstructured Data To Embedding Vectors For More Efficient Machine Learning With Towhee

2022-12-12 · Data Engineering Podcast Listen

podcast_episode

by Frank Liu , Tobias Macey

AI/ML Data Engineering Data Management GitHub

Preamble This is a cross-over episode from our new show The Machine Learning Podcast, the show about going from idea to production with machine learning.

Summary Data is one of the core ingredients for machine learning, but the format in which it is understandable to humans is not a useful representation for models. Embedding vectors are a way to structure data in a way that is native to how models interpret and manipulate information. In this episode Frank Liu shares how the Towhee library simplifies the work of translating your unstructured data assets (e.g. images, audio, video, etc.) into embeddings that you can use efficiently for machine learning, and how it fits into your workflow for model development.

Announcements

Hello and welcome to the Machine Learning Podcast, the podcast about machine learning and how to bring it from idea to delivery. Building good ML models is hard, but testing them properly is even harder. At Deepchecks, they built an open-source testing framework that follows best practices, ensuring that your models behave as expected. Get started quickly using their built-in library of checks for testing and validating your model’s behavior and performance, and extend it to meet your specific needs as your model evolves. Accelerate your machine learning projects by building trust in your models and automating the testing that you used to do manually. Go to themachinelearningpodcast.com/deepchecks today to get started! Your host is Tobias Macey and today I’m interviewing Frank Liu about how to use vector embeddings in your ML projects and how Towhee can reduce the effort involved

Interview

Introduction How did you get involved in machine learning? Can you describe what Towhee is and the story behind it? What is the problem that Towhee is aimed at solving? What are the elements of generating vector embeddings that pose the greatest challenge or require the most effort? Once you have an embedding, what are some of the ways that it might be used in a machine learning project?

Are there any design considerations that need to be addressed in the form that an embedding takes and how it impacts the resultant model that relies on it? (whether for training or inference)

Can you describe how the Towhee framework is implemented?

What are some of the interesting engineering challenges that needed to be addressed? How have the design/goals/scope of the project shifted since it began?

What is the workflow for someone using Towhee in the context of an ML project? What are some of the types optimizations that you have incorporated into Towhee?

What are some of the scaling considerations that users need to be aware of as they increase the volume or complexity of data that they are processing?

What are some of the ways that using Towhee impacts the way a data scientist or ML engineer approach the design development of their model code? What are the interfaces available for integrating with and extending Towhee? What are the most interesting, innovative, or unexpected ways that you have seen Towhee used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Towhee? When is Towhee the wrong choice? What do you have planned for the future of Towhee?

Contact Info

LinkedIn fzliu on GitHub Website @frankzliu on Twitter

Parting Question

From your perspective, what is the biggest barrier to adoption of machine learning today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.init covers the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story. To help other people find the

Python Data Science Handbook, 2nd Edition

2022-12-08 · O'Reilly Data Science Books O'Reilly Amazon

book

by Jake VanderPlas

AI/ML Data Science Matplotlib NumPy Pandas Scikit-learn programming-languages software-development

Python is a first-class tool for many researchers, primarily because of its libraries for storing, manipulating, and gaining insight from data. Several resources exist for individual pieces of this data science stack, but only with the new edition of Python Data Science Handbook do you get them all—IPython, NumPy, pandas, Matplotlib, Scikit-Learn, and other related tools. Working scientists and data crunchers familiar with reading and writing Python code will find the second edition of this comprehensive desk reference ideal for tackling day-to-day issues: manipulating, transforming, and cleaning data; visualizing different types of data; and using data to build statistical or machine learning models. Quite simply, this is the must-have reference for scientific computing in Python. With this handbook, you'll learn how: IPython and Jupyter provide computational environments for scientists using Python NumPy includes the ndarray for efficient storage and manipulation of dense data arrays Pandas contains the DataFrame for efficient storage and manipulation of labeled/columnar data Matplotlib includes capabilities for a flexible range of data visualizations Scikit-learn helps you build efficient and clean Python implementations of the most important and established machine learning algorithms

The Art of Data-Driven Business

2022-12-02 · O'Reilly Data Science Books O'Reilly Amazon

book

by Alan Bernardo Palacio

AI/ML Analytics Data Science Marketing Pandas Seaborn data data-science data-science-tools

Learn how to integrate data-driven methodologies and machine learning into your business decision-making processes with 'The Art of Data-Driven Business.' This comprehensive guide shows you how to apply Python-based machine learning techniques to real-world challenges, transforming your organization into an innovative and well-informed enterprise. What this Book will help me do Create professional-quality data visualizations using Python's seaborn library to derive business insights. Analyze customer behavior, including predicting churn, with machine learning techniques. Apply clustering algorithms to segment customers for targeted marketing campaigns. Utilize pandas effectively for pricing and sales analytics to optimize your pricing strategies. Forecast outcomes of promotional strategies to determine costs and benefits and maximize performance. Author(s) None Palacio is an experienced data scientist and educator who specializes in the application of machine learning to solve business problems. With extensive real-world industry experience, Palacio brings practical insights and methodologies to learners. Their teaching connects technical knowledge to actionable business strategies. Who is it for? This book is ideal for business professionals aiming to incorporate data science into their strategies and technical experts seeking to leverage machine learning for business scenarios. Beginners to Python can find foundational help, while data scientists will appreciate the focused practical applications. It's perfect for individuals seeking a strong data-driven perspective in marketing, sales, and customer management.

talk-data.com

Activity Trend

Top Events

Top Speakers

Exploring The Nuances Of Building An Intentional Data Culture

Applied Geospatial Data Science with Python

Building A Data Mesh Platform At PayPal

Experimentation for Engineers

The View Below The Waterline Of Apache Iceberg And How It Fits In Your Data Lakehouse

Data Mining and Predictive Analytics for Business Decisions

Let The Whole Team Participate In Data With The Quilt Versioned Data Hub

45: 3-Step Guide To Building Your First Data Science Project

Reflecting On The Past 6 Years Of Data Engineering

Graph Data Science with Neo4j

Let Your Business Intelligence Platform Build The Models Automatically With Omni Analytics

Safely Test Your Applications And Analytics With Production Quality Data Using Tonic AI

Episode 112: 2022 Retro & Running!

Essential Math for AI

Episode CX: Compiler Diagnostics

Pandas for Everyone: Python Data Analysis, 2nd Edition

Data Visualization with Python and JavaScript, 2nd Edition

Convert Your Unstructured Data To Embedding Vectors For More Efficient Machine Learning With Towhee

Python Data Science Handbook, 2nd Edition

The Art of Data-Driven Business