In today's data-driven era, ensuring data reliability and enhancing our testing and development capabilities are paramount. Local unit testing has its merits but falls short when dealing with the volume of big data. One major challenge is running Spark jobs pre-deployment to ensure they produce expected results and handle production-level data volumes. In this talk, we will discuss how Autodesk leveraged Astronomer to improve pipeline development. We'll explore how it addresses challenges with sensitive and large data sets that cannot be transferred to local machines or non-production environments. Additionally, we'll cover how this approach supports over 10 engineers working simultaneously on different feature branches within the same repo. We will highlight the benefits, such as conflict-free development and testing and the elimination of concerns about data corruption when running DAGs on production Airflow servers. Join me to discover how solutions like Astronomer empower developers to work with increased efficiency and reliability. This talk is perfect for those interested in big data, cloud solutions, and innovative development practices.
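The abstract stops short of code, but the branch-isolation idea it describes can be sketched. Below is a purely hypothetical illustration (not Autodesk's actual setup): namespacing each DAG id and its output location by feature branch, assuming a BRANCH_NAME environment variable is injected when a branch is deployed.

```python
# Hypothetical sketch: keeping concurrent feature branches from colliding
# on a shared Airflow deployment. BRANCH_NAME is an assumed variable
# injected at deploy time; bucket and task names are placeholders.
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

branch = os.environ.get("BRANCH_NAME", "main")

with DAG(
    dag_id=f"spark_pipeline__{branch}",   # branch-scoped id: no clashes between branches
    start_date=datetime(2024, 1, 1),
    schedule=None,                        # feature branches run on demand only
    params={"output_prefix": f"s3://dev-bucket/{branch}/"},  # branch-scoped outputs
) as dag:
    submit_spark_job = EmptyOperator(task_id="submit_spark_job")  # stand-in for the real job
```

Each branch then reads and writes only under its own prefix, which is one way to get the conflict-free development the talk describes.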
Jupyter Notebooks are widely used by data scientists and engineers to prototype and experiment with data. However, these engineers are often required to work with other data or platform engineers to productionize these experiments due to the complexity of navigating infrastructure and systems. In this talk, we will deep dive into the PR https://github.com/apache/airflow/pull/34840 and share how Airflow can be leveraged as a platform to execute notebook pipelines (Python, Scala, or Spark) in dynamic environments like Kubernetes for various heterogeneous use cases. We will demonstrate how data scientists can use a Jupyter extension to easily build and manage such pipelines, which are executed using Airflow, streamlining data science workflow development and supercharging productivity.
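The specifics live in the linked PR, but the general pattern of running a parameterized notebook from Airflow can be sketched with the existing Papermill provider. A minimal illustration with placeholder paths and parameters (this is not the PR's API):

```python
# A minimal sketch of scheduling a parameterized Jupyter notebook with
# Airflow's Papermill provider (apache-airflow-providers-papermill).
# Notebook paths and the run_date parameter are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.providers.papermill.operators.papermill import PapermillOperator

with DAG(
    dag_id="notebook_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
) as dag:
    run_notebook = PapermillOperator(
        task_id="run_feature_engineering",
        input_nb="/notebooks/feature_engineering.ipynb",  # source notebook
        output_nb="/notebooks/out/fe_{{ ds }}.ipynb",     # executed copy, one per run
        parameters={"run_date": "{{ ds }}"},              # injected into the notebook
    )
```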
Summary
Data lakehouse architectures have been gaining significant adoption. To accelerate adoption in the enterprise, Microsoft has created the Fabric platform, based on their OneLake architecture. In this episode Dipti Borkar shares her experiences working on the product team at Fabric and explains the various use cases for the Fabric service.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management. Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Dipti Borkar about her work on Microsoft Fabric and performing analytics on data without...
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Microsoft Fabric is and the story behind it?
Data lakes in various forms have been gaining significant popularity as a unified interface to an organization's analytics. What are the motivating factors that you see for that trend?
Microsoft has been investing heavily in open source in recent years, and the Fabric platform relies on several open components. What are the benefits of layering on top of existing technologies rather than building a fully custom solution?
What are the elements of Fabric that were engineered specifically for the service?
What are the most interesting/complicated integration challenges?
How has your prior experience with Ahana and Presto informed your current work at Microsoft?
AI plays a substantial role in the product. What are the benefits of embedding Copilot into the data engine?
What are the challenges in terms of safety and reliability?
What are the most interesting, innovative, or unexpected ways that you have seen the Fabric platform used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data lakes generally, and Fabric specifically?
When is Fabric the wrong choice?
What do you have planned for the future of data lake analytics?
Contact Info
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
Links
Microsoft Fabric
Ahana episode
DB2
Distributed Spark
Presto
Azure Data
MAD Landscape (Podcast Episode, ML Podcast Episode)
Tableau
dbt
Medallion Architecture
Microsoft OneLake
ORC
Parquet
Avro
Delta Lake
Iceberg (Podcast Episode)
Hudi (Podcast Episode)
Hadoop
PowerBI (Podcast Episode)
Velox
Gluten
Apache XTable
GraphQL
Formula 1
McLaren
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
Starburst: 
This episode is brought to you by Starburst - an end-to-end data lakehouse platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, the query engine Apache Iceberg was designed for, Starburst is an open platform with support for all table formats including Apache Iceberg, Hive, and Delta Lake.
Reynold Xin, Co-founder and Chief Architect, Databricks, shares the latest innovations coming out of the Apache Spark™ open source project, including a preview of the anticipated release of Spark 4.0.
Speakers: Reynold Xin, Co-founder and Chief Architect, Databricks Tareef Kawaf, President, Posit Software, PBC
Speaker: Matei Zaharia, Original Creator of Apache Spark™ and MLflow; Chief Technologist, Databricks
Matei Zaharia, Original Creator of Apache Spark™ and MLflow and Chief Technologist at Databricks, open-sourced Unity Catalog live onstage at the Data + AI Summit 2024 in San Francisco.
Speakers: Matei Zaharia, Original Creator of Apache Spark™ and MLflow; Chief Technologist, Databricks Darshana Sivakumar, Staff Product Manager, Databricks
Organizations are looking for ways to securely exchange their data and collaborate with external partners to foster data-driven innovations. In the past, organizations had limited data sharing solutions, relinquishing control over how their sensitive data was shared with partners and little to no visibility into how their data was consumed. This created the risk of potential data misuse and data privacy breaches. Customers who tried using other clean room solutions have told us these solutions are limited and do not meet their needs, as they often require all parties to copy their data into the same platform, do not allow sophisticated analysis beyond basic SQL queries, and have limited visibility or control over their data.
Organizations need an open, flexible, and privacy-safe way to collaborate on data, and Databricks Clean Rooms meets these critical needs.
See a demo of Databricks Clean Rooms, now in Public Preview on AWS + Azure
Speaker: Matei Zaharia, Original Creator of Apache Spark™ and MLflow; Chief Technologist, Databricks
Summary: Data sharing and collaboration are important aspects of the data space. Matei Zaharia explains the evolution of the Databricks data platform to facilitate data sharing and collaboration for customers and their partners.
Delta Sharing allows you to share parts of your table with third parties authorized to view them. Over 16,000 data recipients use Delta Sharing, and 40% of them are not on Databricks, a testament to the platform's open nature.
Databricks Marketplace has been growing rapidly and now has over 2,000 data listings, making it one of the largest data marketplaces available. New Marketplace partners include T-Mobile, Tableau, Atlassian, Epsilon, Shutterstock and more.
To learn more about Delta Sharing features and the expansion of partner sharing ecosystem, see the recent blog: https://www.databricks.com/blog/whats-new-data-sharing-and-collaboration
Speaker: Matei Zaharia, Original Creator of Apache Spark™ and MLflow; Chief Technologist, Databricks
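For a sense of what the recipient side of Delta Sharing looks like, here is a minimal sketch using the open source delta-sharing Python client; the profile file and the share/schema/table names are placeholders.

```python
# Minimal consumer-side sketch of Delta Sharing. The provider issues the
# profile file (endpoint + bearer token); "retail.sales.orders" is a
# placeholder share/schema/table.
import delta_sharing

profile = "config.share"

# Discover what has been shared with us.
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())

# Load one shared table as a pandas DataFrame
# (URL format: <profile-file>#<share>.<schema>.<table>).
df = delta_sharing.load_as_pandas(f"{profile}#retail.sales.orders")
print(df.head())
```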
Reynold Xin explains the evolution of Apache Spark™, outlining several historical challenges and how the Spark community worked to make improvements, including the addition of PySpark.
Speaker: Reynold Xin, Co-founder and Chief Architect at Databricks
Databricks Co-founder and Chief Architect, Reynold Xin, on the evolution of Apache Spark™ and what's next, including Spark Connect and a preview of Apache Spark™ 4.0
Speaker: Reynold Xin, Co-founder and Chief Architect, Databricks
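Spark Connect, previewed above, decouples the client from the cluster: a thin client speaks gRPC to a remote driver. A minimal PySpark sketch, with a placeholder endpoint:

```python
# Minimal Spark Connect sketch (PySpark 3.4+ with the connect extras).
# The host and port below are placeholders for a real Spark Connect server.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .remote("sc://spark-connect.example.com:15002")  # remote Spark Connect endpoint
    .getOrCreate()
)

df = spark.range(5).selectExpr("id", "id * 2 AS doubled")
df.show()  # the plan runs on the remote cluster; results stream back to the client
```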
Summary
Stripe is a company that relies on data to power their products and business. To support that functionality they have invested in Trino and Iceberg for their analytical workloads. In this episode Kevin Liu shares some of the interesting features that they have built by combining those technologies, as well as the challenges that they face in supporting the myriad workloads that are thrown at this layer of their data platform.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management. Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Kevin Liu about his use of Trino and Iceberg for Stripe's data lakehouse.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what role Trino and Iceberg play in Stripe's data architecture?
What are the ways in which your job responsibilities intersect with Stripe's lakehouse infrastructure?
What were the requirements and selection criteria that led to the selection of that combination of technologies?
What are the other systems that feed into and rely on the Trino/Iceberg service?
What kinds of questions are you answering with table metadata? (see the sketch after this list)
What use case/team does that support?
What is the comparative utility of the Iceberg REST catalog?
What are the shortcomings of Trino and Iceberg?
What are the most interesting, innovative, or unexpected ways that you have seen Iceberg/Trino used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Stripe's data infrastructure?
When is a lakehouse on Trino/Iceberg the wrong choice?
What do you have planned for the future of Trino and Iceberg at Stripe?
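As a hedged illustration of the table-metadata questions above, the sketch below uses the open source trino Python client to read one of the hidden Iceberg metadata tables that Trino exposes; the host, catalog, schema, and table names are placeholders, not Stripe's.

```python
# Querying Iceberg table history through Trino's "$snapshots" metadata table.
# Connection details and the "orders" table are illustrative assumptions.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com",
    port=8080,
    user="analyst",
    catalog="iceberg",
    schema="analytics",
)
cur = conn.cursor()

# Each Iceberg table exposes metadata tables such as "<table>$snapshots",
# which answer questions like "when did this table change, and how?"
cur.execute(
    'SELECT snapshot_id, committed_at, operation FROM "orders$snapshots"'
)
for snapshot_id, committed_at, operation in cur.fetchall():
    print(snapshot_id, committed_at, operation)
```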
Contact Info
Substack LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
Links
Trino
Iceberg
Stripe
Spark
Redshift
Hive Metastore
Python Iceberg
Python Iceberg REST Catalog
Trino Metadata Table
Flink (Podcast Episode)
Tabular (Podcast Episode)
Delta Table (Podcast Episode)
Databricks Unity Catalog
Starburst
AWS Athena
Kevin's Trinofest Presentation
Alluxio (Podcast Episode)
Parquet
Hudi
Trino Project Tardigrade
Trino On Ice
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
Starburst: 
This episode is brought to you by Starburst - an end-to-end data lakehouse platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, the query engine Apache Iceberg was designed for, Starburst is an open platform with support for all table formats including Apache Iceberg, Hive, and Delta Lake.
Trusted by the teams at Comcast and Doordash, Starburst del
Speakers: - Alexander Booth, Asst Director of Research & Development, Texas Rangers - Ali Ghodsi, Co-Founder and CEO, Databricks - Bilal Aslam, Sr. Director of Product Management, Databricks - Darshana Sivakumar, Staff Product Manager, Databricks - Hannes Mühleisen, Creator of DuckDB, DuckDB Labs - Matei Zaharia, Chief Technology Officer and Co-Founder, Databricks - Reynold Xin, Chief Architect and Co-Founder, Databricks - Ryan Blue, CEO, Tabular - Tareef Kawaf, President, Posit Software, PBC - Yejin Choi, Sr Research Director Commonsense AI, AI2, University of Washington - Zeashan Pappa, Staff Product Manager, Databricks
About Databricks Databricks is the Data and AI company. More than 10,000 organizations worldwide — including Block, Comcast, Conde Nast, Rivian, and Shell, and over 60% of the Fortune 500 — rely on the Databricks Data Intelligence Platform to take control of their data and put it to work with AI. Databricks is headquartered in San Francisco, with offices around the globe, and was founded by the original creators of Lakehouse, Apache Spark™, Delta Lake and MLflow.
Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data… Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc
This book serves as the ultimate preparation for aspiring Databricks Certified Associate Developers specializing in Apache Spark. Deep dive into Spark's components, its applications, and exam techniques to achieve certification and expand your practical skills in big data processing and real-time analytics using Python.

What this book will help me do:
- Deeply understand Apache Spark's core architecture for building big data applications.
- Write optimized SQL queries and leverage the Spark DataFrame API for efficient data manipulation.
- Apply advanced Spark functions, including UDFs, to solve complex data engineering tasks.
- Use Spark Streaming capabilities to implement real-time and near-real-time processing solutions.
- Get hands-on preparation for the certification exam with mock tests and practice questions.

Author(s): Saba Shah is a seasoned data engineer with extensive experience working at Databricks and leading data science teams. With her in-depth knowledge of big data applications and Spark, she delivers clear, actionable insights in this book. Her approach emphasizes practical learning and real-world applications.

Who is it for? This book is ideal for data professionals such as engineers and analysts aiming to achieve Databricks certification. It is particularly helpful for individuals with moderate Python proficiency who are keen to understand Spark from scratch. If you're transitioning into big data roles, this guide prepares you comprehensively.
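For a flavor of the UDF material the exam covers, here is a small illustrative PySpark example (my own sketch, not taken from the book); the usual caveat applies that built-in functions outperform Python UDFs.

```python
# Illustrative PySpark UDF: mask the local part of an email address.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()

@udf(returnType=StringType())
def mask_email(email):
    # Keep the domain, hide the local part.
    local, _, domain = email.partition("@")
    return f"{'*' * len(local)}@{domain}"

df = spark.createDataFrame([("alice@example.com",)], ["email"])
df.select(mask_email("email").alias("masked")).show(truncate=False)
```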
This is always a good interview: a conversation from last year with Dr. Mindy Weinstein discussing how we try to reach humans through digital marketing, and the power of scarcity. I am reminded of this concept every time I look at the resale price of Taylor Swift tickets.
Marketing: The Power of Scarcity with Mindy Weinstein, Founder and CEO of Market MindShift; marketing instructor for Grand Canyon University, Columbia Business School, and Wharton. "Trying to reach humans" through digital marketing. Original episode: season 7, episode 5.

01:41 Meet "Marketer" Mindy Weinstein
04:42 Technology in Marketing
07:50 One of the top women in digital marketing
09:12 The power of scarcity
19:16 Four types of scarcity
20:41 Bourbon scarcity
21:47 Businesses leveraging scarcity

LinkedIn: linkedin.com/in/mindydweinstein
Website: https://www.persuasioninbusiness.com/book, https://www.marketmindshift.com/
Summary of Dr. Weinstein's book: Drive revenue and grow your business by using the powerful concept of scarcity. Scarcity isn't just one of the key principles of influence, it's arguably the most powerful, invoking the kind of primal instincts that were essential to our ancestors' survival. It's also the explanation for why, in the mid-1990s, $29.99 Tickle-Me-Elmo dolls were being scalped for $7,000 apiece. And yet, for all its power, scarcity is a principle that's little understood, even as it's frequently employed in sales and marketing campaigns. Research on scarcity is published mainly in academic journals, not easily accessible to the mainstream public, and often written from an economic, rather than psychological, point of view. In The Power of Scarcity, Dr. Mindy Weinstein leverages her deep expertise in both marketing and psychology to reveal how this influence principle can be used to boost sales, win negotiations, spark action, develop community, build customer loyalty, and more. As a digital marketer and doctor of philosophy in psychology, she brings both practical and academic insights to explain the psychology behind scarcity, why it has such an immense impact on decision making, and how, used correctly and ethically, it can influence the people who buy your products or services. In these pages, you'll gain a deeper understanding of why and how scarcity works in business, and specifically how different types of scarcity messages (supply related, demand related, time related, or limited edition) affect our brains. You'll see it in action from multiple perspectives, through case studies, research findings, and eye-opening interviews with current and former executives (from brands that include McDonald's, Harry & David, and 1-800-Flowers), as well as real-life customers' firsthand experiences. For anyone involved in sales and marketing today, The Power of Scarcity is a rare find, combining the best research on the subject as well as hands-on, tactical ways to apply the psychology behind it to knowledgeably harness that power to bolster your business.

Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.
In "Data Engineering with Databricks Cookbook," you'll learn how to efficiently build and manage data pipelines using Apache Spark, Delta Lake, and Databricks. This recipe-based guide offers techniques to transform, optimize, and orchestrate your data workflows.

What this book will help me do:
- Master Apache Spark for data ingestion, transformation, and analysis.
- Optimize data processing and improve query performance with Delta Lake.
- Manage streaming data processing with Spark Structured Streaming capabilities.
- Implement DataOps and DevOps workflows tailored for Databricks.
- Enforce data governance policies using Unity Catalog for scalable solutions.

Author(s): Pulkit Chadha, the author of this book, is a Senior Solutions Architect at Databricks. With extensive experience in data engineering and big data applications, he brings practical insights into implementing modern data solutions. His educational writings focus on empowering data professionals with actionable knowledge.

Who is it for? This book is ideal for data engineers, data scientists, and analysts who want to deepen their knowledge in managing and transforming large datasets. Readers should have an intermediate understanding of SQL, Python programming, and basic data architecture concepts. It is especially well-suited for professionals working with Databricks or similar cloud-based data platforms.
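To give a feel for the streaming recipes such a cookbook covers, here is a minimal hedged sketch of Spark Structured Streaming writing to a Delta table; the rate source stands in for a real feed like Kafka, the paths are assumed, and the delta-spark package must be available on the cluster.

```python
# Minimal Structured Streaming -> Delta sketch with placeholder local paths.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-to-delta").getOrCreate()

# Toy source emitting (timestamp, value) rows; a stand-in for Kafka or files.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

query = (
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/events")  # enables exactly-once recovery
    .start("/tmp/tables/events")                              # Delta table path
)
query.awaitTermination()  # blocks; stop with query.stop() in an interactive session
```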
This book is your gateway to mastering the skills required for achieving the Azure Data Engineer Associate certification (DP-203). Whether you're new to the field or a seasoned professional, it comprehensively prepares you for the challenges of the exam. Learn to design and implement advanced data solutions, secure sensitive information, and optimize data processes effectively.

What this book will help me do:
- Understand and utilize Azure's data services such as Azure Synapse and Azure Databricks for data processing.
- Master advanced data storage and management solutions, including designing partitions and lake architectures.
- Secure data with state-of-the-art tools like RBAC, encryption, and Azure Purview.
- Develop and manage data pipelines and workflows using tools like Azure Data Factory (ADF) and Spark.
- Prepare for and confidently pass the DP-203 certification exam with the included practical resources and guidance.

Author(s): The authors, Palmieri, Mettapalli, and Alex, bring a wealth of expertise in cloud and data engineering. With extensive industry experience, they've designed this guide to be both educational and practical, enabling learners to not only understand but also apply concepts in real-world scenarios. Their goal is to make complex topics approachable, supporting your journey to certification success.

Who is it for? This guide is perfect for aspiring and current data engineers aiming to achieve the Azure Data Engineer Associate certification (DP-203). It's particularly useful for professionals familiar with cloud services and basic data engineering concepts who want to delve deeper into Azure's offerings. Additionally, managers and learners preparing for roles involving Azure cloud data solutions will find the content invaluable for career advancement.
Welcome to the cozy corner of the tech world where ones and zeros mingle with casual chit-chat. Datatopics Unplugged is your go-to spot for relaxed discussions around tech, news, data, and society. Dive into conversations that flow like your morning coffee, where industry insights meet laid-back banter. Whether you're a data aficionado or just curious about the digital age, pull up a chair and let's explore the heart of data, unplugged style!
Stack Overflow and OpenAI Deal Controversy: Discussing the partnership controversy, with users protesting the lack of an opt-out option and how this could reshape the platform. Look into Phind here.
Apple and OpenAI Rumors - could ChatGPT be the new Siri? Examining rumors of ChatGPT potentially replacing Siri, and Apple's AI strategy compared to Microsoft's MAI-1. Check out more community opinions here.
Hello GPT-4o: Exploring the new era with OpenAI's GPT-4o that blends video, text, and audio for more dynamic human-AI interactions. Discussing AI's challenges under the European AI Act and ChatGPT's use in daily life and dating apps like Bumble.
Claude Takes Europe: Claude 3 now available in the EU. How does it compare to ChatGPT in coding and conversation?
ElevenLabs' Music Generation AI: A look at ElevenLabs' AI for generating music and the broader AI music landscape. How are these algorithms transforming music creation? Check out the AI Song Contest here.
Google Cloud's Big Oops with UniSuper: Unpack the shocking story of how Google Cloud accidentally wiped out UniSuper's account. What does this mean for data security and redundancy strategies?
The Great CLI Debate: Is Python really the right choice for CLI tools? We spark the debate over Python vs. Go and Rust in building efficient CLI tools.
Radically improve the quality of your data visualizations by employing core principles of color, typography, chart types, data storytelling, and more. Everyday Data Visualization is a field guide for design techniques that will improve the charts, reports, and data dashboards you build every day. Everything you learn is tool-agnostic, with universal principles you can apply to any data stack.

In Everyday Data Visualization you'll learn important design principles for the most common data visualizations:
- Harness the power of perception to guide a user's attention
- Bring data to life with color and typography
- Choose the best chart types for your data story
- Design for interactive visualizations
- Keep the user's needs first throughout your projects

This book gives you the tools you need to bring your data to life with clarity, precision, and flair. You'll learn how human brains perceive and process information, wield modern accessibility standards, get the basics of color theory and typography, and more.

About the Technology: Even mundane presentations like charts, dashboards, and infographics can become engaging and inspiring data stories! This book shows you how to upgrade the visualizations you create every day by improving the layout, typography, color, and accessibility. You'll discover timeless principles of design that help you highlight important features, compensate for missing information, and interact with live data flows.

About the Book: Everyday Data Visualization guides you through basic graphic design for the most common types of data visualization. You'll learn how to enhance charts with color, encourage users to interact and explore data, and create visualizations accessible to everyone. Along the way, you'll practice each new skill as you take a dashboard project from research to publication.

What's Inside:
- Bring data to life with color and typography
- Choose the best chart types for your data story
- Design interactive visualizations

About the Reader: For readers experienced with data analysis tools.

About the Author: Desireé Abbott has over a decade of experience in product analytics, business intelligence, science, design, and software engineering. The technical editor on this book was Michael Petrey.

Quotes:
"A delightful blend of data viz principles, guidance, and design tips. The treasure trove of insights I wish I had years ago!" - Alli Torban, Author of Chart Spark
"With vibrant enthusiasm and engaging conversational style, this book shines." - RJ Andrews, data storyteller
"Elegantly simplifies complex concepts, making them accessible even to beginners. An enlightening journey." - Renato Sinohara, Westwing Group SE
"Desiree's approachable writing style makes it easy to dive straight into this book, and you're in deep before you even know it. I guarantee you'll learn plenty." - Neil Richards, 5x Tableau Visionary, Author of Questions in Dataviz
This episode features Alli Torban, a leading data information designer, sharing her career journey from a data analyst to teaching data visualization to companies like Google and Moderna.
Alli advises on becoming a data viz designer, emphasizing the significance of data literacy, tool mastery, and building a portfolio with personal projects.
Connect with Alli Torban :
🤝 Follow on Linkedin
📔 Learn About Chart Spark
🧙♂️ Ace the Interview with Confidence
📩 Get my weekly email with helpful data career tips
📊 Come to my next free “How to Land Your First Data Job” training
🏫 Check out my 10-week data analytics bootcamp
Timestamps:
(08:16) Alli's Transition to Freelance and Starting Her Own Company (17:40) Advice for Aspiring Data Visualization Designers (21:42) Unlocking Creativity with Practical Inspiration and Prompts
Connect with Avery:
📺 Subscribe on YouTube
🎙Listen to My Podcast
👔 Connect with me on LinkedIn
🎵 TikTok
Mentioned in this episode: Join the last cohort of 2025! The LAST cohort of The Data Analytics Accelerator for 2025 kicks off on Monday, December 8th and enrollment is officially open!
To celebrate the end of the year, we’re running a special End-of-Year Sale, where you’ll get: ✅ A discount on your enrollment 🎁 6 bonus gifts, including job listings, interview prep, AI tools + more
If your goal is to land a data job in 2026, this is your chance to get ahead of the competition and start strong.
👉 Join the December Cohort & Claim Your Bonuses: https://www.datacareerjumpstart.com/daa
Traditional data architecture patterns are severely limited. To use these patterns, you have to ETL data into each tool, a cost-prohibitive process for making warehouse features available to all of your data. The lack of flexibility with these patterns requires you to lock into a set of priority tools and formats, which creates data silos and data drift. This practical book shows you a better way.

Apache Iceberg provides the capabilities, performance, scalability, and savings that fulfill the promise of an open data lakehouse. By following the lessons in this book, you'll be able to achieve interactive, batch, machine learning, and streaming analytics with this high-performance open source format. Authors Tomer Shiran, Jason Hughes, and Alex Merced from Dremio show you how to get started with Iceberg.

With this book, you'll learn:
- The architecture of Apache Iceberg tables
- What happens under the hood when you perform operations on Iceberg tables
- How to further optimize Iceberg tables for maximum performance
- How to use Iceberg with popular data engines such as Apache Spark, Apache Flink, and Dremio

Discover why Apache Iceberg is a foundational technology for implementing an open data lakehouse.
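As a hedged starter for the kind of setup the book walks through, the sketch below configures a local Hadoop catalog for Iceberg on Spark; the warehouse path and table names are placeholders, and the matching iceberg-spark-runtime jar is assumed to be on the classpath.

```python
# Minimal Iceberg-on-Spark sketch: local Hadoop catalog, placeholder paths.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-demo")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

spark.sql(
    "CREATE TABLE IF NOT EXISTS local.db.events (id BIGINT, ts TIMESTAMP) USING iceberg"
)
spark.sql("INSERT INTO local.db.events VALUES (1, current_timestamp())")

# Iceberg tables carry queryable metadata, e.g. snapshot history:
spark.sql("SELECT snapshot_id, committed_at FROM local.db.events.snapshots").show()
```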