In this course, you'll learn concepts and perform labs that showcase workflows using Unity Catalog - Databricks' unified and open governance solution for data and AI. We'll start off with a brief introduction to Unity Catalog, discuss fundamental data governance concepts, and then dive into a variety of topics including using Unity Catalog for data access control, managing external storage and tables, data segregation, and more. Pre-requisites: Beginner familiarity with the Databricks Data Intelligence Platform (selecting clusters, navigating the Workspace, executing notebooks), cloud computing concepts (virtual machines, object storage, etc.), production experience working with data warehouses and data lakes, intermediate experience with basic SQL concepts (select, filter, groupby, join, etc), beginner programming experience with Python (syntax, conditions, loops, functions), beginner programming experience with the Spark DataFrame API (Configure DataFrameReader and DataFrameWriter to read and write data, Express query transformations using DataFrame methods and Column expressions, etc.) Labs: Yes Certification Path: Databricks Certified Data Engineer Associate
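For orientation, here is a minimal sketch of the kind of access-control and external-storage statements the course works through. It assumes a Databricks notebook where `spark` is predefined; the catalog, schema, group, storage, and credential names are hypothetical placeholders.

```python
# Minimal sketch of Unity Catalog access control (assumes a Databricks notebook
# where `spark` is predefined; all object and group names are hypothetical).
spark.sql("GRANT USE CATALOG ON CATALOG main TO `data-analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.sales TO `data-analysts`")
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `data-analysts`")

# Registering external storage so external tables can be governed centrally.
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS sales_landing
    URL 'abfss://landing@examplestorage.dfs.core.windows.net/sales'
    WITH (STORAGE CREDENTIAL sales_cred)
""")
```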
Topic: Apache Spark (581 tagged)
In this course, you’ll learn how to orchestrate data pipelines with Lakeflow Jobs (previously Databricks Workflows) and schedule dashboard updates to keep analytics up-to-date. We’ll cover topics like getting started with Lakeflow Jobs, how to use Databricks SQL for on-demand queries, and how to configure and schedule dashboards and alerts to reflect updates to production data pipelines. Pre-requisites: Beginner familiarity with the Databricks Data Intelligence Platform (selecting clusters, navigating the Workspace, executing notebooks), cloud computing concepts (virtual machines, object storage, etc.), production experience working with data warehouses and data lakes, intermediate experience with basic SQL concepts (select, filter, groupby, join, etc), beginner programming experience with Python (syntax, conditions, loops, functions), beginner programming experience with the Spark DataFrame API (Configure DataFrameReader and DataFrameWriter to read and write data, Express query transformations using DataFrame methods and Column expressions, etc.) Labs: No Certification Path: Databricks Certified Data Engineer Associate
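As a rough illustration of the kind of scheduling the course covers, the sketch below creates a nightly job through the Databricks Jobs REST API (version 2.1). The workspace URL, token, cluster ID, notebook path, and cron expression are hypothetical placeholders; treat the payload as an assumption-level sketch rather than a definitive recipe.

```python
# Hedged sketch: create a scheduled job via the Databricks Jobs API (2.1).
# Workspace URL, token, cluster ID, and notebook path are hypothetical placeholders.
import requests

host = "https://<your-workspace>.cloud.databricks.com"
token = "<personal-access-token>"

job_spec = {
    "name": "nightly-dashboard-refresh",
    "tasks": [
        {
            "task_key": "refresh_gold_tables",
            "existing_cluster_id": "<cluster-id>",
            "notebook_task": {"notebook_path": "/Pipelines/refresh_gold"},
        }
    ],
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # run daily at 02:00
        "timezone_id": "UTC",
    },
}

resp = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])
```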
In this course, you’ll learn how to define and schedule data pipelines that incrementally ingest and process data through multiple tables on the Data Intelligence Platform, using Lakeflow Declarative Pipelines in Spark SQL and Python. We’ll cover topics like how to get started with Lakeflow Declarative Pipelines, how Lakeflow Declarative Pipelines tracks data dependencies in data pipelines, how to configure and run data pipelines using the Lakeflow Declarative Pipelines UI, how to use Python or Spark SQL to define data pipelines that ingest and process data through multiple tables on the Data Intelligence Platform, using Auto Loader and Lakeflow Declarative Pipelines, how to use APPLY CHANGES INTO syntax to process Change Data Capture feeds, and how to review event logs and data artifacts created by pipelines and troubleshoot syntax. By streamlining and automating reliable data ingestion and transformation workflows, this course equips you with the foundational data engineering skills needed to help kickstart AI use cases. Whether you're preparing high-quality training data or enabling real-time AI-driven insights, this course is a key step in advancing your AI journey. Pre-requisites: Beginner familiarity with the Databricks Data Intelligence Platform (selecting clusters, navigating the Workspace, executing notebooks), cloud computing concepts (virtual machines, object storage, etc.), production experience working with data warehouses and data lakes, intermediate experience with basic SQL concepts (select, filter, groupby, join, etc), beginner programming experience with Python (syntax, conditions, loops, functions), beginner programming experience with the Spark DataFrame API (Configure DataFrameReader and DataFrameWriter to read and write data, Express query transformations using DataFrame methods and Column expressions, etc.) Labs: No Certification Path: Databricks Certified Data Engineer Associate
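To make the Python flavor of this concrete, here is a hedged sketch of a declarative pipeline using the `dlt` module (the former Delta Live Tables API). It runs as part of a pipeline rather than a plain notebook, and the source path, table names, and key columns are hypothetical placeholders.

```python
# Hedged sketch of a declarative pipeline in Python (the former Delta Live Tables
# `dlt` module). Paths, table names, and key columns are hypothetical placeholders.
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw customer change feed ingested with Auto Loader")
def customers_bronze():
    return (
        spark.readStream.format("cloudFiles")   # `spark` is provided by the pipeline runtime
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/landing/customers/")
    )

# Target table for the CDC feed, then the Python equivalent of APPLY CHANGES INTO.
dlt.create_streaming_table("customers_silver")

dlt.apply_changes(
    target="customers_silver",
    source="customers_bronze",
    keys=["customer_id"],
    sequence_by=col("updated_at"),
)
```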
In this course, you’ll learn how to optimize workloads and physical layout with Spark and Delta Lake and analyze the Spark UI to assess performance and debug applications. We’ll cover topics like streaming, liquid clustering, data skipping, caching, Photon, and more. Pre-requisites: Ability to perform basic code development tasks using the Databricks workspace (create clusters, run code in notebooks, use basic notebook operations, import repos from git, etc), intermediate programming experience with SQL and PySpark (extract data from a variety of file formats and data sources, apply a number of common transformations to clean data, reshape and manipulate complex data using advanced built-in functions), intermediate programming experience with Delta Lake (create tables, perform complete and incremental updates, compact files, restore previous versions, etc.). Labs: Yes Certification Path: Databricks Certified Data Engineer Professional
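As a small illustration of the layout techniques listed above, the sketch below declares liquid clustering keys, compacts the table, and caches a hot slice. It assumes a Databricks notebook where `spark` is predefined, and the table name and columns are hypothetical placeholders.

```python
# Hedged sketch of physical-layout tuning on Delta Lake (assumes a Databricks
# notebook where `spark` is predefined; the table name is hypothetical).

# Liquid clustering: declare clustering keys instead of static partitions.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.sales.events (
        event_id BIGINT, event_date DATE, user_id BIGINT, payload STRING
    ) CLUSTER BY (event_date, user_id)
""")

# Compact files and recluster so data skipping stays effective.
spark.sql("OPTIMIZE main.sales.events")

# Cache a hot slice while iterating, then check the effect in the Spark UI.
recent = spark.table("main.sales.events").where("event_date >= date_sub(current_date(), 7)")
recent.cache().count()
```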
In this course, you’ll learn how to ingest data efficiently with Lakeflow Connect and manage that data. Topics include ingestion with built-in connectors for SaaS applications, databases, and file sources, as well as ingestion from cloud object storage, and batch and streaming ingestion. We'll cover the new connector components, setting up the pipeline, validating the source, and mapping to the destination for each type of connector. We'll also cover how to ingest data into Delta tables with both batch and streaming ingestion, using the UI with Auto Loader, automating ETL with Lakeflow Declarative Pipelines, or using the API. This will prepare you to deliver the high-quality, timely data required for AI-driven applications by enabling scalable, reliable, and real-time data ingestion pipelines. Whether you're supporting ML model training or powering real-time AI insights, these ingestion workflows form a critical foundation for successful AI implementation. Pre-requisites: Beginner familiarity with the Databricks Data Intelligence Platform (selecting clusters, navigating the Workspace, executing notebooks), cloud computing concepts (virtual machines, object storage, etc.), production experience working with data warehouses and data lakes, intermediate experience with basic SQL concepts (select, filter, groupby, join, etc), beginner programming experience with Python (syntax, conditions, loops, functions), beginner programming experience with the Spark DataFrame API (Configure DataFrameReader and DataFrameWriter to read and write data, Express query transformations using DataFrame methods and Column expressions, etc.) Labs: No Certification Path: Databricks Certified Data Engineer Associate
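As a concrete reference point for the Auto Loader path mentioned above, here is a hedged sketch of incremental ingestion from cloud object storage into a Delta table. It assumes a Databricks notebook where `spark` is predefined; all paths and the target table name are hypothetical placeholders.

```python
# Hedged sketch of Auto Loader ingestion into a Delta table (assumes a Databricks
# notebook where `spark` is predefined; paths and table names are hypothetical).
(
    spark.readStream.format("cloudFiles")                     # Auto Loader source
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/Volumes/main/landing/_schemas/orders")
    .load("/Volumes/main/landing/orders/")
    .writeStream
    .option("checkpointLocation", "/Volumes/main/landing/_checkpoints/orders")
    .trigger(availableNow=True)                               # batch-style run over a streaming source
    .toTable("main.bronze.orders")
)
```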
This course equips professional-level machine learning practitioners with knowledge and hands-on experience in using Apache Spark™ for machine learning, including model fine-tuning. Additionally, the course covers using the Pandas library for scalable machine learning tasks. The initial section of the course focuses on the fundamentals of Apache Spark™ and its machine learning capabilities. The second section delves into fine-tuning models using the Hyperopt library. The final segment covers the Pandas API on Apache Spark™, including guidance on Pandas UDFs (User-Defined Functions) and the Functions API for model inference. Pre-requisites: Familiarity with Databricks workspace and notebooks; knowledge of machine learning model development and deployment with MLflow (e.g. basic understanding of DS/ML concepts, common model metrics and Python libraries, as well as a basic understanding of scaling workloads with Spark) Labs: Yes Certification Path: Databricks Certified Machine Learning Professional
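For a sense of the Pandas-on-Spark material, the sketch below distributes inference with a pandas UDF. The scoring logic and column names are hypothetical placeholders standing in for a real fitted model.

```python
# Hedged sketch of model inference with a pandas UDF; the scoring logic and
# column names are hypothetical placeholders for a real fitted model.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("pandas-udf-inference").getOrCreate()

@pandas_udf(DoubleType())
def predict_udf(feature: pd.Series) -> pd.Series:
    # Placeholder: in practice, load a fitted model once per executor
    # (e.g. via mlflow.pyfunc.load_model) and call its predict method.
    return feature * 0.42

df = spark.createDataFrame([(1.0,), (2.0,), (3.5,)], ["feature"])
df.withColumn("prediction", predict_udf("feature")).show()
```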
It's now easier than ever for less technical users to access, manage, and analyze data without needing help from IT. But self-service data management isn't always straightforward, and there are plenty of pitfalls, like data quality issues, skills gaps, and governance concerns. This session will cover practical ways to make self-service data management work.
We’re back for Part 2 of our Automation deep-dive, and the hits just keep coming! Host Al Martin reunites with IBM automation aces Sarah McAndrew (WW Automation Technical Sales) and Vikram Murali (App Mod & IT Automation Development) to push past the hype and map out the road ahead.
🎬 Episode Highlights
00:12 Observability – why seeing everything is half the battle
04:17 IBM Concert – orchestrating dev, ops & business in one score
07:42 Tech vs. Culture – the million-dollar question
11:34 Real-world use cases that ship value today
13:65 Scanning the Future of Automation (spoiler: it’s closer than you think)
15:32 Hashi – tooling that scales with you
17:42 Top resources to learn more and stay ahead of the curve
18:13 Lightning-round fun to wrap it up
Whether you’re wrangling legacy systems or architecting cloud-native dreams, this conversation will spark ideas, bust myths, and give you action items of immediate value. Smash that play button, tag a colleague, and join the movement to #MakingDataSimple!
🔗 Connect: Sarah McAndrew LinkedIn | Vikram Murali LinkedIn
🌐 Explore IBM Automation: ibm.com/automation
#MakingDataSimple #IBMAutomation #Observability #TechPodcast #FutureOfWork
Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.
Summary In this episode of the Data Engineering Podcast Tulika Bhatt, a senior software engineer at Netflix, talks about her experiences with large-scale data processing and the future of data engineering technologies. Tulika shares her journey into the data engineering field, discussing her work at BlackRock and Verizon before joining Netflix, and explains the challenges and innovations involved in managing Netflix's impression data for personalization and user experience. She highlights the importance of balancing off-the-shelf solutions with custom-built systems using technologies like Spark, Flink, and Iceberg, and delves into the complexities of ensuring data quality and observability in high-speed environments, including robust alerting strategies and semantic data auditing.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Your host is Tobias Macey and today I'm interviewing Tulika Bhatt about her experiences working on large-scale data processing and her insights on the future trajectory of the supporting technologies.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by outlining the ways that operating at large scale changes the ways that you need to think about the design of data systems?
When dealing with small-scale data systems it can be feasible to have manual processes. What are the elements of large-scale data systems that demand automation?
How can those large-scale automation principles be down-scaled to the systems that the rest of the world is operating?
A perennial problem in data engineering is that of data quality. The past 4 years have seen a significant growth in the number of tools and practices available for automating the validation and verification of data. In your experience working with high-volume data flows, what are the elements of data validation that are still unsolved?
Generative AI has taken the world by storm over the past couple years. How has that changed the ways that you approach your daily work?
What do you see as the future realities of working with data across various axes of large scale, real-time, etc.?
What are the most interesting, innovative, or unexpected ways that you have seen solutions to large-scale data management designed?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data management across axes of scale?
What are the ways that you are thinking about the future trajectory of your work?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
Links
BlackRock
Spark
Flink
Kafka
Cassandra
RocksDB
Netflix Maestro workflow orchestrator
PagerDuty
Iceberg
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
This session dives into the world of on-demand Apache Spark on Google Cloud. We explore its native integration with BigQuery, its new capabilities, and the benefits of using Spark for AI and machine learning (ML) workloads. We’ll discuss why Spark is a good choice for large-scale data processing, distributed training, and distributed inferencing. We’ll learn from Trivago how they leveraged Spark and BigQuery together to simplify their AI and ML workflows.
Overwhelmed by the complexities of building a robust and scalable data pipeline for algo trading with AlloyDB? This session provides the Google Cloud services, tools, recommendations, and best practices you need to succeed. We'll explore battle-tested strategies for implementing a low-latency, high-volume trading platform using AlloyDB and Spark Streaming on Dataproc.
Leverage Composer Orchestration to create a scalable and efficient data pipeline that meets the demands of algo trading and can handle increasing data volumes and trading activity by utilizing the scalability of Google Cloud services.
This session provides a comprehensive guide to building a secure and unified AI lakehouse on BigQuery with the power of open source software (OSS). We’ll explore essential components, including data ingestion, storage, and management; AI and machine learning workflows; pipeline orchestration; data governance; and operational efficiency. Learn about the newest features that support both Apache Spark and Apache Iceberg.
Modern analytics and AI workloads demand a unified storage layer for structured and unstructured data. Learn how Cloud Storage simplifies building data lakes based on Apache Iceberg. We’ll discuss storage best practices and new capabilities that enable high performance and cost efficiency. We’ll also guide you through real-world examples, including Iceberg data lakes with BigQuery or third-party solutions, data preparation for AI pipelines with Dataproc and Apache Spark, and how customers have built unified analytics and AI solutions on Cloud Storage.
NVIDIA GPUs accelerate batch ETL workloads with significant cost savings and performance gains. In this session, we will delve into optimizing Apache Spark on GCP Dataproc using the G2 accelerator-optimized series with L4 GPUs via the RAPIDS Accelerator for Apache Spark, showcasing up to 14x speedups and 80% cost reductions for Spark applications. We will demonstrate this acceleration through a reference AI architecture for financial transaction fraud detection and go through performance measurements.
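For context, enabling the RAPIDS Accelerator in a Spark session typically amounts to a plugin setting and a few resource configurations. The sketch below assumes the RAPIDS jars are already available on the cluster (for example via a Dataproc image or init action), and the resource amounts shown are illustrative only.

```python
# Hedged sketch of enabling the RAPIDS Accelerator for Apache Spark. Assumes the
# RAPIDS jars are on the cluster classpath; resource amounts are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("gpu-etl")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.sql.enabled", "true")
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "0.25")
    .getOrCreate()
)

# Regular DataFrame/SQL code is unchanged; supported operators run on the GPU.
spark.range(0, 10_000_000).selectExpr("id % 100 AS k", "id AS v") \
    .groupBy("k").sum("v").show(5)
```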
Unstructured data makes up the majority of all new data, a trend that's been growing exponentially since 2018. At these volumes, vector embeddings require indexes to be trained so that nearest neighbors can be efficiently approximated, avoiding the need for exhaustive lookups. However, training these indexes puts intense demand on vector databases to maintain a high ingest throughput. In this session, we will explain how the NVIDIA cuVS library is turbocharging vector database ingest with GPUs, providing speedups from 5-20x and improving data readiness.
✨ Updated April 1, 2025
What do Tickle Me Elmo, bourbon, and Taylor Swift tickets have in common? Scarcity. And in the world of marketing, it's one of the most powerful forces you can harness. This week, we’re throwing it back to one of our most insightful interviews: a conversation with Dr. Mindy Weinstein, Founder and CEO of Market MindShift, marketing professor at Grand Canyon University, Columbia Business School, and Wharton, and author of The Power of Scarcity.
We dig into:
The psychology behind scarcity and why it drives us to act now
The four types of scarcity (you’ll want to write these down!)
How top brands (and yes, bourbon sellers) use scarcity to spark action
Why "reaching humans" in digital marketing is more nuanced than ever
How you can ethically and effectively use scarcity to boost business results
📚 About the Book: In The Power of Scarcity, Dr. Weinstein combines her background in marketing and psychology to break down how scarcity messaging influences decision-making, and how you can leverage it to drive revenue, deepen loyalty, and create urgency without manipulation. With research, real-world examples, and interviews from brands like McDonald’s and 1-800-Flowers, it’s a must-read for anyone looking to up their marketing game.
📌 Timestamps:
01:41 Meet "Marketer" Mindy Weinstein
04:42 Technology in Marketing
07:50 One of the top women in digital marketing
09:12 The Power of Scarcity
19:16 The Four Types of Scarcity
20:41 Bourbon Scarcity
21:47 Businesses Leveraging Scarcity
🧠 Connect with Dr. Weinstein:
🔗 LinkedIn: linkedin.com/in/mindydweinstein
📘 Book: persuasioninbusiness.com
🌐 Website: marketmindshift.com
🎧 Originally aired: Season 7, Episode 5
Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.
Time Series Analysis with Spark provides a practical introduction to leveraging Apache Spark and Databricks for time series analysis. You'll learn to prepare, model, and deploy robust and scalable time series solutions for real-world applications. From data preparation to advanced generative AI techniques, this guide prepares you to excel in big data analytics.
What this book will help me do:
Understand the core concepts and architectures of Apache Spark for time series analysis.
Learn to clean, organize, and prepare time series data for big data environments.
Gain expertise in choosing, building, and training various time series models tailored to specific projects.
Master techniques to scale your models in production using Spark and Databricks.
Explore the integration of advanced technologies such as generative AI to enhance predictions and derive insights.
Author(s): Yoni Ramaswami, a Senior Solutions Architect at Databricks, has extensive experience in data engineering and AI solutions. With a focus on creating innovative big data and AI strategies across industries, Yoni authored this book to empower professionals to efficiently handle time series data. Yoni's approachable style ensures that both foundational concepts and advanced techniques are accessible to readers.
Who is it for? This book is ideal for data engineers, machine learning engineers, data scientists, and analysts interested in enhancing their expertise in time series analysis using Apache Spark and Databricks. Whether you're new to time series or looking to refine your skills, you'll find both foundational insights and advanced practices explained clearly. A basic understanding of Spark is helpful but not required.
In this podcast episode, we talked with Bartosz Mikulski about Data Intensive AI.
About the Speaker: Bartosz is an AI and data engineer. He specializes in moving AI projects from the good-enough-for-a-demo phase to production by building a testing infrastructure and fixing the issues detected by tests. On top of that, he teaches programmers and non-programmers how to use AI. He contributed one chapter to the book 97 Things Every Data Engineer Should Know, and he was a speaker at several conferences, including Data Natives, Berlin Buzzwords, and Global AI Developer Days.
In this episode, we discuss Bartosz’s career journey, the importance of testing in data pipelines, and how AI tools like ChatGPT and Cursor are transforming development workflows. From prompt engineering to building Chrome extensions with AI, we dive into practical use cases, tools, and insights for anyone working in data-intensive AI projects. Whether you’re a data engineer, AI enthusiast, or just curious about the future of AI in tech, this episode offers valuable takeaways and real-world experiences.
0:00 Introduction to Bartosz and his background
4:00 Bartosz’s career journey from Java development to AI engineering
9:05 The importance of testing in data engineering
11:19 How to create tests for data pipelines
13:14 Tools and approaches for testing data pipelines
17:10 Choosing Spark for data engineering projects
19:05 The connection between data engineering and AI tools
21:39 Use cases of AI in data engineering and MLOps
25:13 Prompt engineering techniques and best practices
31:45 Prompt compression and caching in AI models
33:35 Thoughts on DeepSeek and open-source AI models
35:54 Using AI for lead classification and LinkedIn automation
41:04 Building Chrome extensions with AI integration
43:51 Comparing Cursor and GitHub Copilot for coding
47:11 Using ChatGPT and Perplexity for AI-assisted tasks
52:09 Hosting static websites and using AI for development
54:27 How blogging helps attract clients and share knowledge
58:15 Using AI to assist with writing and content creation
🔗 CONNECT WITH Bartosz LinkedIn: https://www.linkedin.com/in/mikulskibartosz/ Github: https://github.com/mikulskibartosz Website: https://mikulskibartosz.name/blog/
🔗 CONNECT WITH DataTalksClub Join the community - https://datatalks.club/slack.html Subscribe to our Google calendar to have all our events in your calendar - https://calendar.google.com/calendar/r?cid=ZjhxaWRqbnEwamhzY3A4ODA5azFlZ2hzNjBAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ Check other upcoming events - https://lu.ma/dtc-events LinkedIn - https://www.linkedin.com/company/datatalks-club/ Twitter - https://twitter.com/DataTalksClub Website - https://datatalks.club/
A challenge I frequently hear about from subscribers to my insights mailing list is how to design B2B data products for multiple user types with differing needs. From dashboards to custom apps and commercial analytics / AI products, data product teams often struggle to create a single solution that meets the diverse needs of technical and business users in B2B settings. If you're encountering this issue, you're not alone!
In this episode, I share my advice for tackling this challenge including the gift of saying "no.” What are the patterns you should be looking out for in your customer research? How can you choose what to focus on with limited resources? What are the design choices you should avoid when trying to build these products? I’m hoping by the end of this episode, you’ll have some strategies to help reduce the size of this challenge—particularly if you lack a dedicated UX team to help you sort through your various user/stakeholder demands.
Highlights / Skip to
The importance of proper user research and clustering “jobs to be done” around business importance vs. task frequency, ignoring the rest until your solution can show measurable value (4:29)
What “level” of skill to design for, and why “as simple as possible” isn’t what I generally recommend (13:44)
When it may be advantageous to use role- or feature-based permissions to hide/show/change certain aspects, UI elements, or features (19:50)
Leveraging AI and LLMs in-product to allow learning about the user and progressive disclosure and customization of UIs (26:44)
Leveraging the “old” solution of rapid prototyping, which is now faster than ever with AI and can accelerate learning (capturing user feedback) (31:14)
5 things I do not recommend doing when trying to satisfy multiple user types in your B2B AI or analytics product (34:14)
Quotes from Today’s Episode
If you're not talking to your users and stakeholders sufficiently, you're going to have a really tough time building a successful data product for one user – let alone for multiple personas. Listen for repeating patterns in what your users are trying to achieve (tasks they are doing). Focus on the jobs and tasks they do most frequently or the ones that bring the most value to their business. Forget about the rest until you've proven that your solution delivers real value for those core needs. It's more about understanding the problems and needs, not just the solutions. The solutions tend to be easier to design when the problem space is well understood. Users often suggest solutions, but it's our job to focus on the core problem we're trying to solve; simply entering any inbound requests verbatim into JIRA and then "eating away" at the list is not usually a reliable strategy. (5:52)

I generally recommend not going for "easy as possible" at the cost of shallow value. Instead, you're going to want to design for some "mid-level" ability, understanding that this may make early user experiences with the product more difficult. Why? Oversimplification can mislead because data is complex, problems are multivariate, and data isn't always ideal. There are also "n" number of "not-first" impressions users will have with your product, which also means there is only one "first impression." The idea, conceptually, is to design an amazing experience for those "n" experiences, but not to the point that users never realize value and give up on the product. While I'd prefer no friction, technical products sometimes will have to have a little friction up front; however, don't use this as an excuse for poor design. This is hard to get right even when you have design resources, and it's why UX design matters: thinking this through ends up determining, in part, whether users obtain the promise of value you made to them. (14:21)

As an alternative to rigid role- and feature-based permissions in B2B data products, you might consider leveraging AI and/or LLMs in your UI as a means of simplifying and customizing the UI for particular users. This approach allows users to interrogate the product about the UI, customize the UI, and even lets the product learn over time about the user's questions (jobs to be done) such that it becomes organically customized to their needs. This is in contrast to the rigid buckets that role- and permission-based customization present. However, as discussed in my previous episode (164 - "The Hidden UX Taxes that AI and LLM Features Impose on B2B Customers Without Your Knowledge"), designing effective AI features and capabilities can also make things worse due to the probabilistic nature of the responses GenAI produces. As such, this approach may benefit from a UX designer or researcher familiar with designing data products. Understanding what "quality" means to the user, and how to measure it, is especially critical if you're going to leverage AI and LLMs to make the product UX better. (20:13)

The old solution of rapid prototyping is even more valuable now, because it's possible to prototype even faster. However, prototyping is not just about learning whether your solution is on track. Whether you use AI or pencil and paper, prototyping early in the product development process should be framed as a "prop to get users talking." In other words, it is a prop to facilitate problem and need clarity, not solution clarity. Its purpose is to spark conversation and determine if you're solving the right problem. As you iterate, your need to continually validate the problem should shrink, which will present itself in the form of consistent feedback you hear from end users. This is the point where you know you can focus on the design of the solution. Innovation happens when we learn, so the goal is to increase your learning velocity. (31:35)

Have you ever been caught in the trap of prioritizing feature requests based on volume? I get it. It's tempting to give the people what they think they want. For example, imagine ten users clamoring for control over specific parameters in your machine learning forecasting model. You could give them that control, thinking you're solving the problem because, hey, that's what they asked for! But did you stop to ask why they want that control? The reasons behind those requests could be wildly different. By simply handing over the keys to all the model parameters, you might be creating a whole new set of problems. Users now face a "usability tax," trying to figure out which parameters to lock and which to let float. The key takeaway? Focus on how frequently the same problems occur across your users, not just how often a given tactic or "solution" method (e.g., a "model," "dashboard," or "feature") appears in a stakeholder or user request. Remember, problems are often disguised as solutions. We've got to dig deeper and uncover the real needs, not just address the symptoms. (36:19)
Summary In this episode of the Data Engineering Podcast Rajan Goyal, CEO and co-founder of Datapelago, talks about improving efficiencies in data processing by reimagining system architecture. Rajan explains the shift from hyperconverged to disaggregated and composable infrastructure, highlighting the importance of accelerated computing in modern data centers. He discusses the evolution from proprietary to open, composable stacks, emphasizing the role of open table formats and the need for a universal data processing engine, and outlines Datapelago's strategy to leverage existing frameworks like Spark and Trino while providing accelerated computing benefits.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Your host is Tobias Macey and today I'm interviewing Rajan Goyal about how to drastically improve efficiencies in data processing by re-imagining the system architecture.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by outlining the main factors that contribute to performance challenges in data lake environments?
The different components of open data processing systems have evolved from different starting points with different objectives. In your experience, how has that unplanned and unsynchronized evolution of the ecosystem hindered the capabilities and adoption of open technologies?
The introduction of a new cross-cutting capability (e.g. Iceberg) has typically taken a substantial amount of time to gain support across different engines and ecosystems. What do you see as the point of highest leverage to improve the capabilities of the entire stack with the least amount of coordination?
What was the motivating insight that led you to invest in the technology that powers Datapelago?
Can you describe the system design of Datapelago and how it integrates with existing data engines?
The growth in the generation and application of unstructured data is a notable shift in the work being done by data teams. What are the areas of overlap in the fundamental nature of data (whether structured, semi-structured, or unstructured) that you are able to exploit to bridge the processing gap?
What are the most interesting, innovative, or unexpected ways that you have seen Datapelago used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Datapelago?
When is Datapelago the wrong choice?
What do you have planned for the future of Datapelago?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Datapelago
MIPS Architecture
ARM Architecture
AWS Nitro
Mellanox
Nvidia
Von Neumann Architecture
TPU == Tensor Processing Unit
FPGA == Field-Programmable Gate Array
Spark
Trino
Iceberg (Podcast Episode)
Delta Lake (Podcast Episode)
Hudi (Podcast Episode)
Apache Gluten
Intermediate Representation
Turing Completeness
LLVM
Amdahl's Law
LSTM == Long Short-Term Memory
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Data engineers proficient in Databricks are currently in high demand. As organizations gather more data than ever before, skilled data engineers on platforms like Databricks become critical to business success. The Databricks Data Engineer Associate certification is proof that you have a complete understanding of the Databricks platform and its capabilities, as well as the essential skills to effectively execute various data engineering tasks on the platform. In this comprehensive study guide, you will build a strong foundation in all topics covered on the certification exam, including the Databricks Lakehouse and its tools and benefits. You'll also learn to develop ETL pipelines in both batch and streaming modes. Moreover, you'll discover how to orchestrate data workflows and design dashboards while maintaining data governance. Finally, you'll dive into the finer points of exactly what's on the exam and learn to prepare for it with mock tests. Author Derar Alhussein teaches you not only the fundamental concepts but also provides hands-on exercises to reinforce your understanding. From setting up your Databricks workspace to deploying production pipelines, each chapter is carefully crafted to equip you with the skills needed to master the Databricks Platform. By the end of this book, you'll know everything you need to ace the Databricks Data Engineer Associate certification exam with flying colors, and start your career as a certified data engineer from Databricks!
You'll learn how to:
Use the Databricks Platform and Delta Lake effectively
Perform advanced ETL tasks using Apache Spark SQL
Design multi-hop architecture to process data incrementally
Build production pipelines using Delta Live Tables and Databricks Jobs
Implement data governance using Databricks SQL and Unity Catalog
Derar Alhussein is a senior data engineer with a master's degree in data mining. He has over a decade of hands-on experience in software and data projects, including large-scale projects on Databricks. He currently holds eight certifications from Databricks, showcasing his proficiency in the field. Derar is also an experienced instructor, with a proven track record of success in training thousands of data engineers, helping them to develop their skills and obtain professional certifications.