talk-data.com

Topic

Spark

Apache Spark

big_data distributed_computing analytics

109 tagged

Activity Trend

Peak: 71 activities/quarter (2020-Q1 to 2026-Q1)

Activities

109 activities · Newest first

Before StatQuest became the go-to learning companion for millions of AI and ML practitioners… Before the “BAM! Double BAM! Triple BAM!” became a teaching tool that many learners adore...

There was just one guy in a genetics lab, trying desperately to explain his data analysis to coworkers so they didn't think he was working magic.

In this deeply personal and inspiring episode, Joshua Starmer (CEO & Founder | StatQuest) shares the real story behind his rise — a journey shaped by strategy, struggle, blunt feedback, and a relentless desire to make complicated ideas simple.

What you’ll discover:
🔹 How Josh went from helping colleagues in a genetics lab to becoming a renowned educator, treasuring his first 9 views and 2 subscribers as a big win.
🔹 How early feedback Josh received as a kid became a quiet spark — motivating him to improve how he explained things and ultimately shaping the teaching style millions now rely on.
🔹 How his method for breaking down complex topics with unique tools like his iconic BAM! helps make learning lighter and less intimidating.
🔹 His thoughts on AI tutors, avatars, and interactive learning, and how ethics, bias, and hallucinations relate to next-gen learning.

This is more than a conversation about statistics, data science, AI, education, or YouTube. It’s the story of a researcher who never imagined starting a learning platform, yet became one of the most trusted teachers in statistics and machine learning—turning frustration into clarity, confusion into curiosity, and small beginnings into a massive global impact.

📌 If you’ve ever struggled with PCA, logistic regression, K-means clustering, neural networks, or any tricky stats and ML concepts… chances are StatQuest made it click. Now, hear from the creator himself about what goes on behind the scenes, and finally understand how he made it click.

🔹A must-listen for: AI/ML learners, data scientists, educators, content creators, self-taught enthusiasts, and anyone who’s faced the fear of “I’m not good at explaining things.” Prepare to walk away inspired — and with a renewed belief that clarity is a superpower anyone can learn.

I missed my parents, so I built an AI that talks like them. This isn’t about replacing people—it’s about remembering the voices that make us feel safe. In this 90-minute episode of Data & AI with Mukundan, we explore what happens when technology stops chasing efficiency and starts chasing empathy. Mukundan shares the story behind “What Would Mom & Dad Say?”, a Streamlit + GPT-4 experiment that generates comforting messages in the voice of loved ones. You’ll hear:
The emotional spark that inspired the project
The plain-English prompts anyone can use to teach AI empathy
Boundaries & ethics of emotional AI
How this project reframed loneliness, creativity, and connection
Takeaway: AI can’t love you—but it can remind you of the people who do.
🔗 Try the free reflection prompts below:
THE ONE-PROMPT VERSION: “What Would Mom & Dad Say?”
“You are speaking to me as one of my parents. Choose the tone I mention: either Mom (warm and reflective) or Dad (practical and encouraging). First, notice the emotion in what I tell you—fear, stress, guilt, joy, or confusion—and name it back to me so I feel heard. Then reply in 3 parts:
1. Start by validating what I’m feeling, in a caring way.
2. Share a short story, lesson, or perspective that fits the situation.
3. End with one hopeful or guiding question that helps me think forward.
Keep your words gentle, honest, and simple. No technical language. Speak like someone who loves me and wants me to feel calm and capable again.”
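For readers who want to tinker, here is a minimal sketch of how a prompt like this could be wired into a Streamlit app with the OpenAI Python client. This is illustrative only, not the actual "What Would Mom & Dad Say?" code; the model name, widget layout, and message structure are assumptions.

```python
# Hypothetical wiring of the prompt above into a Streamlit UI.
# Not the project's actual code; model name and layout are assumptions.
import streamlit as st
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PARENT_PROMPT = "..."  # paste the one-prompt version from above here

st.title("What Would Mom & Dad Say?")
tone = st.radio("Whose voice?", ["Mom (warm and reflective)", "Dad (practical and encouraging)"])
message = st.text_area("What's on your mind?")

if st.button("Ask") and message:
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": PARENT_PROMPT},
            {"role": "user", "content": f"Tone: {tone}. {message}"},
        ],
    )
    st.write(reply.choices[0].message.content)
```

Run with `streamlit run app.py`; the system prompt does the emotional work, so the app itself stays tiny.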

Join the Discussion (comments hub): https://mukundansankar.substack.com/notes
Tools I use for my Podcast and Affiliate Partners:
Recording Partner: Riverside → Sign up here (affiliate)
Host Your Podcast: RSS.com (affiliate)
Research Tools: Sider.ai (affiliate)
Sourcetable AI: Join Here (affiliate)
🔗 Connect with Me:
Free Email Newsletter
Website: Data & AI with Mukundan
GitHub: https://github.com/mukund14
Twitter/X: @sankarmukund475
LinkedIn: Mukundan Sankar
YouTube: Subscribe

Summary In this episode of the Data Engineering Podcast Hannes Mühleisen and Mark Raasveldt, the creators of DuckDB, share their work on DuckLake, a new entrant in the open lakehouse ecosystem. They discuss how DuckLake is focused on simplicity and flexibility, and how it offers a unified catalog and table format compared to other lakehouse formats like Iceberg and Delta. Hannes and Mark share insights into how DuckLake reshapes data architecture by enabling local-first data processing, simplifying deployment of lakehouse solutions, and offering benefits such as encryption features, data inlining, and integration with existing ecosystems.
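To give a feel for how lightweight that deployment story is, here is a rough sketch of standing up a DuckLake catalog from DuckDB's Python API. File names and paths are illustrative and the syntax may shift between versions, so treat this as a sketch rather than a canonical quickstart.

```python
# A rough sketch of trying DuckLake from DuckDB's Python API.
# Names and paths are illustrative; consult the DuckLake docs for current syntax.
import duckdb

con = duckdb.connect()
con.sql("INSTALL ducklake")
con.sql("LOAD ducklake")
# The catalog is just a SQL database file; table data lands as Parquet under DATA_PATH.
con.sql("ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 'lake_files/')")
con.sql("CREATE TABLE lake.events AS SELECT 1 AS id, 'click' AS kind")
con.sql("SELECT * FROM lake.events").show()
```

The notable design choice discussed in the episode is that the metadata layer is an ordinary SQL database rather than a tree of JSON/Avro files, which is what makes a single-file local setup like this possible.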

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.

Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.

Your host is Tobias Macey and today I'm interviewing Hannes Mühleisen and Mark Raasveldt about DuckLake, the latest entrant into the open lakehouse ecosystem.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what DuckLake is and the story behind it?
What are the particular problems that DuckLake is solving for?
How does this compare to the capabilities of MotherDuck?
Iceberg and Delta already have a well established ecosystem, but so does DuckDB. Who are the primary personas that you are trying to focus on in these early days of DuckLake?
One of the major factors driving the adoption of formats like Iceberg is cost efficiency for large volumes of data. That brings with it challenges of large batch processing of data. How does DuckLake account for these axes of scale?
There is also a substantial investment in the ecosystem of technologies that support Iceberg. The most notable ecosystem challenge for DuckDB and DuckLake is in the query layer. How are you thinking about the evolution and growth of that capability beyond DuckDB (e.g. support in Trino/Spark/Flink)?
What are your opinions on the viability of a future where DuckLake and Iceberg become a unified standard and implementation? (Why can't Iceberg REST catalog implementations just use DuckLake under the hood?)
Digging into the specifics of the specification and implementation, what are some of the capabilities that it offers above and beyond Iceberg?
Is it now possible to enforce PK/FK constraints, indexing on underlying data?
Given that DuckDB has a vector type, how do you think about the support for vector storage/indexing?
How do the capabilities of DuckLake and the integration with DuckDB change the ways that data teams design their data architecture and access patterns?
What are your thoughts on the impact of "data gravity" in today's data ecosystem, with engines like DuckDB, KuzuDB, LanceDB, etc. available for embedded and edge use cases?
What are the most interesting, innovative, or unexpected ways that you have seen DuckLake used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on DuckLake?
When is DuckLake the wrong choice?
What do you have planned for the future of DuckLake?

Contact Info

Hannes: Website
Mark: Website

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links

DuckDB (Podcast Episode), DuckLake, DuckDB Labs, MySQL, CWI, MonetDB, Iceberg, Iceberg REST Catalog, Delta, Hudi, Lance, DuckDB Iceberg Connector, ACID == Atomicity, Consistency, Isolation, Durability, MotherDuck, MotherDuck Managed DuckLake, Trino, Spark, Presto, Spark DuckLake Demo, Delta Kernel, Arrow, dlt, S3 Tables, Attribute Based Access Control (ABAC), Parquet, Arrow Flight, Hadoop, HDFS, DuckLake Roadmap

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Send us a text

We’re back for Part 2 of our Automation deep-dive—and the hits just keep coming! Host Al Martin reunites with IBM automation aces Sarah McAndrew (WW Automation Technical Sales) and Vikram Murali (App Mod & IT Automation Development) to push past the hype and map out the road ahead.

🎬 Episode Highlights
00:12 Observability – why seeing everything is half the battle
04:17 IBM Concert – orchestrating dev, ops & business in one score
07:42 Tech vs. Culture – the million-dollar question
11:34 Real-world use cases that ship value today
13:65 Scanning the Future of Automation (spoiler: it’s closer than you think)
15:32 Hashi – tooling that scales with you
17:42 Top resources to learn more and stay ahead of the curve
18:13 Lightning-round fun to wrap it up

Whether you’re wrangling legacy systems or architecting cloud-native dreams, this conversation will spark ideas, bust myths, and give you action items of immediate value. Smash that play button, tag a colleague, and join the movement to #MakingDataSimple! 🔗 Connect: Sarah McAndrew LinkedIn | Vikram Murali LinkedIn 🌐 Explore IBM Automation: ibm.com/automation

#MakingDataSimple #IBMAutomation #Observability #TechPodcast #FutureOfWork

Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

Summary In this episode of the Data Engineering Podcast Tulika Bhatt, a senior software engineer at Netflix, talks about her experiences with large-scale data processing and the future of data engineering technologies. Tulika shares her journey into the data engineering field, discussing her work at BlackRock and Verizon before joining Netflix, and explains the challenges and innovations involved in managing Netflix's impression data for personalization and user experience. She highlights the importance of balancing off-the-shelf solutions with custom-built systems using technologies like Spark, Flink, and Iceberg, and delves into the complexities of ensuring data quality and observability in high-speed environments, including robust alerting strategies and semantic data auditing.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.

Your host is Tobias Macey and today I'm interviewing Tulika Bhatt about her experiences working on large scale data processing and her insights on the future trajectory of the supporting technologies.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by outlining the ways that operating at large scale changes the ways that you need to think about the design of data systems?
When dealing with small-scale data systems it can be feasible to have manual processes. What are the elements of large-scale data systems that demand automation?
How can those large-scale automation principles be down-scaled to the systems that the rest of the world are operating?
A perennial problem in data engineering is that of data quality. The past 4 years has seen significant growth in the number of tools and practices available for automating the validation and verification of data. In your experience working with high volume data flows, what are the elements of data validation that are still unsolved?
Generative AI has taken the world by storm over the past couple years. How has that changed the ways that you approach your daily work?
What do you see as the future realities of working with data across various axes of large scale, real-time, etc.?
What are the most interesting, innovative, or unexpected ways that you have seen solutions to large-scale data management designed?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data management across axes of scale?
What are the ways that you are thinking about the future trajectory of your work?

Contact Info

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links

BlackRock, Spark, Flink, Kafka, Cassandra, RocksDB, Netflix Maestro workflow orchestrator, PagerDuty, Iceberg

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Send us a text

✨ Updated April 1, 2025

What do Tickle Me Elmo, bourbon, and Taylor Swift tickets have in common? Scarcity. And in the world of marketing, it's one of the most powerful forces you can harness. This week, we’re throwing it back to one of our most insightful interviews — a conversation with Dr. Mindy Weinstein, Founder and CEO of Market MindShift, marketing professor at Grand Canyon University, Columbia Business School, and Wharton, and author of The Power of Scarcity.

We dig into:
The psychology behind scarcity and why it drives us to act now
The four types of scarcity (you’ll want to write these down!)
How top brands — and yes, bourbon sellers — use scarcity to spark action
Why "reaching humans" in digital marketing is more nuanced than ever
How you can ethically and effectively use scarcity to boost business results

📚 About the Book: In The Power of Scarcity, Dr. Weinstein combines her background in marketing and psychology to break down how scarcity messaging influences decision-making — and how you can leverage it to drive revenue, deepen loyalty, and create urgency without manipulation. With research, real-world examples, and interviews from brands like McDonald’s and 1-800-Flowers, it’s a must-read for anyone looking to up their marketing game.

📌 Timestamps:
01:41 Meet "Marketer" Mindy Weinstein
04:42 Technology in Marketing
07:50 One of the top women in digital marketing
09:12 The Power of Scarcity
19:16 The Four Types of Scarcity
20:41 Bourbon Scarcity
21:47 Businesses Leveraging Scarcity

🧠 Connect with Dr. Weinstein: 🔗 LinkedIn: linkedin.com/in/mindydweinstein 📘 Book: persuasioninbusiness.com 🌐 Website: marketmindshift.com 🎧 Originally aired: Season 7, Episode 5

Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

In this podcast episode, we talked with Bartosz Mikulski about Data Intensive AI.

About the Speaker: Bartosz is an AI and data engineer. He specializes in moving AI projects from the good-enough-for-a-demo phase to production by building a testing infrastructure and fixing the issues detected by tests. On top of that, he teaches programmers and non-programmers how to use AI. He contributed one chapter to the book 97 Things Every Data Engineer Should Know, and he was a speaker at several conferences, including Data Natives, Berlin Buzzwords, and Global AI Developer Days. 

In this episode, we discuss Bartosz’s career journey, the importance of testing in data pipelines, and how AI tools like ChatGPT and Cursor are transforming development workflows. From prompt engineering to building Chrome extensions with AI, we dive into practical use cases, tools, and insights for anyone working in data-intensive AI projects. Whether you’re a data engineer, AI enthusiast, or just curious about the future of AI in tech, this episode offers valuable takeaways and real-world experiences.
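As a concrete illustration of the pipeline-testing idea discussed in the episode (this is not Bartosz's own code), a transformation can be unit-tested by running it on a tiny hand-built input and asserting on the output, for example with pandas and pytest:

```python
# A minimal sketch of testing one pipeline transformation with pytest.
# The function and data are illustrative, not from the episode.
import pandas as pd

def dedupe_latest(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the most recent row per user_id."""
    return (df.sort_values("updated_at")
              .drop_duplicates("user_id", keep="last")
              .reset_index(drop=True))

def test_dedupe_latest_keeps_newest_row():
    df = pd.DataFrame({
        "user_id": [1, 1, 2],
        "updated_at": ["2024-01-01", "2024-02-01", "2024-01-15"],
    })
    out = dedupe_latest(df)
    assert len(out) == 2  # one row per user survives
    assert out.loc[out.user_id == 1, "updated_at"].item() == "2024-02-01"
```

The point is the shape of the test, not the library: small, deterministic inputs with known expected outputs catch regressions before they reach production data.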

0:00 Introduction to Bartosz and his background
4:00 Bartosz’s career journey from Java development to AI engineering
9:05 The importance of testing in data engineering
11:19 How to create tests for data pipelines
13:14 Tools and approaches for testing data pipelines
17:10 Choosing Spark for data engineering projects
19:05 The connection between data engineering and AI tools
21:39 Use cases of AI in data engineering and MLOps
25:13 Prompt engineering techniques and best practices
31:45 Prompt compression and caching in AI models
33:35 Thoughts on DeepSeek and open-source AI models
35:54 Using AI for lead classification and LinkedIn automation
41:04 Building Chrome extensions with AI integration
43:51 Comparing Cursor and GitHub Copilot for coding
47:11 Using ChatGPT and Perplexity for AI-assisted tasks
52:09 Hosting static websites and using AI for development
54:27 How blogging helps attract clients and share knowledge
58:15 Using AI to assist with writing and content creation

🔗 CONNECT WITH Bartosz LinkedIn: https://www.linkedin.com/in/mikulskibartosz/ Github: https://github.com/mikulskibartosz Website: https://mikulskibartosz.name/blog/

🔗 CONNECT WITH DataTalksClub Join the community - https://datatalks.club/slack.html Subscribe to our Google calendar to have all our events in your calendar - https://calendar.google.com/calendar/r?cid=ZjhxaWRqbnEwamhzY3A4ODA5azFlZ2hzNjBAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ Check other upcoming events - https://lu.ma/dtc-events LinkedIn - https://www.linkedin.com/company/datatalks-club/ Twitter - https://twitter.com/DataTalksClub Website - https://datatalks.club/

A challenge I frequently hear about from subscribers to my insights mailing list is how to design B2B data products for multiple user types with differing needs. From dashboards to custom apps and commercial analytics / AI products, data product teams often struggle to create a single solution that meets the diverse needs of technical and business users in B2B settings. If you're encountering this issue, you're not alone!

In this episode, I share my advice for tackling this challenge including the gift of saying "no.” What are the patterns you should be looking out for in your customer research? How can you choose what to focus on with limited resources? What are the design choices you should avoid when trying to build these products? I’m hoping by the end of this episode, you’ll have some strategies to help reduce the size of this challenge—particularly if you lack a dedicated UX team to help you sort through your various user/stakeholder demands. 

Highlights / Skip to

The importance of proper user research and clustering “jobs to be done” around business importance vs. task frequency—ignoring the rest until your solution can show measurable value (4:29)
What “level” of skill to design for, and why “as simple as possible” isn’t what I generally recommend (13:44)
When it may be advantageous to use role or feature-based permissions to hide/show/change certain aspects, UI elements, or features (19:50)
Leveraging AI and LLMs in-product to allow learning about the user and progressive disclosure and customization of UIs (26:44)
Leveraging the “old” solution of rapid prototyping—which is now faster than ever with AI, and can accelerate learning (capturing user feedback) (31:14)
5 things I do not recommend doing when trying to satisfy multiple user types in your B2B AI or analytics product (34:14)

Quotes from Today’s Episode

If you're not talking to your users and stakeholders sufficiently, you're going to have a really tough time building a successful data product for one user – let alone for multiple personas. Listen for repeating patterns in what your users are trying to achieve (tasks they are doing). Focus on the jobs and tasks they do most frequently or the ones that bring the most value to their business. Forget about the rest until you've proven that your solution delivers real value for those core needs. It's more about understanding the problems and needs, not just the solutions. The solutions tend to be easier to design when the problem space is well understood. Users often suggest solutions, but it's our job to focus on the core problem we're trying to solve; simply entering in any inbound requests verbatim into JIRA and then “eating away” at the list is not usually a reliable strategy. (5:52) I generally recommend not going for “easy as possible” at the cost of shallow value. Instead, you’re going to want to design for some “mid-level” ability, understanding that this may make early user experiences with the product more difficult. Why? Oversimplification can mislead because data is complex, problems are multivariate, and data isn't always ideal. There are also “n” number of “not-first” impressions users will have with your product. This also means there is only one “first impression” they have. As such, the idea conceptually is to design an amazing experience for the “n” experiences, but not to the point that users never realize value and give up on the product.  While I'd prefer no friction, technical products sometimes will have to have a little friction up front however, don't use this as an excuse for poor design. This is hard to get right, even when you have design resources, and it’s why UX design matters as thinking this through ends up determining, in part, whether users obtain the promise of value you made to them. (14:21) As an alternative to rigid role and feature-based permissions in B2B data products, you might consider leveraging AI and / or LLMs in your UI as a means of simplifying and customizing the UI to particular users. This approach allows users to potentially interrogate the product about the UI, customize the UI, and even learn over time about the user’s questions (jobs to be done) such that becomes organically customized over time to their needs. This is in contrast to the rigid buckets that role and permission-based customization present. However, as discussed in my previous episode (164 - “The Hidden UX Taxes that AI and LLM Features Impose on B2B Customers Without Your Knowledge”)  designing effective AI features and capabilities can also make things worse due to the probabilistic nature of the responses GenAI produces. As such, this approach may benefit from a UX designer or researcher familiar with designing data products. Understanding what “quality” means to the user, and how to measure it, is especially critical if you’re going to leverage AI and LLMs to make the product UX better. (20:13) The old solution of rapid prototyping is even more valuable now—because it’s possible to prototype even faster. However, prototyping is not just about learning if your solution is on track. Whether you use AI or pencil and paper, prototyping early in the product development process should be framed as a “prop to get users talking.” In other words, it is a prop to facilitate problem and need clarity—not solution clarity. 
Its purpose is to spark conversation and determine if you're solving the right problem. As you iterate, your need to continually validate the problem should shrink, which will present itself in the form of consistent feedback you hear from end users. This is the point where you know you can focus on the design of the solution. Innovation happens when we learn; so the goal is to increase your learning velocity. (31:35) Have you ever been caught in the trap of prioritizing feature requests based on volume? I get it. It's tempting to give the people what they think they want. For example, imagine ten users clamoring for control over specific parameters in your machine learning forecasting model. You could give them that control, thinking you're solving the problem because, hey, that's what they asked for! But did you stop to ask why they want that control? The reasons behind those requests could be wildly different. By simply handing over the keys to all the model parameters, you might be creating a whole new set of problems. Users now face a "usability tax," trying to figure out which parameters to lock and which to let float. The key takeaway? Focus on addressing the frequency that the same problems are occurring across your users, not just the frequency a given tactic or “solution” method (i.e. “model” or “dashboard” or “feature”) appears in a stakeholder or user request. Remember, problems are often disguised as solutions. We've got to dig deeper and uncover the real needs, not just address the symptoms. (36:19)

Summary In this episode of the Data Engineering Podcast Rajan Goyal, CEO and co-founder of Datapelago, talks about improving efficiencies in data processing by reimagining system architecture. Rajan explains the shift from hyperconverged to disaggregated and composable infrastructure, highlighting the importance of accelerated computing in modern data centers. He discusses the evolution from proprietary to open, composable stacks, emphasizing the role of open table formats and the need for a universal data processing engine, and outlines Datapelago's strategy to leverage existing frameworks like Spark and Trino while providing accelerated computing benefits.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.

Your host is Tobias Macey and today I'm interviewing Rajan Goyal about how to drastically improve efficiencies in data processing by re-imagining the system architecture.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by outlining the main factors that contribute to performance challenges in data lake environments?
The different components of open data processing systems have evolved from different starting points with different objectives. In your experience, how has that un-planned and un-synchronized evolution of the ecosystem hindered the capabilities and adoption of open technologies?
The introduction of a new cross-cutting capability (e.g. Iceberg) has typically taken a substantial amount of time to gain support across different engines and ecosystems. What do you see as the point of highest leverage to improve the capabilities of the entire stack with the least amount of co-ordination?
What was the motivating insight that led you to invest in the technology that powers Datapelago?
Can you describe the system design of Datapelago and how it integrates with existing data engines?
The growth in the generation and application of unstructured data is a notable shift in the work being done by data teams. What are the areas of overlap in the fundamental nature of data (whether structured, semi-structured, or unstructured) that you are able to exploit to bridge the processing gap?
What are the most interesting, innovative, or unexpected ways that you have seen Datapelago used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Datapelago?
When is Datapelago the wrong choice?
What do you have planned for the future of Datapelago?

Contact Info

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Datapelago, MIPS Architecture, ARM Architecture, AWS Nitro, Mellanox, Nvidia, Von Neumann Architecture, TPU == Tensor Processing Unit, FPGA == Field-Programmable Gate Array, Spark, Trino, Iceberg (Podcast Episode), Delta Lake (Podcast Episode), Hudi (Podcast Episode), Apache Gluten, Intermediate Representation, Turing Completeness, LLVM, Amdahl's Law, LSTM == Long Short-Term Memory

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

We’re improving DataFramed, and we need your help! We want to hear what you have to say about the show, and how we can make it more enjoyable for you—find out more here. Edge computing is poised to transform industries by bringing computation and data storage closer to the source of data generation. This shift unlocks new types of value creation with data & AI and allows for a privacy-first and deeply personalized use of AI on our devices. What will the edge computing transition look like? How do you ensure applications are edge-ready, and what is the role of AI in the transition? Derek Collison is the founder and CEO at Synadia. He is an industry veteran, entrepreneur and pioneer in large-scale distributed systems and cloud computing. Derek founded Synadia Communications and Apcera, and has held executive positions at Google, VMware, and TIBCO Software. He is also an active angel investor and a technology futurist around Artificial Intelligence, Machine Learning, IOT and Cloud Computing. Justyna Bak is VP of Marketing at Synadia. Justyna is a versatile executive bridging Marketing, Sales and Product, a spark-plug for innovation at startups and Fortune 100 and a tech expert in Data Analytics and AI, AppDev and Networking. She is an astute influencer, panelist and presenter (Google, HBR) and a respected leader in Silicon Valley and Europe. In the episode, Richie, Derek, and Justyna explore the transition from cloud to edge computing, the benefits of reduced latency, real-time decision-making in industries like manufacturing and retail, the role of AI at the edge, and the future of edge-native applications, and much more.

Links Mentioned in the Show:
Synadia
Connect with Derek and Justyna
Course: Understanding Cloud Computing
Related Episode: The Database is the Operating System with Mike Stonebraker, CTO & Co-Founder At DBOS
Rewatch sessions from RADAR: Forward Edition

New to DataCamp? Learn on the go using the DataCamp mobile app. Empower your business with world-class data and AI skills with DataCamp for business.

We’re improving DataFramed, and we need your help! We want to hear what you have to say about the show, and how we can make it more enjoyable for you—find out more here. Staying ahead means knowing what’s happening right now—not minutes or hours later. Real-time analytics promises to help teams react faster, make informed choices, and even predict issues before they arise. But implementing these systems is no small feat, and it requires careful alignment between technical capabilities and business needs. How do you ensure that real-time data actually drives impact? And what should organizations consider to make sure their real-time analytics investments lead to tangible benefits? Zuzanna Stamirowska is the CEO of Pathway.com - the fastest data processing engine on the market, which makes real-time intelligence possible. Zuzanna is also the author of the state-of-the-art forecasting model for maritime trade published by the National Academy of Sciences of the USA. While working on this project she saw that the digitization of traditional industries was slowed down by the lack of a software infrastructure capable of doing automated reasoning on top of data streams, in real time. This was the spark to launch Pathway. She holds a Master’s degree in Economics and Public Policy from Sciences Po, Ecole Polytechnique, and ENSAE, as well as a PhD in Complexity Science. Hélène Stanway is Independent Advisor & Consultant at HMLS Consulting Ltd. Hélène is an award-winning and highly effective insurance leader with a proven track record in emerging technologies, innovation, operations, data, change, and digital transformation. Her passion for actively combining the human element, design, and innovation alongside technology has enabled companies in the global insurance market to embrace change by achieving their desired strategic goals, improving processes, increasing efficiency, and deploying relevant tools. With a special passion for IoT and Sensor Technology, Hélène is a perpetual learner, driven to help delegates succeed. In the episode, Richie, Zuzanna and Hélène explore real-time analytics, their operational impact, use-cases of real-time analytics across industries, the benefits of adopting real-time analytics, the key roles and stakeholders you need to make that happen, operational challenges, strategies for effective adoption, the real-time future, common pitfalls, and much more. Links Mentioned in the Show:

Pathway
Connect with Zuzanna and Hélène
Article: What are digital twins and why do we need them?
Course: Time Series Analysis in Power BI
Related Episode: How Real Time Data Accelerates Business Outcomes with George Trujillo
Sign up to RADAR: Forward Edition

New to DataCamp? Learn on the go using the DataCamp mobile app. Empower your business with world-class data and AI skills with DataCamp for business.

It’s time for another episode of Data Engineering Central Podcast, our third one! Topics in this episode … * Should you use DuckDB or Polars? * Small Engineering Changes (PR Reviews) * Daft vs Spark on Databricks with Unity Catalog (Delta Lake) * Primary and Foreign keys in the Lake House Enjoy!
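To make the DuckDB-vs-Polars question concrete, here is the same toy aggregation in both engines. This is a sketch under two assumptions: a recent Polars release (with `group_by`) and DuckDB's ability to query dataframes that are in scope.

```python
# The same group-by aggregation in Polars and DuckDB, side by side.
# Illustrative only; both engines run in-process on a single machine.
import duckdb
import polars as pl

df = pl.DataFrame({"store": ["a", "a", "b"], "sales": [10, 20, 5]})

# Polars: dataframe-style API
by_store_pl = df.group_by("store").agg(pl.col("sales").sum())

# DuckDB: SQL over the same in-memory frame (via its replacement scan of local variables)
by_store_db = duckdb.sql("SELECT store, SUM(sales) AS sales FROM df GROUP BY store").pl()

print(by_store_pl.sort("store"))
print(by_store_db.sort("store"))
```

The practical difference is mostly ergonomic: whether you (and your team) would rather express transformations as SQL or as a dataframe expression chain.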

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit dataengineeringcentral.substack.com/subscribe

Welcome to the Data Engineering Central Podcast —— a no-holds-barred discussion on the Data Landscape. Welcome to Episode 01 In today’s episode we will talk about the following topics from the Data Engineering perspective … * Snowflake vs Databricks. * Is Apache Spark being replaced?? * Notebooks in Production. Bad.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit dataengineeringcentral.substack.com/subscribe

Perhaps the biggest complaint about generative AI is hallucination. If the text you want to generate involves facts, for example, a chatbot that answers questions, then hallucination is a problem. The solution to this is to make use of a technique called retrieval augmented generation, where you store facts in a vector database and retrieve the most appropriate ones to send to the large language model to help it give accurate responses (see the toy sketch at the end of these notes). So, what goes into building vector databases and how do they improve LLM performance so much? Ram Sriharsha is currently the CTO at Pinecone. Before this role, he was the Director of Engineering at Pinecone and previously served as Vice President of Engineering at Splunk. He also worked as a Product Manager at Databricks. With a long history in the software development industry, Ram has held positions as an architect, lead product developer, and senior software engineer at various companies. Ram is also a long-time contributor to Apache Spark. In the episode, Richie and Ram explore common use-cases for vector databases, RAG in chatbots, steps to create a chatbot, static vs dynamic data, testing chatbot success, handling dynamic data, choosing language models, knowledge graphs, implementing vector databases, innovations in vector databases, the future of LLMs and much more.

Links Mentioned in the Show:
Pinecone
Webinar - Charting the Path: What the Future Holds for Generative AI
Course - Vector Databases for Embeddings with Pinecone
Related Episode: The Power of Vector Databases and Semantic Search with Elan Dekel, VP of Product at Pinecone
Rewatch sessions from RADAR: AI Edition

New to DataCamp? Learn on the go using the DataCamp mobile app. Empower your business with world-class data and AI skills with DataCamp for business.
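To ground the retrieval augmented generation idea described above, here is a toy, self-contained sketch of the retrieve-then-generate loop. The bag-of-words `embed()` and canned `llm()` below are stand-ins for a real embedding model and LLM API, and a production system would keep the vectors in a vector database such as Pinecone rather than a NumPy array.

```python
# Toy retrieval augmented generation: find the stored fact closest to the
# question, then hand it to the "model" as grounding context.
# embed() and llm() are stand-ins so this runs without external services.
import numpy as np

VOCAB = "eiffel tower tall duckdb in-process database how is the what".split()

def embed(text: str) -> np.ndarray:
    """Crude bag-of-words vector; a real system uses an embedding model."""
    words = text.lower().replace(".", " ").replace("?", " ").split()
    return np.array([float(w in words) for w in VOCAB])

def llm(prompt: str) -> str:
    """Canned stand-in for a real LLM call."""
    return f"[model answer grounded in]\n{prompt}"

facts = ["The Eiffel Tower is 330 m tall.", "DuckDB is an in-process database."]
fact_vectors = np.array([embed(f) for f in facts])

def answer(question: str, top_k: int = 1) -> str:
    q = embed(question)
    # cosine similarity between the question and every stored fact
    sims = fact_vectors @ q / (np.linalg.norm(fact_vectors, axis=1) * np.linalg.norm(q) + 1e-9)
    context = "\n".join(facts[i] for i in np.argsort(sims)[::-1][:top_k])
    # the retrieved facts constrain the model, which is what curbs hallucination
    return llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}")

print(answer("How tall is the Eiffel Tower?"))
```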

The Data Product Management In Action podcast, brought to you by Soda and executive producer Scott Hirleman, is a platform for data product management practitioners to share insights and experiences. We've released a special edition series of minisodes of our podcast. Recorded live at Data Connect 2024, our host Michael Toland engages in short, sweet, informative, and delightful conversations with five prominent practitioners who are forging their way forward in data and technology.

About our host Michael Toland: Michael is a Product Management Coach and Consultant with Pathfinder Product, a Test Double Operation. Since 2016, Michael has worked on large-scale system modernizations and migration initiatives at Verizon. Outside his professional career, Michael serves as the Treasurer for the New Leaders Council, mentors with Venture for America, sings with the Columbus Symphony, and writes satire for his blog Dignified Product. He is excited to discuss data product management with the podcast audience. Connect with Michael on LinkedIn About our guest Jean-Georges Perrin: Jean-Georges “jgp” Perrin is the Chief Innovation Officer at AbeaData, where he focuses on developing cutting-edge data tooling. He chairs the Open Data Contract Standard (ODCS) at the Linux Foundation's Bitol project, co-founded the AIDA User Group, and has authored several influential books, including Implementing Data Mesh (O'Reilly) and Spark in Action, 2nd Edition (Manning). With over 25 years in IT, Jean-Georges is recognized as a Lifetime IBM Champion, a PayPal Champion, and a Data Mesh MVP. His expertise spans data engineering, governance, and the industrialization of data science. Outside of tech, he enjoys exploring Upstate New York and New England with his family. Connect with J-GP on LinkedIn.  All views and opinions expressed are those of the individuals and do not necessarily reflect their employers or anyone else. Join the conversation on LinkedIn. Apply to be a guest or nominate a practitioner.  Do you love what you're listening to? Please rate and review the podcast, and share it with fellow practitioners you know. Your support helps us reach more listeners and continue providing valuable insights!

Summary

Data lakehouse architectures have been gaining significant adoption. To accelerate adoption in the enterprise Microsoft has created the Fabric platform, based on their OneLake architecture. In this episode Dipti Borkar shares her experiences working on the product team at Fabric and explains the various use cases for the Fabric service.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Dipti Borkar about her work on Microsoft Fabric and performing analytics on data without…

Interview

Introduction How did you get involved in the area of data management? Can you describe what Microsoft Fabric is and the story behind it? Data lakes in various forms have been gaining significant popularity as a unified interface to an organization's analytics. What are the motivating factors that you see for that trend? Microsoft has been investing heavily in open source in recent years, and the Fabric platform relies on several open components. What are the benefits of layering on top of existing technologies rather than building a fully custom solution?

What are the elements of Fabric that were engineered specifically for the service? What are the most interesting/complicated integration challenges?

How has your prior experience with Ahana and Presto informed your current work at Microsoft? AI plays a substantial role in the product. What are the benefits of embedding Copilot into the data engine?

What are the challenges in terms of safety and reliability?

What are the most interesting, innovative, or unexpected ways that you have seen the Fabric platform used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on data lakes generally, and Fabric specifically? When is Fabric the wrong choice? What do you have planned for the future of data lake analytics?

Contact Info

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links

Microsoft Fabric, Ahana episode, DB2 Distributed, Spark, Presto, Azure Data, MAD Landscape (Podcast Episode, ML Podcast Episode), Tableau, dbt, Medallion Architecture, Microsoft OneLake, ORC, Parquet, Avro, Delta Lake, Iceberg (Podcast Episode), Hudi (Podcast Episode), Hadoop, PowerBI (Podcast Episode), Velox, Gluten, Apache XTable, GraphQL, Formula 1, McLaren

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By: Starburst

This episode is brought to you by Starburst - an end-to-end data lakehouse platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, the query engine Apache Iceberg was designed for, Starburst is an open platform with support for all table formats including Apache Iceberg, Hive, and Delta Lake.

Summary

Stripe is a company that relies on data to power their products and business. To support that functionality they have invested in Trino and Iceberg for their analytical workloads. In this episode Kevin Liu shares some of the interesting features that they have built by combining those technologies, as well as the challenges that they face in supporting the myriad workloads that are thrown at this layer of their data platform.
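One flavor of the table-metadata questions raised later in the interview can be previewed directly: Trino's Iceberg connector exposes hidden metadata tables such as "&lt;table&gt;$snapshots". Below is a hypothetical sketch using the trino Python client; the host, catalog, schema, and table names are placeholders, not Stripe's actual setup.

```python
# Inspecting an Iceberg table's commit history through Trino.
# Connection details and table names are illustrative placeholders.
import trino

conn = trino.dbapi.connect(host="trino.example.com", port=8080, user="analyst",
                           catalog="iceberg", schema="analytics")
cur = conn.cursor()
# Every commit to an Iceberg table is a snapshot; this shows when and how the table changed.
cur.execute('SELECT snapshot_id, committed_at, operation '
            'FROM "events$snapshots" ORDER BY committed_at DESC LIMIT 5')
for row in cur.fetchall():
    print(row)
```

The same pattern works for other metadata tables (e.g. "$files", "$partitions"), which is what makes questions like "which team rewrote this table last night?" answerable with plain SQL.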

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Kevin Liu about his use of Trino and Iceberg for Stripe's data lakehouse

Interview

Introduction How did you get involved in the area of data management? Can you describe what role Trino and Iceberg play in Stripe's data architecture?

What are the ways in which your job responsibilities intersect with Stripe's lakehouse infrastructure?

What were the requirements and selection criteria that led to the selection of that combination of technologies?

What are the other systems that feed into and rely on the Trino/Iceberg service?

What kinds of questions are you answering with table metadata?

What use case/team does that support?

Comparative utility of the Iceberg REST catalog. What are the shortcomings of Trino and Iceberg? What are the most interesting, innovative, or unexpected ways that you have seen Iceberg/Trino used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Stripe's data infrastructure? When is a lakehouse on Trino/Iceberg the wrong choice? What do you have planned for the future of Trino and Iceberg at Stripe?

Contact Info

Substack LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links

Trino, Iceberg, Stripe, Spark, Redshift, Hive Metastore, Python Iceberg, Python Iceberg REST Catalog, Trino Metadata Table, Flink (Podcast Episode), Tabular (Podcast Episode), Delta Table (Podcast Episode), Databricks Unity Catalog, Starburst, AWS Athena, Kevin Trinofest Presentation, Alluxio (Podcast Episode), Parquet, Hudi, Trino Project Tardigrade, Trino On Ice

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By: Starburst

This episode is brought to you by Starburst - an end-to-end data lakehouse platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, the query engine Apache Iceberg was designed for, Starburst is an open platform with support for all table formats including Apache Iceberg, Hive, and Delta Lake.

Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises.

Send us a text

This one is always a good listen: an interview from last year with Dr. Mindy Weinstein discussing how we try to reach humans through digital marketing, and the power of scarcity. I am reminded of this concept every time I look at the resell price of Taylor Swift tickets.

Marketing: The Power of Scarcity with Mindy Weinstein, Founder and CEO of Market MindShift, Marketing instructor for Grand Canyon University, Columbia Business School, and Wharton. "Trying to reach humans" through digital marketing. Original episode: Season 7, Episode 5.

01:41 Meet "Marketer" Mindy Weinstein
04:42 Technology in Marketing
07:50 One of the top women in digital marketing
09:12 The power of scarcity
19:16 Four types of scarcity
20:41 Bourbon scarcity
21:47 Businesses leveraging scarcity

LinkedIn: linkedin.com/in/mindydweinstein
Website: https://www.persuasioninbusiness.com/book, https://www.marketmindshift.com/

Summary of Dr. Weinstein's book: Drive revenue and grow your business by using the powerful concept of scarcity. Scarcity isn't just one of the key principles of influence, it's arguably the most powerful―invoking the kind of primal instincts that were essential to our ancestors' survival. It's also the explanation for why, in the mid-1990s, $29.99 Tickle-Me-Elmo dolls were being scalped for $7,000 apiece. And yet, for all its power, scarcity is a principle that's little understood, even as it's frequently employed in sales and marketing campaigns. Research on scarcity is published mainly in academic journals, not easily accessible to the mainstream public, and often written from an economic, rather than psychological, point of view. In The Power of Scarcity, Dr. Mindy Weinstein leverages her deep expertise in both marketing and psychology to reveal how this influence principle can be used to boost sales, win negotiations, spark action, develop community, build customer loyalty, and more. As a digital marketer and doctor of philosophy in psychology, she brings both practical and academic insights to explain the psychology behind scarcity, why it has such an immense impact on decision making, and how, used correctly and ethically, it can influence the people who buy your products or services. In these pages, you'll gain a deeper understanding of why and how scarcity works in business, and specifically how different types of scarcity messages―supply related, demand related, time related or limited edition―affect our brains. You'll see it in action from multiple perspectives, through case studies, research findings, and eye-opening interviews with current and former executives (from brands that include McDonald’s, Harry & David, and 1-800-Flowers), as well as real-life customers' firsthand experiences. For anyone involved in sales and marketing today, The Power of Scarcity is a rare find, combining the best research on the subject as well as hands-on, tactical ways to apply the psychology behind it to knowledgeably harness that power to bolster your business.

Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

Send us a text Welcome to the cozy corner of the tech world where ones and zeros mingle with casual chit-chat. Datatopics Unplugged is your go-to spot for relaxed discussions around tech, news, data, and society. Dive into conversations that flow like your morning coffee, where industry insights meet laid-back banter. Whether you're a data aficionado or just curious about the digital age, pull up a chair and let's explore the heart of data, unplugged style!

Stack Overflow and OpenAI Deal Controversy: Discussing the partnership controversy, with users protesting the lack of an opt-out option and how this could reshape the platform. Look into Phind here.
Apple and OpenAI Rumors - could ChatGPT be the new Siri? Examining rumors of ChatGPT potentially replacing Siri, and Apple's AI strategy compared to Microsoft’s MAI-1. Check out more community opinions here.
Hello GPT-4o: Exploring the new era with OpenAI's GPT-4o that blends video, text, and audio for more dynamic human-AI interactions. Discussing AI's challenges under the European AI Act and ChatGPT’s use in daily life and dating apps like Bumble.
Claude Takes Europe: Claude 3 now available in the EU. How does it compare to ChatGPT in coding and conversation?
ElevenLabs' Music Generation AI: A look at ElevenLabs' AI for generating music and the broader AI music landscape. How are these algorithms transforming music creation? Check out the AI Song Contest here.
Google Cloud’s Big Oops with UniSuper: Unpack the shocking story of how Google Cloud accidentally wiped out UniSuper’s account. What does this mean for data security and redundancy strategies?
The Great CLI Debate: Is Python really the right choice for CLI tools? We spark the debate over Python vs. Go and Rust in building efficient CLI tools.

This episode features Alli Torban, a leading data information designer, sharing her career journey from a data analyst to teaching data visualization to companies like Google and Moderna.

Alli advises on becoming a data viz designer, emphasizing the significance of data literacy, tool mastery, and building a portfolio with personal projects.

Connect with Alli Torban :

🤝 Follow on Linkedin

📔 Learn About Chart Spark

🧙‍♂️ Ace the Interview with Confidence

⁠📩 Get my weekly email with helpful data career tips⁠

⁠📊 Come to my next free “How to Land Your First Data Job” training⁠

⁠🏫 Check out my 10-week data analytics bootcamp

Timestamps:

(08:16) Alli's Transition to Freelance and Starting Her Own Company
(17:40) Advice for Aspiring Data Visualization Designers
(21:42) Unlocking Creativity with Practical Inspiration and Prompts

Connect with Avery:

📺 Subscribe on YouTube

🎙Listen to My Podcast

👔 Connect with me on LinkedIn

📸 Instagram

🎵 TikTok

Mentioned in this episode: Join the last cohort of 2025! The LAST cohort of The Data Analytics Accelerator for 2025 kicks off on Monday, December 8th and enrollment is officially open!

To celebrate the end of the year, we’re running a special End-of-Year Sale, where you’ll get: ✅ A discount on your enrollment 🎁 6 bonus gifts, including job listings, interview prep, AI tools + more

If your goal is to land a data job in 2026, this is your chance to get ahead of the competition and start strong.

👉 Join the December Cohort & Claim Your Bonuses: https://www.datacareerjumpstart.com/daa