One of the biggest surprises of the generative AI revolution over the past two years lies in the counter-intuitiveness of its most successful use cases. Counter to most predictions made about AI years ago, AI-assisted coding, and specifically AI-assisted data work, has emerged as one of the biggest killer apps of generative AI tools and copilots. But what happens when we take this notion even further? What will analytics workflows look like when generative AI tools can also assist us in problem-solving? What types of analytics use cases can we expect to operationalize, and what tools can we expect to work with, when AI systems can provide scalable qualitative data instead of relying on imperfect quantitative proxies? Today’s guest calls this future “weird”.

Benn Stancil is the Field CTO at ThoughtSpot. He joined ThoughtSpot in 2023 as part of its acquisition of Mode, where he was a Co-Founder and CTO. While at Mode, Benn held roles leading Mode’s data, product, marketing, and executive teams. He regularly writes about data and technology at benn.substack.com. Prior to founding Mode, Benn worked on analytics teams at Microsoft and Yammer.

Throughout the episode, Benn and Adel talk about the nature of AI-assisted analytics workflows, the potential for generative AI in assisting problem-solving, how he imagines analytics workflows will look in the future, and a lot more.

About the AI and the Modern Data Stack DataFramed Series

This week we’re releasing 4 episodes focused on how AI is changing the modern data stack and the analytics profession at large. The modern data stack is often an ambiguous and all-encompassing term, so we intentionally wanted to cover the impact of AI on the modern data stack from different angles. Here’s what you can expect:

Why the Future of AI in Data will be Weird with Benn Stancil, CTO at Mode & Field CTO at ThoughtSpot — Covering how AI will change analytics workflows and tools
How Databricks is Transforming Data Warehousing and AI with Ari Kaplan, Head Evangelist & Robin Sutara, Field CTO at Databricks — Covering Databricks, data intelligence, and how AI tools are changing data democratization
Adding AI to the Data Warehouse with Sridhar Ramaswamy, CEO at Snowflake — Covering Snowflake and its uses, how generative AI is changing the attitudes of leaders towards data, and how to improve your data management
Accelerating AI Workflows with Nuri Cankaya, VP of AI Marketing & La Tiffaney Santucci, AI Marketing Director at Intel — Covering AI’s impact on marketing analytics, how AI is being integrated into existing products, and the democratization of AI

Links Mentioned in the Show:

Mode Analytics
ThoughtSpot acquires Mode: Empowering data teams to bring Generative AI to BI
Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are
[Course] Generative AI for Business
[Skill Track] SQL Fundamentals
Related Episode: The Future of Marketing Analytics with Cory Munchbach, CEO at...
Summary
A data lakehouse is intended to combine the benefits of data lakes (cost effective, scalable storage and compute) and data warehouses (user friendly SQL interface). Multiple open source projects and vendors have been working together to make this vision a reality. In this episode Dain Sundstrom, CTO of Starburst, explains how the combination of the Trino query engine and the Iceberg table format offer the ease of use and execution speed of data warehouses with the infinite storage and scalability of data lakes.
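To make that combination concrete, here is a minimal sketch of what the lakehouse pattern looks like from the SQL side. It assumes a Trino installation with an Iceberg connector catalog named iceberg; the schema, table, and column names are hypothetical.

```sql
-- The table is just Parquet files plus Iceberg metadata in object storage,
-- but creating, loading, and querying it stays plain warehouse-style SQL.
CREATE TABLE iceberg.analytics.page_views (
    user_id   BIGINT,
    url       VARCHAR,
    viewed_at TIMESTAMP(6)
)
WITH (
    format = 'PARQUET',
    partitioning = ARRAY['day(viewed_at)']
);

INSERT INTO iceberg.analytics.page_views
VALUES (42, 'https://example.com/pricing', TIMESTAMP '2024-03-01 12:00:00');

-- Interactive analytics directly over data-lake storage:
SELECT url, count(*) AS views
FROM iceberg.analytics.page_views
GROUP BY url
ORDER BY views DESC;
```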
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management

Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!

Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

Join the event for the global data community, Data Council Austin. From March 26th-28th 2024, they'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount of 20% off your ticket by using the promo code dataengpod20. Don't miss out on their only event this year! Visit: dataengineeringpodcast.com/data-council today.

Your host is Tobias Macey and today I'm interviewing Dain Sundstrom about building a data lakehouse with Trino and Iceberg
Interview
Introduction
How did you get involved in the area of data management?
To start, can you share your definition of what constitutes a "Data Lakehouse"?
What are the technical/architectural/UX challenges that have hindered the progression of lakehouses?
What are the notable advancements in recent months/years that make them a more viable platform choice?
There are multiple tools and vendors that have adopted the "data lakehouse" terminology. What are the benefits offered by the combination of Trino and Iceberg?
What are the key points of comparison for that combination in relation to other possible selections?
What are the pain points that are still prevalent in lakehouse architectures as compared to warehouse or vertically integrated systems?
What progress is being made (within or across the ecosystem) to address those sharp edges?
For someone who is interested in building a data lakehouse with Trino and Iceberg, how does that influence their selection of other platform elements?
What are the differences in terms of pipeline design/access and usage patterns when using a Trino
Summary
Sharing data is a simple concept, but complicated to implement well. There are numerous business rules and regulatory concerns that need to be applied. There are also numerous technical considerations to be made, particularly if the producer and consumer of the data aren't using the same platforms. In this episode Andrew Jefferson explains the complexities of building a robust system for data sharing, the techno-social considerations, and how the Bobsled platform that he is building aims to simplify the process.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management

Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!

Your host is Tobias Macey and today I'm interviewing Andy Jefferson about how to solve the problem of data sharing
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving some context and scope of what we mean by "data sharing" for the purposes of this conversation?
What is the current state of the ecosystem for data sharing protocols/practices/platforms?
What are some of the main challenges/shortcomings that teams/organizations experience with these options?
What are the technical capabilities that need to be present for an effective data sharing solution?
How does that change as a function of the type of data? (e.g. tabular, image, etc.)
What are the requirements around governance and auditability of data access that need to be addressed when sharing data?
What are the typical boundaries along which data access requires special consideration for how the sharing is managed?
Many data platform vendors have their own interfaces for data sharing. What are the shortcomings of those options, and what are the opportunities for abstracting the sharing capability from the underlying platform?
What are the most interesting, innovative, or unexpected ways that you have seen data sharing/Bobsled used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data sharing?
When is Bobsled the wrong choice?
What do you have planned for the future of data sharing?
Contact Info
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
You Will Learn Python! Zed Shaw has created the world's most reliable system for learning Python. Follow it and you will succeed--just like the millions of beginners Zed has taught to date! You bring the discipline, persistence, and attention; the author supplies the masterful knowledge you need to succeed. In Learn Python the Hard Way, Fifth Edition, you'll learn Python by working through 60 lovingly crafted exercises. Read them. Type in the code. Run it. Fix your mistakes. Repeat. As you do, you'll learn how a computer works, how to solve problems, and how to enjoy programming . . . even when it's driving you crazy.
Install a complete Python environment
Organize and write code
Fix and break code
Basic mathematics
Strings and text
Interact with users
Work with files
Looping and logic
Object-oriented programming
Data structures using lists and dictionaries
Modules, classes, and objects
Python packaging
Automated testing
Basic SQL for Data Science
Web scraping
Fixing bad data (munging)
The "Data" part of "Data Science"
It'll be frustrating at first. But if you keep trying, you'll get it--and it'll feel amazing! This course will reward you for every minute you put into it. Soon, you'll know one of the world's most powerful, popular programming languages. You'll be a Python programmer.
This Book Is Perfect For
Total beginners with zero programming experience
Junior developers who know one or two languages
Returning professionals who haven't written code in years
Aspiring Data Scientists or academics who need to learn to code
Seasoned professionals looking for a fast, simple crash course in Python for Data Science
Register your book for convenient access to downloads, updates, and/or corrections as they become available. See inside book for details.
Summary
Stream processing systems have long been built with a code-first design, adding SQL as a layer on top of the existing framework. RisingWave is a database engine that was created specifically for stream processing, with S3 as the storage layer. In this episode Yingjun Wu explains how it is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management

Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!

Your host is Tobias Macey and today I'm interviewing Yingjun Wu about the RisingWave database and the intricacies of building a stream processing engine on S3
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what RisingWave is and the story behind it?
There are numerous stream processing engines, near-real-time database engines, streaming SQL systems, etc. What is the specific niche that RisingWave addresses?
What are some of the platforms/architectures that teams are replacing with RisingWave?
What are some of the unique capabilities/use cases that RisingWave provides over other offerings in the current ecosystem?
Can you describe how RisingWave is architected and implemented?
How have the design and goals/scope changed since you first started working on it?
What are the core design philosophies that you rely on to prioritize the ongoing development of the project?
What are the most complex engineering challenges that you have had to address in the creation of RisingWave?
Can you describe a typical workflow for teams that are building on top of RisingWave?
What are the user/developer experience elements that you have prioritized most highly?
What are the situations where RisingWave can/should be a system of record vs. a point-in-time view of data in transit, with a data warehouse/lakehouse as the longitudinal storage and query engine?
What are the most interesting, innovative, or unexpected ways that you have seen RisingWave used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on RisingWave?
When is RisingWave the wrong choice?
What do you have planned for the future of RisingWave?
Contact Info
yingjunwu on GitHub
Personal Website
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows.
Summary
Monitoring and auditing IT systems for security events requires the ability to quickly analyze massive volumes of unstructured log data. The majority of products that are available either require too much effort to structure the logs, or aren't fast enough for interactive use cases. Cliff Crosland co-founded Scanner to provide fast querying of high scale log data for security auditing. In this episode he shares the story of how it got started, how it works, and how you can get started with it.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management

Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

Your host is Tobias Macey and today I'm interviewing Cliff Crosland about Scanner, a security data lake platform for analyzing security logs and identifying issues quickly and cost-effectively
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Scanner is and the story behind it?
What were the shortcomings of other tools that are available in the ecosystem?
What is Scanner explicitly not trying to solve for in the security space? (e.g. SIEM)
A query engine is useless without data to analyze. What are the data acquisition paths/sources that you are designed to work with? (e.g. CloudTrail logs, app logs, etc.)
What are some of the other sources of signal for security monitoring that would be valuable to incorporate or integrate with through Scanner?
Log data is notoriously messy, with no strictly defined format. How do you handle introspection and querying across loosely structured records that might span multiple sources and inconsistent labelling strategies?
Can you describe the architecture of the Scanner platform?
What were the motivating constraints that led you to your current implementation?
How have the design and goals of the product changed since you first started working on it?
Given the security oriented customer base that you are targeting, how do you address trust/network boundaries for compliance with regulatory/organizational policies?
What are the personas of the end-users for Scanner?
How has that influenced the way that you think about the query formats, APIs, user experience, etc. for the product?
For teams who are working with Scanner can you describe how it fits into their workflow?
What are the most interesting, innovative, or unexpected ways that you have seen Scanner used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Scanner?
When is Scanner the wrong choice?
What do you have planned for the future of Scanner?
Contact Info
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
Summary
Databases and analytics architectures have gone through several generational shifts. A substantial amount of the data that is being managed in these systems is related to customers and their interactions with an organization. In this episode Tasso Argyros, CEO of ActionIQ, gives a summary of the major epochs in database technologies and how he is applying the capabilities of cloud data warehouses to the challenge of building more comprehensive experiences for end-users through a modern customer data platform (CDP).
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management

Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

Data projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at dataengineeringpodcast.com/miro. That’s three free boards at dataengineeringpodcast.com/miro.

Your host is Tobias Macey and today I'm interviewing Tasso Argyros about the role of a customer data platform in the context of the modern data stack
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what the role of the CDP is in the context of a business's data ecosystem?
What are the core technical challenges associated with building and maintaining a CDP?
What are the organizational/business factors that contribute to the complexity of these systems?
The early days of CDPs came with the promise of "Customer 360". Can you unpack that concept and how it has changed over the past ~5 years?
Recent years have seen the adoption of reverse ETL, cloud data warehouses, and sophisticated product analytics suites. How has that changed the architectural approach to CDPs?
How have the architectural shifts changed the ways that organizations interact with their customer data?
How have the responsibilities shifted across different roles?
What are the governance policy and enforcement challenges that are added with the expansion of access and responsibility?
What are the most interesting, innovative, or unexpected ways that you have seen CDPs built/used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on CDPs?
When is a CDP the wrong choice?
What do you have planned for the future of ActionIQ?
Contact Info
LinkedIn
@Tasso on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used.
In this episode of the Data Career Podcast, we answer a variety of listener questions, shedding light on topics like the future of data engineering, requirements for becoming a data analyst, showcasing data cleaning proficiency in Excel, and securing data analyst internships.
We also discuss the significance of storytelling, views on Power BI versus Tableau, and the impact of AI on data analysis roles.
Tune in now!
👍 Leave your review and download the bonus!
🤝 Ace your data analyst interview with the interview simulator
📩 Get my weekly email with helpful data career tips
📊 Come to my next free “How to Land Your First Data Job” training
🏫 Check out my 10-week data analytics bootcamp
Timestamps:
(02:10) - What’s the future of data engineering in 2024?
(03:06) - Do you need a degree to become a data analyst?
(04:57) - How to showcase Excel skills?
(07:22) - How to land data analyst internships?
(10:10) - What are the main technical skills required to land your first data job?
(14:40) - Have you worked with many teachers looking to make a career transition?
(16:24) - How to get a data analyst job for people with no work experience?
(25:13) - Can you suggest SQL and Excel videos for data analysis?
(28:46) - Do you think the data analysis industry is saturated?
(28:21) - Do you find data analysts transferring to becoming a data scientist or a data engineer?
Connect with Avery:
📺 Subscribe on YouTube
🎙Listen to My Podcast
👔 Connect with me on LinkedIn
🎵 TikTok
Mentioned in this episode: Join the last cohort of 2025! The LAST cohort of The Data Analytics Accelerator for 2025 kicks off on Monday, December 8th and enrollment is officially open!
To celebrate the end of the year, we’re running a special End-of-Year Sale, where you’ll get: ✅ A discount on your enrollment 🎁 6 bonus gifts, including job listings, interview prep, AI tools + more
If your goal is to land a data job in 2026, this is your chance to get ahead of the competition and start strong.
👉 Join the December Cohort & Claim Your Bonuses: https://DataCareerJumpstart.com/daa
Write optimized queries. This book helps you write queries that perform fast and deliver results on time. You will learn that query optimization is not a dark art practiced by a small, secretive cabal of sorcerers. Any motivated professional can learn to write efficient queries from the get-go and capably optimize existing queries. You will learn to look at the process of writing a query from the database engine’s point of view, and know how to think like the database optimizer.
The book begins with a discussion of what a performant system is and progresses to measuring performance and setting performance goals. It introduces different classes of queries and optimization techniques suitable to each, such as the use of indexes and specific join algorithms. You will learn to read and understand query execution plans along with techniques for influencing those plans for better performance. The book also covers advanced topics such as the use of functions and procedures, dynamic SQL, and generated queries. All of these techniques are then used together to produce performant applications, avoiding the pitfalls of object-relational mappers.
This second edition includes new examples using Postgres 15 and the newest version of the PostgresAir database. It includes additional details and clarifications about advanced topics, and covers configuration parameters in greater depth. Finally, it makes use of advancements in NORM, using automatically generated functions.
What You Will Learn
Identify optimization goals in OLTP and OLAP systems
Read and understand PostgreSQL execution plans
Distinguish between short queries and long queries
Choose the right optimization technique for each query type
Identify indexes that will improve query performance
Optimize full table scans
Avoid the pitfalls of object-relational mapping systems
Optimize the entire application rather than just database queries
Who This Book Is For
IT professionals working in PostgreSQL who want to develop performant and scalable applications, anyone whose job title contains the words “database developer” or “database administrator” or who is a backend developer charged with programming database calls, and system architects involved in the overall design of application systems running against a PostgreSQL database
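As a small taste of the workflow the book teaches, the sketch below shows how an execution plan reveals whether the optimizer chose a sequential scan, and how an index changes that choice. The table and column names here are hypothetical, not the book's own examples.

```sql
-- Ask PostgreSQL to execute the query and report the actual plan and timings.
EXPLAIN ANALYZE
SELECT flight_no, scheduled_departure
FROM flight
WHERE scheduled_departure >= '2024-07-01'
  AND scheduled_departure <  '2024-07-02';

-- Without a suitable index, the plan shows a Seq Scan over the whole table.
-- For a short, selective query like the one above, the index below lets the
-- planner switch to an Index Scan or Bitmap Heap Scan instead.
CREATE INDEX flight_scheduled_departure_idx
    ON flight (scheduled_departure);
```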
Summary
Data processing technologies have dramatically improved in their sophistication and raw throughput. Unfortunately, the volumes of data that are being generated continue to double, requiring further advancements in the platform capabilities to keep up. As the sophistication increases, so does the complexity, leading to challenges for user experience. Jignesh Patel has been researching these areas for several years in his work as a professor at Carnegie Mellon University. In this episode he illuminates the landscape of problems that we are faced with and how his research is aimed at helping to solve these problems.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management

Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

Your host is Tobias Macey and today I'm interviewing Jignesh Patel about the research that he is conducting on technical scalability and user experience improvements around data management
Interview
Introduction
How did you get involved in the area of data management?
Can you start by summarizing your current areas of research and the motivations behind them?
What are the open questions today in technical scalability of data engines?
What are the experimental methods that you are using to gain understanding in the opportunities and practical limits of those systems?
As you strive to push the limits of technical capacity in data systems, how does that impact the usability of the resulting systems?
When performing research and building prototypes of the projects, what is your process for incorporating user experience into the implementation of the product?
What are the main sources of tension between technical scalability and user experience/ease of comprehension?
What are some of the positive synergies that you have been able to realize between your teaching, research, and corporate activities?
In what ways do they produce conflict, whether personally or technically?
What are the most interesting, innovative, or unexpected ways that you have seen your research used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on research of the scalability limits of data systems?
What is your heuristic for when a given research project needs to be terminated or productionized?
What do you have planned for the future of your academic research?
Contact Info
Website
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.
This is the blueprint to becoming a data analyst, and I will walk you through the levels step by step so you can become one this year.
Follow this blueprint, and I promise you can become a data analyst. If you’d like more of my free resources, including a more in-depth webinar where I talk more about projects & networking, click the link in the description.
Practice SQL with:
🛠️ Analyst Builder
🖱️ Stratascratch
🐒 DataLemur
🤝 Ace your data analyst interview with the interview simulator
📩 Get my weekly email with helpful data career tips
📊 Come to my next free “How to Land Your First Data Job” training
🏫 Check out my 10-week data analytics bootcamp
Timestamps:
(00:08) - Level 1
(04:20) - Level 2
(06:15) - Level 3
(07:50) - Level 4
(09:10) - Level 5
(13:05) - Level 6
(15:30) - Level 7
(18:20) - Level 8
(19:42) - Level 9
(21:50) - Level 10
Connect with Avery:
📺 Subscribe on YouTube
🎙Listen to My Podcast
👔 Connect with me on LinkedIn
🎵 TikTok
Mentioned in this episode: Join the last cohort of 2025! The LAST cohort of The Data Analytics Accelerator for 2025 kicks off on Monday, December 8th and enrollment is officially open!
To celebrate the end of the year, we’re running a special End-of-Year Sale, where you’ll get: ✅ A discount on your enrollment 🎁 6 bonus gifts, including job listings, interview prep, AI tools + more
If your goal is to land a data job in 2026, this is your chance to get ahead of the competition and start strong.
👉 Join the December Cohort & Claim Your Bonuses: https://DataCareerJumpstart.com/daa
Summary
Working with financial data requires a high degree of rigor due to the numerous regulations and the risks involved in security breaches. In this episode Andrey Korchack, CTO of fintech startup Monite, discusses the complexities of designing and implementing a data platform in that sector.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management

Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack

You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!

Your host is Tobias Macey and today I'm interviewing Andrey Korchak about how to manage data in a fintech environment
Interview
Introduction
How did you get involved in the area of data management?
Can you start by summarizing the data challenges that are particular to the fintech ecosystem?
What are the primary sources and types of data that fintech organizations are working with?
What are the business-level capabilities that are dependent on this data?
How do the regulatory and business requirements influence the technology landscape in fintech organizations?
What does a typical build vs. buy decision process look like?
Fraud prediction in e.g. banks is one of the most well-established applications of machine learning in industry. What are some of the other ways that ML plays a part in fintech?
How does that influence the architectural design/capabilities for data platforms in those organizations?
Data governance is a notoriously challenging problem. What are some of the strategies that fintech companies are able to apply to this problem given their regulatory burdens?
What are the most interesting, innovative, or unexpected approaches to data management that you have seen in the fintech sector?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data in fintech?
What do you have planned for the future of your data capabilities at Monite?
Contact Info
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Monite
ISO 27001
Tesseract
GitOps
SWIFT Protocol
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
Starburst: 
This episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics.
Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. dataengineeringpodcast.com/starburst
Rudderstack:
Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
Materialize:
You shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date.
That is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing.
Go to materialize.com today and get 2 weeks free!
Support Data Engineering Podcast
Summary
Kafka has become a ubiquitous technology, offering a simple method for coordinating events and data across different systems. Operating it at scale, however, is notoriously challenging. Elad Eldor has experienced these challenges first-hand, leading to his work writing the book "Kafka Troubleshooting in Production". In this episode he highlights the sources of complexity that contribute to Kafka's operational difficulties, and some of the main ways to identify and mitigate potential sources of trouble.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management

Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack

You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!

Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

Your host is Tobias Macey and today I'm interviewing Elad Eldor about operating Kafka in production and how to keep your clusters stable and performant
Interview
Introduction
How did you get involved in the area of data management?
Can you describe your experiences with Kafka?
What are the operational challenges that you have had to overcome while working with Kafka?
What motivated you to write a book about how to manage Kafka in production?
There are many options now for persistent data queues. What are the factors to consider when determining whether Kafka is the right choice?
In the case where Kafka is the appropriate tool, there are many ways to run it now. What are the considerations that teams need to work through when determining whether/where/how to operate a cluster?
When provisioning a Kafka cluster, what are the requirements that need to be considered when determining the sizing?
What are the axes along which size/scale need to be determined?
The core promise of Kafka is that it is a durable store for continuous data. What are the mechanisms that are available for preventing data loss?
Under what circumstances can data be lost?
What are the different failure conditions that cluster operators need to be aware of?
What are the monitoring strategies that ar
Summary
The "modern data stack" promised a scalable, composable data platform that gave everyone the flexibility to use the best tools for every job. The reality was that it left data teams in the position of spending all of their engineering effort on integrating systems that weren't designed with compatible user experiences. The team at 5X understand the pain involved and the barriers to productivity and set out to solve it by pre-integrating the best tools from each layer of the stack. In this episode founder Tarush Aggarwal explains how the realities of the modern data stack are impacting data teams and the work that they are doing to accelerate time to value.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management

Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack

You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!

Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

Your host is Tobias Macey and today I'm welcoming back Tarush Aggarwal to talk about what he and his team at 5x data are building to improve the user experience of the modern data stack.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what 5x is and the story behind it?
We last spoke in March of 2022. What are the notable changes in the 5x business and product?
What are the notable shifts in the data ecosystem that have influenced your adoption and product direction?
What trends are you most focused on tracking as you plan the continued evolution of your offerings?
What are the points of friction that teams run into when trying to build their data platform?
Can you describe the design of the system that you have built?
What are the strategies that you rely on to support adaptability and speed of onboarding for new integrations?
What are some of the types of edge cases that you have to deal with while integrating and operating the platform implementations that you design for your customers?
What is your process for selection of vendors to support?
How would you characte
Data is power, but building APIs is tedious. Engineers create vital value by modelling domains and data, but waste time on repetitive plumbing tasks like CRUD, data pipelines, and cross data source joins. What if you could skip all that? With supergraph, you can. Query the supergraph with GraphQL, and get consistent features like joins, filtering, and aggregations across all data sources. Set permissions where they belong: at the model level, where they can be applied to absolutely any query. A live demo will show how to build a supergraph that connects the GitHub API with a users database.
How can you work with a recruiter in your data journey? Listen to find out!
Katie and Bobby from HireFit share their insights on creating a killer LinkedIn profile, tailoring your resume to stand out, and making human connections that will make a lasting impression on hiring managers.
Listen if you want a leg-up in the job hunt!
Connect with HireFit:
🤝 Follow on Linkedin
🤝 Connect with Bobby Aragon
🤝 Connect with Katie Jolles
🎒 Learn About HireFit
🗄️ Join FREE SQL hands-on workshop this December!
⭐ Leave a Podcast review & get your bonus!
🤝 Ace your data analyst interview with the interview simulator
📩 Get my weekly email with helpful data career tips
📊 Come to my next free “How to Land Your First Data Job” training
🏫 Check out my 10-week data analytics bootcamp
Timestamps:
(11:19) - Recruiters Don’t Want to Help: myth or truth
(28:47) - LinkedIn is KING for recruiters & applicants
(35:48) - Work smarter, not harder in your job search
Connect with Avery:
📺 Subscribe on YouTube
🎙Listen to My Podcast
👔 Connect with me on LinkedIn
🎵 TikTok
Mentioned in this episode: Join the last cohort of 2025! The LAST cohort of The Data Analytics Accelerator for 2025 kicks off on Monday, December 8th and enrollment is officially open!
To celebrate the end of the year, we’re running a special End-of-Year Sale, where you’ll get: ✅ A discount on your enrollment 🎁 6 bonus gifts, including job listings, interview prep, AI tools + more
If your goal is to land a data job in 2026, this is your chance to get ahead of the competition and start strong.
👉 Join the December Cohort & Claim Your Bonuses: https://DataCareerJumpstart.com/daa
Summary
If your business metrics looked weird tomorrow, would you know about it first? Anomaly detection is focused on identifying those outliers for you, so that you are the first to know when a business critical dashboard isn't right. Unfortunately, it can often be complex or expensive to incorporate anomaly detection into your data platform. Andrew Maguire got tired of solving that problem for each of the different roles he has ended up in, so he created the open source Anomstack project. In this episode he shares what it is, how it works, and how you can start using it today to get notified when the critical metrics in your business aren't quite right.
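Anomstack packages the detection, storage, and alerting for you, but to illustrate the underlying idea, here is a toy z-score check written in generic SQL. The metrics table and its columns are hypothetical, the threshold is arbitrary, and interval syntax varies by engine.

```sql
-- Flag today's observations that sit more than 3 standard deviations from
-- each metric's trailing 30-day mean: a crude stand-in for real detection.
WITH stats AS (
    SELECT
        metric_name,
        AVG(metric_value)         AS mean_value,
        STDDEV_SAMP(metric_value) AS std_value
    FROM metrics
    WHERE metric_ts >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY metric_name
)
SELECT m.metric_name, m.metric_ts, m.metric_value
FROM metrics AS m
JOIN stats AS s USING (metric_name)
WHERE m.metric_ts >= CURRENT_DATE
  AND s.std_value > 0
  AND abs(m.metric_value - s.mean_value) / s.std_value > 3;
```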
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management

You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!

Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack

Data projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at dataengineeringpodcast.com/miro. That’s three free boards at dataengineeringpodcast.com/miro.

Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

Your host is Tobias Macey and today I'm interviewing Andrew Maguire about his work on the Anomstack project and how you can use it to run your own anomaly detection for your metrics
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Anomstack is and the story behind it?
What are your goals for this project?
What other tools/products might teams be evaluating while they consider Anomstack?
With the shift from data warehouses to data lakes, data now lands in repositories before it's been transformed, enabling engineers to model raw data into clean, well-defined datasets. dbt (data build tool) helps you take data further. This practical book shows data analysts, data engineers, BI developers, and data scientists how to create a true self-service transformation platform through the use of dynamic SQL. Authors Rui Machado from Monstarlab and Hélder Russa from Jumia show you how to quickly deliver new data products by focusing more on value delivery and less on architectural and engineering aspects. If you know your business well and have the technical skills to model raw data into clean, well-defined datasets, you'll learn how to design and deliver data models without any technical influence. With this book, you'll learn: What dbt is and how a dbt project is structured How dbt fits into the data engineering and analytics worlds How to collaborate on building data models The main tools and architectures for building useful, functional data models How to fit dbt into data warehousing and laking architecture How to build tests for data transformations
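To give a flavor of what such a transformation looks like, here is a minimal hypothetical dbt model: just a SELECT statement with Jinja, where {{ ref() }} declares dependencies on upstream models so dbt can order the build. The project layout and model names are invented for illustration.

```sql
-- models/marts/customer_orders.sql (hypothetical path and names)
-- dbt compiles {{ ref('...') }} to the concrete relation in your warehouse
-- and uses the references to build the dependency graph between models.
SELECT
    c.customer_id,
    MIN(o.ordered_at) AS first_order_at,
    COUNT(o.order_id) AS order_count
FROM {{ ref('stg_customers') }} AS c
LEFT JOIN {{ ref('stg_orders') }} AS o
    ON o.customer_id = c.customer_id
GROUP BY c.customer_id
```

Running dbt run would materialize this model as a view or table in the warehouse, and dbt test can then check declared expectations such as uniqueness of customer_id.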
Join Yingjun Wu as we unlock the power of real-time insights in 'Unlocking Real-time Insights: Enhancing Your Databases With Stream Processing.' 🚀 Explore how to leverage Change Data Capture (CDC) and modern SQL streaming databases to revolutionize your data analytics, and discover the magic of materialized views for instant, actionable insights. 📈💡 #RealTimeInsights #streamprocessing
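As a rough sketch of the pattern described here, written in RisingWave-flavored streaming SQL (the CDC connector options are placeholders, not a complete or verified configuration), a CDC-fed table plus an incrementally maintained materialized view yields always-fresh aggregates:

```sql
-- Hypothetical CDC-backed table mirroring an upstream OLTP 'orders' table;
-- host, credentials, and table options are deliberately elided.
CREATE TABLE orders (
    order_id    BIGINT PRIMARY KEY,
    customer_id BIGINT,
    amount      DECIMAL,
    created_at  TIMESTAMP
) WITH (connector = 'postgres-cdc' /* remaining options elided */);

-- The view is maintained incrementally as change events arrive, so a
-- dashboard reads precomputed results instead of rescanning history.
CREATE MATERIALIZED VIEW revenue_per_customer AS
SELECT customer_id, SUM(amount) AS total_revenue
FROM orders
GROUP BY customer_id;
```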
✨ H I G H L I G H T S ✨
🙌 A huge shoutout to all the incredible participants who made Big Data Conference Europe 2023 in Vilnius, Lithuania, from November 21-24, an absolute triumph! 🎉 Your attendance and active participation were instrumental in making this event so special. 🌍
Don't forget to check out the session recordings from the conference to relive the valuable insights and knowledge shared! 📽️
Once again, THANK YOU for playing a pivotal role in the success of Big Data Conference Europe 2023. 🚀 See you next year for another unforgettable conference! 📅 #BigDataConference #SeeYouNextYear
Join TikTok star Elijah Butler, a Data Analyst at Humana, as we discuss his journey into data analytics, share valuable insights about the importance of networking, and ponder over the necessity of a master's degree in the field.
The episode provides an interesting blend of professional and personal life experiences and is packed with valuable advice for anyone aspiring to advance their career in data analytics. Don't miss out on these insights!
Connect with Elijah Butler:
🤝 Connect on Linkedin
📲 Follow on TikTok
🗄️ Join FREE SQL hands-on workshop this December!
⭐ Leave a Podcast review & get your bonus!
🤝 Ace your data analyst interview with the interview simulator
📩 Get my weekly email with helpful data career tips
📊 Come to my next free “How to Land Your First Data Job” training
🏫 Check out my 10-week data analytics bootcamp
Timestamps:
(8:37) - Elijah's Journey becoming a data analyst
(17:00) - Networking matters more than you think
(21:00) - Master the tools
(35:11) - Book recommendations
Connect with Avery:
📺 Subscribe on YouTube
🎙Listen to My Podcast
👔 Connect with me on LinkedIn
🎵 TikTok
Mentioned in this episode: Join the last cohort of 2025! The LAST cohort of The Data Analytics Accelerator for 2025 kicks off on Monday, December 8th and enrollment is officially open!
To celebrate the end of the year, we’re running a special End-of-Year Sale, where you’ll get: ✅ A discount on your enrollment 🎁 6 bonus gifts, including job listings, interview prep, AI tools + more
If your goal is to land a data job in 2026, this is your chance to get ahead of the competition and start strong.
👉 Join the December Cohort & Claim Your Bonuses: https://DataCareerJumpstart.com/daa