talk-data.com

Topic: Data Engineering

Tags: etl · data_pipelines · big_data

1127 tagged activities

Activity Trend: 127 peak/qtr (2020-Q1 to 2026-Q1)

Activities

1127 activities · Newest first

It’s time for another episode of the Data Engineering Central Podcast, our third one! Topics in this episode:

* Should you use DuckDB or Polars?
* Small Engineering Changes (PR Reviews)
* Daft vs Spark on Databricks with Unity Catalog (Delta Lake)
* Primary and Foreign keys in the Lake House

Enjoy!

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit dataengineeringcentral.substack.com/subscribe

Coalesce 2024: How Cox Automotive turbocharged data engineering with ELT

A migration story covering the underlying philosophy and strategic approach behind moving from a low-code ETL tool like Alteryx to a modern data engineering mindset with dbt. This transition is not just about adopting new tools but about embracing a code-first philosophy that promotes software engineering best practices such as modularity, reusability, and transparency.

Speakers: Somnath Chatterjee, Lead Data Engineer, Cox Automotive

Brett Darcy, Lead Software Engineer, Cox Automotive

Read the blog to learn about the latest dbt Cloud features announced at Coalesce, designed to help organizations embrace analytics best practices at scale https://www.getdbt.com/blog/coalesce-2024-product-announcements

Coalesce 2024: Scaling data transformations with dbt: Symend's journey towards efficiency

Symend is a leading customer engagement solution that has adopted dbt at wide scale, with over 70,000 model runs processing 186 million records daily across multiple environments in a multi-tenant architecture. To manage this complexity, they rely on dbt Cloud to streamline processes, and they have overhauled their data operations to achieve unprecedented efficiency, reliability, and speed at massive scale.

Raman will share insights, experiences, and best practices for leveraging dbt at scale, including how Symend has achieved a 70% decrease in daily warehouse computing costs, a 60% reduction in data engineering hours, and a dramatic 90% decrease in debugging issues. In addition, he'll share how using dbt Cloud has not only reduced latency and improved data rebuild times, but also instilled trust in their data and data teams, enabling them to ship data products faster and at lower cost.

Speaker: Ramanpreet Singh, Engineering Manager, Analytics, Symend

Read the blog to learn about the latest dbt Cloud features announced at Coalesce, designed to help organizations embrace analytics best practices at scale https://www.getdbt.com/blog/coalesce-2024-product-announcements

Coalesce 2024: From Core to Cloud: Unlocking dbt at Warner Brothers Discovery (CNN)

Since the beginning of 2024, the Warner Brothers Discovery team supporting the CNN data platform has been undergoing an extensive migration project from dbt Core to dbt Cloud. Concurrently, the team is also segmenting their project into multi-project frameworks utilizing dbt Mesh. In this talk, Zachary will review how this transition has simplified data pipelines, improved pipeline performance and data quality, and made data collaboration at scale more seamless.

He'll discuss how dbt Cloud features like the Cloud IDE, automated testing, documentation, and code deployment have enabled the team to standardize on a single developer platform while also managing dependencies effectively. He'll share details on how the automation framework they built using Terraform streamlines dbt project deployments with dbt Cloud into a "push-button" process. By leveraging an infrastructure-as-code experience, they can orchestrate the creation of environment variables, dbt Cloud jobs, Airflow connections, and AWS secrets with a unified approach that ensures consistency and reliability across projects.
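For illustration, here is a minimal sketch of one building block of such push-button automation, driven from Python against dbt Cloud's v2 Administrative API instead of Terraform. The account/project/environment IDs, job name, and steps are hypothetical placeholders, and the payload fields are assumptions based on the public API rather than details from the talk.

```python
import os
import requests

# Hypothetical IDs -- in the talk's setup these would come from Terraform variables/state.
ACCOUNT_ID = 12345
API_BASE = f"https://cloud.getdbt.com/api/v2/accounts/{ACCOUNT_ID}"
HEADERS = {
    "Authorization": f"Token {os.environ['DBT_CLOUD_API_TOKEN']}",
    "Content-Type": "application/json",
}

def create_job(project_id: int, environment_id: int, name: str, steps: list[str]) -> dict:
    """Create a dbt Cloud job; payload fields are illustrative, per the v2 API docs."""
    payload = {
        "account_id": ACCOUNT_ID,
        "project_id": project_id,
        "environment_id": environment_id,
        "name": name,
        "execute_steps": steps,  # e.g. ["dbt build --select state:modified+"]
        "triggers": {"github_webhook": False, "schedule": False},
        "state": 1,  # active
    }
    resp = requests.post(f"{API_BASE}/jobs/", headers=HEADERS, json=payload)
    resp.raise_for_status()
    return resp.json()["data"]

job = create_job(project_id=678, environment_id=90, name="nightly-build", steps=["dbt build"])
print(f"Created job {job['id']}")
```

In the talk's actual framework, Terraform owns these resources; the point of the sketch is simply that every dbt Cloud object can be created declaratively rather than by clicking through the UI.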

Speakers: Mamta Gupta, Staff Analytics Engineer, Warner Brothers Discovery

Zachary Lancaster, Manager, Data Engineering, Warner Brothers Discovery

Read the blog to learn about the latest dbt Cloud features announced at Coalesce, designed to help organizations embrace analytics best practices at scale https://www.getdbt.com/blog/coalesce-2024-product-announcements

Coalesce 2024: Securing data access with dbt lineage and dbt grants

Carta prioritizes ensuring that their data is accessible only to the right individuals. Previously, they used functional groups to manage data access, but this approach often fell short of governing granular data sets precisely. Recently, Carta developed a new access management system utilizing dbt lineage and dbt grants. These tools enable them to automatically propagate data access tags defined on dbt sources. This system allows them to confidently ensure that individuals have appropriate access to data.
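The session doesn't publish Carta's implementation, but the core idea can be sketched in a few lines of Python: walk dbt's lineage graph (from target/manifest.json) from each tagged source to everything downstream, then emit grants. The tag names and tag-to-role mapping below are illustrative assumptions.

```python
import json
from collections import deque

# dbt writes its lineage graph to target/manifest.json when it compiles a project.
with open("target/manifest.json") as f:
    manifest = json.load(f)

child_map = manifest["child_map"]  # unique_id -> list of downstream unique_ids

# Illustrative mapping from an access tag on a source to a warehouse role.
TAG_TO_ROLE = {"pii": "role_pii_readers", "finance": "role_finance_readers"}

def downstream_of(source_id: str) -> set[str]:
    """BFS through the lineage graph, collecting every node downstream of a source."""
    seen, queue = set(), deque([source_id])
    while queue:
        node = queue.popleft()
        for child in child_map.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# Start from tags declared on dbt sources and emit GRANT statements for models.
for source_id, source in manifest["sources"].items():
    for tag in source.get("tags", []):
        role = TAG_TO_ROLE.get(tag)
        if role is None:
            continue
        for node_id in downstream_of(source_id):
            node = manifest["nodes"].get(node_id)
            if node and node["resource_type"] == "model":
                print(f'GRANT SELECT ON {node["schema"]}.{node["name"]} TO ROLE {role};')
```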

Speaker: Marco Albuquerque, Senior Engineering Manager, Data Engineering, Carta

Read the blog to learn about the latest dbt Cloud features announced at Coalesce, designed to help organizations embrace analytics best practices at scale https://www.getdbt.com/blog/coalesce-2024-product-announcements

Coalesce 2024: How SurveyMonkey sharpens dbt performance and governance with data observability

The data team at SurveyMonkey, the global leader in survey software, oversees heavy data transformation in dbt Cloud, both to power current business-critical projects and to migrate legacy workloads. Much of that transformation work is taking raw data, either from legacy databases or their cloud data warehouse (Snowflake), and making it accessible and useful for downstream users. And to Samiksha Gour, Senior Data Engineering Manager at SurveyMonkey, none of these projects is considered complete unless the proper checks, monitors, and alerts are in place.

Join Samiksha in this informative session as she walks through how her team uses dbt and their data observability platform Monte Carlo to ensure proper governance, gain efficiencies by eliminating duplicate testing and monitoring, and use data lineage to ensure upstream and downstream continuity for users and stakeholders.

Speaker: Samiksha Gour, Senior Data Engineering Manager, SurveyMonkey

Read the blog to learn about the latest dbt Cloud features announced at Coalesce, designed to help organizations embrace analytics best practices at scale https://www.getdbt.com/blog/coalesce-2024-product-announcements

Businesses are collecting more data than ever before. But is bigger always better? Many companies are starting to question whether massive datasets and complex infrastructure are truly delivering results or just adding unnecessary costs and complications. How can you make sure your data strategy is aligned with your actual needs? What if focusing on smaller, more manageable datasets could improve your efficiency and save resources, all while delivering the same insights?

Ryan Boyd is the Co-Founder & VP, Marketing + DevRel at MotherDuck. Ryan started his career as a software engineer, but has since led DevRel teams for 15+ years at Google, Databricks and Neo4j, where he developed and executed numerous marketing and DevRel programs. Prior to MotherDuck, Ryan worked at Databricks, where he focused the team on building an online community during the pandemic: helping to organize the content and experience for an online Data + AI Summit, establishing a regular cadence of video and blog content, launching the Databricks Beacons ambassador program, improving the time to an “aha” moment in the online trial, and launching a University Alliance program to help professors teach the latest in data science, machine learning and data engineering.

In the episode, Richie and Ryan explore data growth and computation, the data 1%, the small data movement, data storage and usage, the shift to local and hybrid computing, modern data tools, the challenges of big data, transactional vs analytical databases, SQL language enhancements, simple and ergonomic data solutions and much more.

Links Mentioned in the Show:
* MotherDuck
* The Small Data Manifesto
* Connect with Ryan
* Small DataSF conference
* Related Episode: Effective Data Engineering with Liya Aizenberg, Director of Data Engineering at Away
* Rewatch sessions from RADAR: AI Edition

New to DataCamp?
* Learn on the go using the DataCamp mobile app
* Empower your business with world-class data and AI skills with DataCamp for business
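The small-data pitch is easy to make concrete: a minimal sketch like the following (the Parquet file name is hypothetical) runs warehouse-style SQL on a laptop with DuckDB, with no cluster, warehouse, or ingestion step involved.

```python
import duckdb

# Query a local Parquet file directly; DuckDB reads the file in place.
con = duckdb.connect()  # in-memory database
daily = con.execute("""
    SELECT date_trunc('day', event_time) AS day,
           count(*)                      AS events,
           count(DISTINCT user_id)       AS users
    FROM 'events.parquet'  -- hypothetical local file
    GROUP BY day
    ORDER BY day
""").df()  # results come back as a pandas DataFrame

print(daily.head())
```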

Summary: In this episode of the Data Engineering Podcast, Adrian Brudaru and Marcin Rudolf, co-founders of dltHub, delve into the principles guiding dlt's development, emphasizing its role as a library rather than a platform, and its integration with lakehouse architectures and AI application frameworks. The episode explores the impact of the Python ecosystem's growth on dlt, highlighting integrations with high-performance libraries and the benefits of Arrow and DuckDB. It concludes with a discussion of the future of dlt, including plans for a portable data lake and the importance of interoperability in data management tools.

Announcements:
* Hello and welcome to the Data Engineering Podcast, the show about modern data management
* Imagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? Learn more at dataengineeringpodcast.com/datafold today!
* Your host is Tobias Macey and today I'm interviewing Adrian Brudaru and Marcin Rudolf, cofounders at dltHub, about the growth of dlt and the numerous ways that you can use it to address the complexities of data integration

Interview:
* Introduction
* How did you get involved in the area of data management?
* Can you describe what dlt is and how it has evolved since we last spoke (September 2023)?
* What are the core principles that guide your work on dlt and dltHub?
* You have taken a very opinionated stance against managed extract/load services. What are the shortcomings of those platforms, and when would you argue in their favor?
* The landscape of data movement has undergone some interesting changes over the past year, most notably the growth of PyAirbyte and the rapid shifts around the needs of generative AI stacks (vector stores, unstructured data processing, etc.). How has that informed your product development and positioning?
* The Python ecosystem, and in particular data-oriented Python, has also undergone substantial evolution. What are the developments in the libraries and frameworks that you have been able to benefit from?
* What are some of the notable investments that you have made in the developer experience for building dlt pipelines?
* How have the interfaces for source/destination development improved?
* You recently published a post about the idea of a portable data lake. What are the missing pieces that would make that possible, and what are the developments/technologies that put that idea within reach?
* What is your strategy for building a sustainable product on top of dlt?
* How does that strategy help to form a "virtuous cycle" of improving the open source foundation?
* What are the most interesting, innovative, or unexpected ways that you have seen dlt used?
* What are the most interesting, unexpected, or challenging lessons that you have learned while working on dlt?
* When is dlt the wrong choice?
* What do you have planned for the future of dlt/dltHub?

Contact Info: Adrian (LinkedIn) · Marcin (LinkedIn)

Parting Question: From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements:
* Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
* Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
* If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links: dlt (Podcast Episode) · PyArrow · Polars · Ibis · DuckDB (Podcast Episode) · dlt Data Contracts · RAG == Retrieval Augmented Generation (AI Engineering Podcast Episode) · PyAirbyte · OpenAI o1 Model · LanceDB · QDrant Embedded · Airflow · GitHub Actions · Arrow DataFusion · Apache Arrow · PyIceberg · Delta-RS · SCD2 == Slowly Changing Dimensions · SQLAlchemy · SQLGlot · FSSpec · Pydantic · Spacy · Entity Recognition · Parquet File Format · Python Decorator · REST API Toolkit · OpenAPI Connector Generator · ConnectorX · Python no-GIL · Delta Lake (Podcast Episode) · SQLMesh (Podcast Episode) · Hamilton · Tabular · PostHog (Podcast.init Episode) · AsyncIO · Cursor.AI · Data Mesh (Podcast Episode) · FastAPI · LangChain · GraphRAG (AI Engineering Podcast Episode) · Property Graph · Python uv

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
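To ground the "library, not a platform" framing, here is a minimal sketch of a dlt pipeline loading plain Python records into DuckDB. The pipeline, dataset, and table names are invented for illustration, and the call shapes follow dlt's documented basics rather than anything specific from the episode.

```python
import dlt

# dlt is "just a library": define a pipeline object and call run() on plain Python data.
pipeline = dlt.pipeline(
    pipeline_name="demo_pipeline",  # illustrative names
    destination="duckdb",
    dataset_name="raw",
)

users = [
    {"id": 1, "name": "Ada", "tags": ["admin"]},
    {"id": 2, "name": "Grace", "tags": []},
]

# Schema inference, normalization of nested fields, and loading happen inside run().
info = pipeline.run(users, table_name="users")
print(info)
```

Because it is just a library, the same script can run unchanged in a notebook, an Airflow task, or a GitHub Actions job.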

Data Engineering Best Practices

Unlock the secrets to building scalable and efficient data architectures with 'Data Engineering Best Practices.' This book provides in-depth guidance on designing, implementing, and optimizing cloud-based data pipelines. You will gain valuable insights into best practices, agile workflows, and future-proof designs.

What this book will help me do:
* Effectively plan and architect scalable data solutions leveraging cloud-first strategies.
* Master agile processes tailored to data engineering for improved project outcomes.
* Implement secure, efficient, and reliable data pipelines optimized for analytics and AI.
* Apply real-world design patterns and avoid common pitfalls in data flow and processing.
* Create future-ready data engineering solutions following industry-proven frameworks.

Author(s): Richard J. Schiller and David Larochelle are seasoned data engineering experts with decades of experience crafting efficient and secure cloud-based infrastructures. Their collaborative writing distills years of real-world expertise into practical advice aimed at helping engineers succeed in a rapidly evolving field.

Who is it for? This book is ideal for data engineers, ETL specialists, and big data professionals seeking to enhance their knowledge in cloud-based solutions. Some familiarity with data engineering, ETL pipelines, and big data technologies is helpful. It suits those keen on mastering advanced practices, improving agility, and developing efficient data pipelines. Perfect for anyone looking to future-proof their skills in data engineering.

Navnit Shukla is a solutions architect with AWS. He joins me to chat about data wrangling and architecting solutions on AWS, writing books, and much more.

Navnit is also in the Coursera Data Engineering Specialization, dropping knowledge on data engineering on AWS. Check it out!

Data Wrangling on AWS: https://www.amazon.com/Data-Wrangling-AWS-organize-analysis/dp/1801810907

LinkedIn: https://www.linkedin.com/in/navnitshukla/

Financial Data Engineering

Today, investment in financial technology and digital transformation is reshaping the financial landscape and generating many opportunities. Too often, however, engineers and professionals in financial institutions lack a practical and comprehensive understanding of the concepts, problems, techniques, and technologies necessary to build a modern, reliable, and scalable financial data infrastructure. This is where financial data engineering is needed. A data engineer developing a data infrastructure for a financial product possesses not only technical data engineering skills but also a solid understanding of financial domain-specific challenges, methodologies, data ecosystems, providers, formats, technological constraints, identifiers, entities, standards, regulatory requirements, and governance. This book offers a comprehensive, practical, domain-driven approach to financial data engineering, featuring real-world use cases, industry practices, and hands-on projects.

You'll learn:
* The data engineering landscape in the financial sector
* Specific problems encountered in financial data engineering
* The structure, players, and particularities of the financial data domain
* Approaches to designing financial data identification and entity systems
* Financial data governance frameworks, concepts, and best practices
* The financial data engineering lifecycle from ingestion to production
* The varieties and main characteristics of financial data workflows
* How to build financial data pipelines using open source tools and APIs

Tamer Khraisha, PhD, is a senior data engineer and scientific author with more than a decade of experience in the financial sector.
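As a small worked example of the identification-systems theme, the check digit of an ISIN (the 12-character International Securities Identification Number) can be verified with a Luhn checksum over its letter-expanded digits. A minimal sketch in Python:

```python
def is_valid_isin(isin: str) -> bool:
    """Validate an ISIN such as 'US0378331005' (country code + security id + check digit)."""
    isin = isin.strip().upper()
    if len(isin) != 12 or not isin[:2].isalpha() or not isin.isalnum():
        return False
    # Expand letters to numbers (A=10 ... Z=35), then run the Luhn algorithm.
    digits = "".join(str(int(ch, 36)) for ch in isin)
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d = d * 2 - 9 if d > 4 else d * 2
        total += d
    return total % 10 == 0

assert is_valid_isin("US0378331005")      # Apple's ISIN passes
assert not is_valid_isin("US0378331006")  # a flipped check digit fails
```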

Summary: In this episode of the Data Engineering Podcast Lukas Schulte, co-founder and CEO of SDF, explores the development and capabilities of this fast and expressive SQL transformation tool. From its origins as a solution for addressing data privacy, governance, and quality concerns in modern data management, to its unique features like static analysis and type correctness, Lukas dives into what sets SDF apart from other tools like dbt and SQLMesh. Tune in for insights on building a business around a developer tool, the importance of community and user experience in the data engineering ecosystem, and plans for future development, including supporting Python models and enhancing execution capabilities.

Announcements:
* Hello and welcome to the Data Engineering Podcast, the show about modern data management
* Imagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? Learn more at dataengineeringpodcast.com/datafold today!
* Your host is Tobias Macey and today I'm interviewing Lukas Schulte about SDF, a fast and expressive SQL transformation tool that understands your schema

Interview:
* Introduction
* How did you get involved in the area of data management?
* Can you describe what SDF is and the story behind it?
* What's the story behind the name?
* What problem are you solving with SDF?
* dbt has been the dominant player for SQL-based transformations for several years, with other notable competition in the form of SQLMesh. Can you give an overview of the venn diagram for features and functionality across SDF, dbt and SQLMesh?
* Can you describe the design and implementation of SDF?
* How have the scope and goals of the project changed since you first started working on it?
* What does the development experience look like for a team working with SDF?
* How does that differ between the open and paid versions of the product?
* What are the features and functionality that SDF offers to address intra- and inter-team collaboration?
* One of the challenges for any second-mover technology with an established competitor is the adoption/migration path for teams who have already invested in the incumbent (dbt in this case). How are you addressing that barrier for SDF?
* Beyond the core migration path of the direct functionality of the incumbent product is the amount of tooling and communal knowledge that grows up around that product. How are you thinking about that aspect of the current landscape?
* What is your governing principle for what capabilities are in the open core and which go in the paid product?
* What are the most interesting, innovative, or unexpected ways that you have seen SDF used?
* What are the most interesting, unexpected, or challenging lessons that you have learned while working on SDF?
* When is SDF the wrong choice?
* What do you have planned for the future of SDF?

Contact Info: LinkedIn

Parting Question: From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links: SDF · Semantic Data Warehouse · asdf-vm · dbt · Software Linting · SQLMesh (Podcast Episode) · Coalesce (Podcast Episode) · Apache Iceberg (Podcast Episode) · DuckDB (Podcast Episode) · SDF Classifiers · dbt Semantic Layer · dbt expectations · Apache Datafusion · Ibis

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Welcome to the Data Engineering Central Podcast, a no-holds-barred discussion on the Data Landscape. Welcome to Episode 02! In today’s episode, we will talk about the following topics from the Data Engineering perspective:

* Using OpenAI’s o1 Model to do Data Engineering work
* Lord Save us from more ETL tools
* Rust for the small things
* Hosted (SaaS) vs Build

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit dataengineeringcentral.substack.com/subscribe

The sheer number of tools and technologies that can infiltrate your work processes can be overwhelming. Choosing the right ones to invest in is critical, but how do you know where to start? What steps should you take to build a solid, scalable data infrastructure that can handle the growth of your business? And with AI becoming a central focus for many organizations, how can you ensure that your data strategy is aligned to support these initiatives? It’s no longer just about managing data; it’s about future-proofing your organization.

Taylor Brown is the COO and Co-Founder of Fivetran, the global leader in data movement. With a vision to simplify data connectivity and accessibility, Taylor has been instrumental in transforming the way organizations manage their data infrastructure. Fivetran has grown rapidly, becoming a trusted partner for thousands of companies worldwide. Taylor's expertise in technology and business strategy has positioned Fivetran at the forefront of the data integration industry, driving innovation and empowering businesses to harness the full potential of their data. Prior to Fivetran, Taylor honed his skills in various tech startups, bringing a wealth of experience and a passion for problem-solving to his entrepreneurial ventures.

In the episode, Richie and Taylor explore the biggest challenges in data engineering, how to find the right tools for your data stack, defining the modern data stack, federated data, data fabrics, data meshes, data strategy vs organizational structure, self-service data, data democratization, AI’s impact on data and much more.

Links Mentioned in the Show:
* Fivetran
* Connect with Taylor
* Career Track: Data Engineer in Python
* Related Episode: Effective Data Engineering with Liya Aizenberg, Director of Data Engineering at Away
* Rewatch sessions from RADAR: AI Edition

New to DataCamp?
* Learn on the go using the DataCamp mobile app
* Empower your business with world-class data and AI skills with DataCamp for business

Data Engineering for Machine Learning Pipelines: From Python Libraries to ML Pipelines and Cloud Platforms

This book covers modern data engineering functions and important Python libraries, to help you develop state-of-the-art ML pipelines and integration code.

The book begins by explaining data analytics and transformation, delving into the Pandas library, its capabilities, and nuances. It then explores emerging libraries such as Polars and CuDF, providing insights into GPU-based computing and cutting-edge data manipulation techniques. The text discusses the importance of data validation in engineering processes, introducing tools such as Great Expectations and Pandera to ensure data quality and reliability. The book delves into API design and development, with a specific focus on leveraging the power of FastAPI. It covers authentication, authorization, and real-world applications, enabling you to construct efficient and secure APIs using FastAPI. Also explored is concurrency in data engineering, examining Dask's capabilities from basic setup to crafting advanced machine learning pipelines. The book includes development and delivery of data engineering pipelines using leading cloud platforms such as AWS, Google Cloud, and Microsoft Azure. The concluding chapters concentrate on real-time and streaming data engineering pipelines, emphasizing Apache Kafka and workflow orchestration in data engineering. Workflow tools such as Airflow and Prefect are introduced to seamlessly manage and automate complex data workflows.

What sets this book apart is its blend of theoretical knowledge and practical application, a structured path from basic to advanced concepts, and insights into using state-of-the-art tools. With this book, you gain access to cutting-edge techniques and insights that are reshaping the industry. This book is not just an educational tool. It is a career catalyst, and an investment in your future as a data engineering expert, poised to meet the challenges of today's data-driven world.

What You Will Learn:
* Elevate your data wrangling jobs by utilizing the power of both CPU and GPU computing, and learn to process data using Pandas 2.0, Polars, and CuDF at unprecedented speeds
* Design data validation pipelines, construct efficient data service APIs, develop real-time streaming pipelines and master the art of workflow orchestration to streamline your engineering projects
* Leverage concurrent programming to develop machine learning pipelines and get hands-on experience in development and deployment of machine learning pipelines across AWS, GCP, and Azure

Who This Book Is For: Data analysts, data engineers, data scientists, machine learning engineers, and MLOps specialists
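For a flavor of the data validation theme, here is a minimal Pandera sketch; the column names and checks are invented for illustration, and Great Expectations covers similar ground with suites of "expectations".

```python
import pandas as pd
import pandera as pa
from pandera import Column, Check

# Declare the contract the data must satisfy (illustrative columns and checks).
schema = pa.DataFrameSchema(
    {
        "order_id": Column(int, Check.gt(0), unique=True),
        "amount": Column(float, Check.ge(0)),
        "status": Column(str, Check.isin(["open", "shipped", "cancelled"])),
    }
)

df = pd.DataFrame(
    {
        "order_id": [1, 2, 3],
        "amount": [9.99, 0.0, 42.5],
        "status": ["open", "shipped", "cancelled"],
    }
)

validated = schema.validate(df)  # raises a SchemaError on any violation
print(validated)
```

Dropping a validation step like this in front of a pipeline turns silent data quality drift into a loud, early failure.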

Summary: Airbyte is one of the most prominent platforms for data movement. Over the past 4 years they have invested heavily in solutions for scaling the self-hosted and cloud operations, as well as the quality and stability of their connectors. As a result of that hard work, they have declared their commitment to the future of the platform with a 1.0 release. In this episode Michel Tricot shares the highlights of their journey and the exciting new capabilities that are coming next.

Announcements:
* Hello and welcome to the Data Engineering Podcast, the show about modern data management
* Your host is Tobias Macey and today I'm interviewing Michel Tricot about the journey to the 1.0 launch of Airbyte and what that means for the project

Interview:
* Introduction
* How did you get involved in the area of data management?
* Can you describe what Airbyte is and the story behind it?
* What are some of the notable milestones that you have traversed on your path to the 1.0 release?
* The ecosystem has gone through some significant shifts since you first launched Airbyte. How have trends such as generative AI, the rise and fall of the "modern data stack", and the shifts in investment impacted your overall product and business strategies?
* What are some of the hard-won lessons that you have learned about the realities of data movement and integration?
* What are some of the most interesting/challenging/surprising edge cases or performance bottlenecks that you have had to address?
* What are the core architectural decisions that have proven to be effective?
* How has the architecture had to change as you progressed to the 1.0 release?
* A 1.0 version signals a degree of stability and commitment. Can you describe the decision process that you went through in committing to a 1.0 version?
* What are the most interesting, innovative, or unexpected ways that you have seen Airbyte used?
* What are the most interesting, unexpected, or challenging lessons that you have learned while working on Airbyte?
* When is Airbyte the wrong choice?
* What do you have planned for the future of Airbyte after the 1.0 launch?

Contact Info: LinkedIn

Parting Question: From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements:
* Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
* Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
* If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links: Airbyte (Podcast Episode) · Airbyte Cloud · Airbyte Connector Builder · Singer Protocol · Airbyte Protocol · Airbyte CDK · Modern Data Stack · ELT · Vector Database · dbt · Fivetran (Podcast Episode) · Meltano (Podcast Episode) · dlt · Reverse ETL · GraphRAG (AI Engineering Podcast Episode)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Uncle Rico is a character in the movie Napoleon Dynamite who is stuck in the past, reminiscing about his days as a high school football star. If only he'd won the game and gone to the state championship. Some of the data industry reminds me of Uncle Rico.

During a recent panel, there was a question about whether AI can help with data management (governance, modeling, etc.).

Some people were quick to dismiss this, saying that machines are no substitute for humans when it comes to understanding "the business" and translating it into data.

Yet why are we still perpetually stuck in the mode of "80% of data projects fail"? Might AI/ML help data management move out of its rut? Or will it stay stuck in the past?

Also, please check out my new data engineering course on Coursera!

https://www.coursera.org/learn/intro-to-data-engineering

If you're working on or trying to break into a career in Data Science or Data Engineering, this one is for you. In this episode, Data Engineering expert and recovering Data Scientist Joe Reis shares some of his best tips and strategies for folks looking to launch or accelerate their data careers. You'll leave with practical and actionable advice that you can use to take your career to the next level.

What You'll Learn:
* Key differences between Analytics, Data Science, and Data Engineering
* The top skills and tools to focus on for each of these career paths
* How rapidly changing technology like AI is impacting the future of data jobs

Register for free to be part of the next live session: https://bit.ly/3XB3A8b

About our guest: Joe Reis is a "recovering data scientist" and the co-founder & CEO of Ternary Data. Links: Joe's newest course · Fundamentals of Data Engineering Book · Follow Joe on LinkedIn

Follow us on Socials: LinkedIn · YouTube · Instagram (Mavens of Data) · Instagram (Maven Analytics) · TikTok · Facebook · Medium · X/Twitter

Face To Face
by Jesse Anderson (Big Data Institute)

The data landscape is fickle, and once-coveted roles like "DBA" and "Data Scientist" have faced challenges. Now, the spotlight shines on Data Engineers, but will they suffer the same fate? 

This talk dives into historical trends.

In the early 2010s, DBA/data warehouse was the sexiest job. Data Warehouse became the “No Team.”

In the mid-2010s, data scientist was the sexiest job. Data Science became the “mistaken for” team.

Now, data engineering is the sexiest job, and Data Engineering has become the “confused team.” The confusion runs rampant, with questions across the industry: What is a data engineer? What do they do? Should we have all kinds of nuanced titles for variations? Just how technical should they be?

Together, let’s look back at history for ways data engineering can avoid the same fate as data warehousing and data science.

This talk provides a thought-provoking discussion on navigating the exciting yet challenging world of data engineering. Let's avoid the pitfalls of the past and shape a future where data engineers thrive as essential drivers of innovation and success.

Main Takeaways:

● We need to look back at the history of data teams to avoid repeating their mistakes

● Data Engineering is repeating the mistakes of Data Science and Data Warehousing

● Actionable insights to help data engineering avoid a similar fate

In the era of AI-driven applications, personalization is paramount. This talk explores the concept of Full RAG (Retrieval-Augmented Generation) and its potential to revolutionize user experiences across industries. We examine four levels of context personalization, from basic recommendations to highly tailored, real-time interactions.

The presentation demonstrates how increasing levels of context - from batch data to streaming and real-time inputs - can dramatically improve AI model outputs. We discuss the challenges of implementing sophisticated context personalization, including data engineering complexities and the need for efficient, scalable solutions.

Introducing the concept of a Context Platform, we showcase how tools like Tecton can simplify the process of building, deploying, and managing personalized context at scale. Through practical examples in travel recommendations, we illustrate how developers can easily create and integrate batch, streaming, and real-time context using simple Python code, enabling more engaging and valuable AI-powered experiences.
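The Tecton-specific code from the talk isn't reproduced here, but a framework-agnostic sketch of the levels-of-context idea (a batch profile, streaming aggregates, and real-time signals merged into one prompt) could look like the following. Every function and field below is a hypothetical stand-in, not Tecton's API.

```python
from datetime import datetime, timezone

# Hypothetical context sources, ordered by freshness.
def batch_profile(user_id: str) -> dict:
    # Levels 1-2: precomputed nightly in a batch job (e.g., from a warehouse).
    return {"home_airport": "SFO", "loyalty_tier": "gold"}

def streaming_aggregates(user_id: str) -> dict:
    # Level 3: rolling aggregates maintained from a click stream.
    return {"destinations_viewed_1h": ["LIS", "BCN", "LIS"]}

def realtime_signals(request: dict) -> dict:
    # Level 4: computed from the live request itself at serving time.
    return {"device": request["device"], "request_time": datetime.now(timezone.utc).isoformat()}

def build_context(user_id: str, request: dict) -> str:
    """Merge all context levels into one block for a RAG/LLM prompt."""
    ctx = {**batch_profile(user_id), **streaming_aggregates(user_id), **realtime_signals(request)}
    return "\n".join(f"{key}: {value}" for key, value in ctx.items())

prompt = (
    "Recommend a weekend trip for this traveler.\n"
    "--- context ---\n" + build_context("user-42", {"device": "ios"})
)
print(prompt)
```

A Context Platform's job, in these terms, is to keep each of those functions fresh, consistent, and fast to serve at scale rather than hand-rolled per application.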