AWS re:Invent 2024 - Innovations in AWS analytics: Zero-ETL and data integrations (ANT348)

2024-12-06 · AWS re:Invent 2024 Watch

video

by Harshida Patel (AWS) , Paul Van Liew (Motive) , Jyoti Aggarwal (AWS)

Agile/Scrum AI/ML Analytics AWS Cloud Computing GenAI

Join this session to learn how AWS analytics services can help you achieve your data integration goals with exceptional price performance. Explore new capabilities, like zero-ETL integrations, that allow your users to access all their data; easily prepare it for analytics, machine learning, and generative AI workloads; build and maintain scalable and resilient data pipelines; and enhance decision-making quality.

Learn more: AWS re:Invent: https://go.aws/reinvent. More AWS events: https://go.aws/3kss9CP

Subscribe: More AWS videos: http://bit.ly/2O3zS75 More AWS events videos: http://bit.ly/316g9t4

About AWS: Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts. AWS is the world's most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.

AWSreInvent #AWSreInvent2024

AWS re:Invent 2024 - Deep dive into Amazon Aurora and its innovations (DAT405)

2024-12-05 · AWS re:Invent 2024 Watch

video

by Grant McAlister (AWS)

Agile/Scrum AI/ML AWS Aurora Cloud Computing GenAI MySQL Redshift postgresql

With an innovative architecture that decouples compute from storage and advanced features like Global Database and low-latency read replicas, Amazon Aurora reimagines what it means to be a relational database. Aurora is a modern database service offering unparalleled performance and high availability at scale with full open source MySQL and PostgreSQL compatibility. In this session, dive deep into the most exciting new features that Aurora offers, including Aurora Limitless Database, Aurora I/O-Optimized, Aurora zero-ETL integration with Amazon Redshift, and Aurora Serverless v2. Additionally, learn how the addition of the pgvector extension allows for the storage of vector embeddings and support of vector similarity searches for generative AI.

Learn more: AWS re:Invent: https://go.aws/reinvent. More AWS events: https://go.aws/3kss9CP

Subscribe: More AWS videos: http://bit.ly/2O3zS75 More AWS events videos: http://bit.ly/316g9t4

About AWS: Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts. AWS is the world's most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.

AWSreInvent #AWSreInvent2024

AWS re:Invent 2024 - Accelerate value from data: Migrating from batch to stream processing (ANT324)

2024-12-04 · AWS re:Invent 2024 Watch

video

by Usama Naseem (Amazon MSK, AWS) , Sofia Zilberman (AWS)

Agile/Scrum AI/ML Analytics AWS Cloud Computing DWH GenAI Redshift Data Streaming

Growing business needs for incorporating real-time insights into conventional use cases is pushing the data transformation envelope from batch processing to streaming. From gaming to clickstream to generative AI use cases, batch analytical workloads today want high throughput, low latency, and simplified ingestion mechanisms for real-time insights and visualizations. Join this session to hear from experts on how to successfully migrate from batch to stream processing using AWS streaming services that provide scalable integrations and real-time capabilities across services such as Amazon Redshift for real-time data warehousing analytics and ELT pipelines.

Learn more: AWS re:Invent: https://go.aws/reinvent. More AWS events: https://go.aws/3kss9CP

Subscribe: More AWS videos: http://bit.ly/2O3zS75 More AWS events videos: http://bit.ly/316g9t4

About AWS: Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts. AWS is the world's most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.

AWSreInvent #AWSreInvent2024

The Art of Database Selection and Evolution

2024-12-01 · Data Engineering Podcast Listen

podcast_episode

by Sam Kleinman , Tobias Macey

AI/ML Data Engineering Data Management Datafold DuckDB Lance MongoDB MySQL Python postgresql

Summary In this episode of the Data Engineering Podcast Sam Kleinman talks about the pivotal role of databases in software engineering. Sam shares his journey into the world of data and discusses the complexities of database selection, highlighting the trade-offs between different database architectures and how these choices affect system design, query performance, and the need for ETL processes. He emphasizes the importance of understanding specific requirements to choose the right database engine and warns against over-engineering solutions that can lead to increased complexity. Sam also touches on the tendency of engineers to move logic to the application layer due to skepticism about database longevity and advises teams to leverage database capabilities instead. Finally, he identifies a significant gap in data management tooling: the lack of easy-to-use testing tools for database interactions, highlighting the need for better testing paradigms to ensure reliability and reduce bugs in data-driven applications.

Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementIt’s 2024, why are we still doing data migrations by hand? Teams spend months—sometimes years—manually converting queries and validating data, burning resources and crushing morale. Datafold's AI-powered Migration Agent brings migrations into the modern era. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today to learn how Datafold can automate your migration and ensure source to target parity. Your host is Tobias Macey and today I'm interviewing Sam Kleinman about database tradeoffs across operating environments and axes of scaleInterview IntroductionHow did you get involved in the area of data management?The database engine you use has a substantial impact on how you architect your overall system. When starting a greenfield project, what do you see as the most important factor to consider when selecting a database?points of friction introduced by database capabilitiesembedded databases (e.g. SQLite, DuckDB, LanceDB), when to use and when do they become a bottlenecksingle-node database engines (e.g. Postgres, MySQL), when are they legitimately a problemdistributed databases (e.g. CockroachDB, PlanetScale, MongoDB)polyglot storage vs. general-purpose/multimodal databasesfederated queries, benefits and limitations ease of integration vs. variability of performance and access control Contact Info LinkedInGitHubParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.Links MongoDBNeonPodcast EpisodeGlareDBNoSQLS3 Conditional WriteEvent driven architectureCockroachDBCouchbaseCassandraThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Data Engineering with AWS Cookbook

2024-11-29 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Viquar Khan , Trâm Ngọc Phạm , Gonzalo Herreros González , Huda Nofal

Analytics Athena AWS Amazon EMR AWS Glue Big Data Cloud Computing Data Engineering Data Lake QuickSight Redshift data +1 more

Data Engineering with AWS Cookbook serves as a comprehensive practical guide for building scalable and efficient data engineering solutions using AWS. With this book, you will master implementing data lakes, orchestrating data pipelines, and creating serving layers using AWS's robust services, such as Glue, EMR, Redshift, and Athena. With hands-on exercises and practical recipes, you will enhance your AWS-based data engineering projects. What this Book will help me do Gain the skills to design centralized data lake solutions and manage them securely at scale. Develop expertise in crafting data pipelines with AWS's ETL technologies like Glue and EMR. Learn to implement and automate governance, orchestration, and monitoring for data platforms. Build high-performance data serving layers using AWS analytics tools like Redshift and QuickSight. Effectively plan and execute data migrations to AWS from on-premises infrastructure. Author(s) Trâm Ngọc Phạm, Gonzalo Herreros González, Viquar Khan, and Huda Nofal bring together years of collective experience in data engineering and AWS cloud solutions. Each author's deep knowledge and passion for cloud technology have shaped this book into a valuable resource, geared towards practical learning and real-world application. Their approach ensures readers are not just learning but building tangible, impactful solutions. Who is it for? This book is geared towards data engineers and big data professionals engaged in or transitioning to cloud-based environments, specifically on AWS. Ideal readers are those looking to optimize workflows and master AWS tools to create scalable, efficient solutions. The content assumes a basic familiarity with AWS concepts like IAM roles and a command-line interface, ensuring all examples are accessible yet meaningful for those seeking advancement in AWS data engineering.

Bridging Code and UI in Data Orchestration with Kestra

2024-11-26 · Data Engineering Podcast Listen

podcast_episode

by Anna Geller , Tobias Macey

AI/ML Analytics API CI/CD Collibra Data Engineering Data Management Datafold Kestra Kubernetes

Summary In this episode of the Data Engineering Podcast, Anna Geller talks about the integration of code and UI-driven interfaces for data orchestration. Anna defines data orchestration as automating the coordination of workflow nodes that interact with data across various business functions, discussing how it goes beyond ETL and analytics to enable real-time data processing across different internal systems. She explores the challenges of using existing scheduling tools for data-specific workflows, highlighting limitations and anti-patterns, and discusses Kestra's solution, a low-code orchestration platform that combines code-driven flexibility with UI-driven simplicity. Anna delves into Kestra's architectural design, API-first approach, and pluggable infrastructure, and shares insights on balancing UI and code-driven workflows, the challenges of open-core business models, and innovative user applications of Kestra's platform.

Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementData migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.As a listener of the Data Engineering Podcast you clearly care about data and how it affects your organization and the world. For even more perspective on the ways that data impacts everything around us you should listen to Data Citizens® Dialogues, the forward-thinking podcast from the folks at Collibra. You'll get further insights from industry leaders, innovators, and executives in the world's largest companies on the topics that are top of mind for everyone. They address questions around AI governance, data sharing, and working at global scale. In particular I appreciate the ability to hear about the challenges that enterprise scale businesses are tackling in this fast-moving field. While data is shaping our world, Data Citizens Dialogues is shaping the conversation. Subscribe to Data Citizens Dialogues on Apple, Spotify, Youtube, or wherever you get your podcasts.Your host is Tobias Macey and today I'm interviewing Anna Geller about incorporating both code and UI driven interfaces for data orchestrationInterview IntroductionHow did you get involved in the area of data management?Can you start by sharing a definition of what constitutes "data orchestration"?There are many orchestration and scheduling systems that exist in other contexts (e.g. CI/CD systems, Kubernetes, etc.). Those are often adapted to data workflows because they already exist in the organizational context. What are the anti-patterns and limitations that approach introduces in data workflows?What are the problems that exist in the opposite direction of using data orchestrators for CI/CD, etc.?Data orchestrators have been around for decades, with many different generations and opinions about how and by whom they are used. What do you see as the main motivation for UI vs. code-driven workflows?What are the benefits of combining code-driven and UI-driven capabilities in a single orchestrator?What constraints does it necessitate to allow for interoperability between those modalities?Data Orchestrators need to integrate with many external systems. How does Kestra approach building integrations and ensure governance for all their underlying configurations?Managing workflows at scale across teams can be challenging in terms of providing structure and visibility of dependencies across workflows and teams. What features does Kestra offer so that all pipelines and teams stay organised?What are

An Opinionated Look At End-to-end Code Only Analytical Workflows With Bruin

2024-11-11 · Data Engineering Podcast Listen

podcast_episode

by Burak Karakan (Bruin) , Tobias Macey

AI/ML Analytics Dashboard Data Engineering Data Management Datafold Modern Data Stack Python SQL

Summary The challenges of integrating all of the tools in the modern data stack has led to a new generation of tools that focus on a fully integrated workflow. At the same time, there have been many approaches to how much of the workflow is driven by code vs. not. Burak Karakan is of the opinion that a fully integrated workflow that is driven entirely by code offers a beneficial and productive means of generating useful analytical outcomes. In this episode he shares how Bruin builds on those opinions and how you can use it to build your own analytics without having to cobble together a suite of tools with conflicting abstractions.

Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementImagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? Learn more at dataengineeringpodcast.com/datafold today!Your host is Tobias Macey and today I'm interviewing Burak Karakan about the benefits of building code-only data systemsInterview IntroductionHow did you get involved in the area of data management?Can you describe what Bruin is and the story behind it?Who is your target audience?There are numerous tools that address the ETL workflow for analytical data. What are the pain points that you are focused on for your target users?How does a code-only approach to data pipelines help in addressing the pain points of analytical workflows?How might it act as a limiting factor for organizational involvement?Can you describe how Bruin is designed?How have the design and scope of Bruin evolved since you first started working on it?You call out the ability to mix SQL and Python for transformation pipelines. What are the components that allow for that functionality?What are some of the ways that the combination of Python and SQL improves ergonomics of transformation workflows?What are the key features of Bruin that help to streamline the efforts of organizations building analytical systems?Can you describe the workflow of someone going from source data to warehouse and dashboard using Bruin and Ingestr?What are the opportunities for contributions to Bruin and Ingestr to expand their capabilities?What are the most interesting, innovative, or unexpected ways that you have seen Bruin and Ingestr used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on Bruin?When is Bruin the wrong choice?What do you have planned for the future of Bruin?Contact Info LinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.Links BruinFivetranStitchIngestrBruin CLIMeltanoSQLGlotdbtSQLMeshPodcast EpisodeSDFPodcast EpisodeAirflowDagsterSnowparkAtlanEvidenceThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Coalesce 2024: How Docusign migrated legacy SQL or stored procedures to dbt models

2024-10-17 · Dbt Coalesce 2024 Watch

video

by Bishal Gupta (Docusign)

Analytics Cloud Computing Data Quality dbt SQL

As organizations evolve, many still rely on legacy SQL queries and stored procedures that can become bottlenecks in scaling data infrastructure. In this talk, we will explore how to modernize these workflows by migrating legacy SQL and stored procedures into dbt models, enabling more efficient, scalable, and version-controlled data transformations. We’ll discuss practical strategies for refactoring complex logic, ensuring data lineage, data quality and unit testing benefits, and improving collaboration among teams. This session is ideal for data and analytics engineers, analysts, and anyone looking to optimize their ETL workflows using dbt.

Speaker: Bishal Gupta

Read the blog to learn about the latest dbt Cloud features announced at Coalesce, designed to help organizations embrace analytics best practices at scale https://www.getdbt.com/blog/coalesce-2024-product-announcements

Coalesce 2024: Generative AI driven near-real-time operational analytics with zero-ETL and dbt Cloud

2024-10-17 · Dbt Coalesce 2024 Watch

video

by Neeraja Rentachintala (Amazon) , Neela Kulkarni (AWS)

AI/ML Analytics AWS Aurora CI/CD Cloud Computing Data Governance dbt GenAI Redshift

AWS offers the most scalable, highest performing data services to keep up with the growing volume and velocity of data to help organizations to be data-driven in real-time. AWS helps customers unify diverse data sources by investing in a zero ETL future and enable end-to-end data governance so your teams are free to move faster with data. Data teams running dbt Cloud are able to deploy analytics code, following software engineering best practices such as modularity, continuous integration and continuous deployment (CI/CD), and embedded documentation. In this session, we will dive deeper into how to get near real-time insight on petabytes of transaction data using Amazon Aurora zero-ETL integration with Amazon Redshift and dbt Cloud for your Generative AI workloads.

Speakers: Neela Kulkarni Solutions Architect AWS

Neeraja Rentachintala Director, Product Management Amazon

Read the blog to learn about the latest dbt Cloud features announced at Coalesce, designed to help organizations embrace analytics best practices at scale https://www.getdbt.com/blog/coalesce-2024-product-announcements

Coalesce 2024: How Cox Automotive turbocharged data engineering with ELT

2024-10-16 · Dbt Coalesce 2024 Watch

video

by Somnath Chatterjee (Cox Automotive) , Brett Darcy (Cox Automotive)

Alteryx Analytics Cloud Computing Data Engineering dbt

A migration story with the underlying philosophy, and strategic approach to move from a low-code ETL tool like Alteryx to a modern data engineering mindset with dbt. This transition is not just about adopting new tools but embracing a code-first philosophy that promotes best practices in software engineering, such as modularity, reusability, and transparency.

Speakers: Somnath Chatterjee Lead Data Engineer Cox Automotive

Brett Darcy Lead Software Engineer Cox Automotive

Read the blog to learn about the latest dbt Cloud features announced at Coalesce, designed to help organizations embrace analytics best practices at scale https://www.getdbt.com/blog/coalesce-2024-product-announcements

Coalesce 2024: How USAA moves quickly and manages vulnerabilities with dbt Cloud

2024-10-16 · Dbt Coalesce 2024 Watch

video

by Kit Alderson (USAA) , Ted Douglass (USAA)

Analytics CI/CD Cloud Computing dbt Python

As part of a rapid modernization initiative, USAA Property and Casualty migrated from a legacy, GUI-based ETL tool and on-prem servers to dbt Cloud and a cloud database. Adopting dbt Cloud enabled near real time data delivery, but dbt Python models opened the door to dependency management. In this session, USAA shares how to go fast and manage vulnerabilities.

Speakers: Kit Alderson USAA Data Engineer, CI/CD Wizard USAA

Ted Douglass Senior Data Engineer USAA

Read the blog to learn about the latest dbt Cloud features announced at Coalesce, designed to help organizations embrace analytics best practices at scale https://www.getdbt.com/blog/coalesce-2024-product-announcements

Data Engineering Best Practices

2024-10-11 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Richard J. Schiller , David Larochelle

Agile/Scrum AI/ML Analytics Big Data Cloud Computing Data Engineering data data-engineering

Unlock the secrets to building scalable and efficient data architectures with 'Data Engineering Best Practices.' This book provides in-depth guidance on designing, implementing, and optimizing cloud-based data pipelines. You will gain valuable insights into best practices, agile workflows, and future-proof designs. What this Book will help me do Effectively plan and architect scalable data solutions leveraging cloud-first strategies. Master agile processes tailored to data engineering for improved project outcomes. Implement secure, efficient, and reliable data pipelines optimized for analytics and AI. Apply real-world design patterns and avoid common pitfalls in data flow and processing. Create future-ready data engineering solutions following industry-proven frameworks. Author(s) Richard J. Schiller and David Larochelle are seasoned data engineering experts with decades of experience crafting efficient and secure cloud-based infrastructures. Their collaborative writing distills years of real-world expertise into practical advice aimed at helping engineers succeed in a rapidly evolving field. Who is it for? This book is ideal for data engineers, ETL specialists, and big data professionals seeking to enhance their knowledge in cloud-based solutions. Some familiarity with data engineering, ETL pipelines, and big data technologies is helpful. It suits those keen on mastering advanced practices, improving agility, and developing efficient data pipelines. Perfect for anyone looking to future-proof their skills in data engineering.

Data Engineering Central Podcast - 02

2024-10-04 · Data Engineering Central Podcast Listen

podcast_episode

Data Engineering LLM Rust SaaS

Welcome to the Data Engineering Central Podcast —— a no-holds-barred discussion on the Data Landscape. Welcome to Episode 02 In today’s episode, we will talk about the following topics from the Data Engineering perspective … * Using OpenAI’s o1 Model to do Data Engineering work * Lord Save us from more ETL tools * Rust for the small things * Hosted (SaaS) vs Build

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit dataengineeringcentral.substack.com/subscribe

Building Large Scale ETL Pipelines with Dask

2024-09-25 · PyData Paris 2024

talk

by Patrick Hoefler

Cloud Computing

Building scalable ETL pipelines and deploying them in the cloud can seem daunting. It shouldn't be. Leveraging proper technologies can make this process easy. We will discuss the whole process of developing a composable and scalable ETL pipeline centred around Dask that is fully built with Open Source tools and how we can deploy to the cloud.

Lead the GenAI Revolution With Automated Data Warehouse and Analytics Modernisation

2024-09-18 · Big Data LDN 2024

Face To Face

by Barry Lloyd

Analytics BI Cloud Computing DWH GenAI Hadoop

Overcome the limitations of your legacy data warehouse or BI systems and reap the benefits of a cloud-native stack with LeapLogic, Impetus’ automated cloud migration accelerator. Join our session to explore how LeapLogic’s end-to-end automated capabilities can fast-track and streamline the transformation of legacy data warehouse, ETL, Hadoop, analytics, and reporting workloads to the cloud. Gain actionable insights from real-world success stories of Fortune 500 enterprises that have successfully modernised their legacy workloads, positioning them at the forefront of the GenAI revolution.

Beyond Dashboards: Building Interactive Data Applications in Hours

2024-09-18 · Big Data LDN 2024

Face To Face

by Vojta Tuma (Keboola)

Accelerate your path from idea to proof of concept. Our platform empowers you to build working data application prototypes at lightning speed, revolutionizing how you bring your business ideas to life. Transcend the limitations of conventional dashboards. Craft data applications with embedded logic and bi-directional control through reverse ETL.

Beyond Dashboards: Building Interactive Data Applications in Hours

2024-09-18 · Exhibitors' Events - Auto created

Face To Face

by Vojta Tuma (Keboola) , Keboola , Field CTO

Transcend the limitations of conventional dashboards. Craft data applications with embedded logic and bi-directional ctrl with reverse ETL

Airflow Unleashed: Making Hundreds of Deployments A Day at Coinbase

2024-07-01 · Airflow Summit 2024

session

by Jianlong Zhong

Airflow

At Coinbase, Airflow is the backbone of ELT, supported by a vibrant community of over 500 developers. This vast engagement results in a continuous stream of enhancements, with hundreds of commits tested and released daily. However, this scale of development presents its own set of challenges, especially in deployment velocity. Traditional deployment methodologies proved inadequate, significantly impeding the productivity of our developers. Recognizing the critical need for a solution that matches our pace of innovation, we developed AirAgent: a bespoke, fully autonomous deployer designed specifically for Airflow. Capable of deploying updates hundreds of times a day on both staging and production environments, AirAgent has transformed our development lifecycle, enabling immediate iteration and drastically improving developer velocity. This talk aims to unveil the inner workings of AirAgent, highlighting its design principles, deployment strategies, and the challenges we overcame in its implementation. By sharing our journey, we hope to offer insights and strategies that can benefit others in the Airflow community, encouraging a shift towards a high-frequency deployment workflow.

Architecting Blockchain ETL Orchestration: Circle's Airflow Usecase

2024-07-01 · Airflow Summit 2024

session

by Nathaniel Rose (Circle)

Airflow Blockchain CI/CD

This talk focuses on exploring the implementation of Apache Airflow for Blockchain ETL orchestration, indexing, and the adoption of GitOps at Circle. IT will cover CICD tips, architectural choices for managing Blockchain data at scale, engineering practices to enable data scientists and some learnings from production.

Streamlining a Mortgage ETL Pipeline with Apache Airflow

2024-07-01 · Airflow Summit 2024

session

by Zhang Zhang , Jenny Gao

Airflow

At Bloomberg, it is our team’s responsibility to ensure the timely delivery to our clients worldwide of a vast dataset comprising approximately 5 billion data points on roughly 50 million loans and over 1.4 million securities, disclosed twice a month by three major government-sponsored mortgage entities. Ingesting this data so we can create and derive complex data structures to be consumed by our applications for our clients has been our biggest challenge. In this talk, we will discuss our transition from a manually-managed spreadsheet-based system to an automated centralized orchestration tool, and how Apache Airflow has helped make the process more transparent, predictable, and visible.

talk-data.com

ETL/ELT

Activity Trend

Top Events

Top Speakers

AWS re:Invent 2024 - Innovations in AWS analytics: Zero-ETL and data integrations (ANT348)

AWSreInvent #AWSreInvent2024

AWS re:Invent 2024 - Deep dive into Amazon Aurora and its innovations (DAT405)

AWSreInvent #AWSreInvent2024

AWS re:Invent 2024 - Accelerate value from data: Migrating from batch to stream processing (ANT324)

AWSreInvent #AWSreInvent2024

The Art of Database Selection and Evolution

Data Engineering with AWS Cookbook

Bridging Code and UI in Data Orchestration with Kestra

An Opinionated Look At End-to-end Code Only Analytical Workflows With Bruin

Coalesce 2024: How Docusign migrated legacy SQL or stored procedures to dbt models

Coalesce 2024: Generative AI driven near-real-time operational analytics with zero-ETL and dbt Cloud

Coalesce 2024: How Cox Automotive turbocharged data engineering with ELT

Coalesce 2024: How USAA moves quickly and manages vulnerabilities with dbt Cloud

Data Engineering Best Practices

Data Engineering Central Podcast - 02

Building Large Scale ETL Pipelines with Dask

Lead the GenAI Revolution With Automated Data Warehouse and Analytics Modernisation

Beyond Dashboards: Building Interactive Data Applications in Hours

Beyond Dashboards: Building Interactive Data Applications in Hours

Airflow Unleashed: Making Hundreds of Deployments A Day at Coinbase

Architecting Blockchain ETL Orchestration: Circle's Airflow Usecase

Streamlining a Mortgage ETL Pipeline with Apache Airflow