AWS re:Invent 2025 - Data Processing architectures for building AI solutions (ANT328)

2025-12-08 · AWS re:Invent 2024 Watch

video

Agile/Scrum AI/ML Athena AWS AWS Glue Cloud Computing Data Management Cyber Security

Prepare to revolutionize your data infrastructure for the AI era with Amazon EMR, AWS Glue, and Amazon Athena. This session will guide you through leveraging these powerful AWS services to construct robust, scalable data architectures that empower AI solutions at scale. Gain insights into effective architectural strategies for data processing to build AI applications, optimizing for cost-efficiency and security. Explore architectural frameworks that underpin successful AI-driven data initiatives, and learn from field lessons on how to navigate modernization projects. Whether you’re starting your modernization journey or refining current setups, this session offers practical strategies to fast-track your organization towards achieving excellence in AI-powered data management.

Learn more: More AWS events: https://go.aws/3kss9CP

Subscribe: More AWS videos: http://bit.ly/2O3zS75 More AWS events videos: http://bit.ly/316g9t4

ABOUT AWS: Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts. AWS is the world's most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.

#AWSreInvent #AWSreInvent2025 #AWS

AWS re:Invent 2025 - Agentic data engineering with AWS Analytics MCP Servers (ANT335)

2025-12-07 · AWS re:Invent 2024 Watch

video

Agile/Scrum AI/ML Analytics Athena AWS AWS Glue Cloud Computing Data Engineering Redshift

In this session, we will introduce AWS Analytics Model Context Protocol (MCP) Servers, including the Data Processing MCP Server and Amazon Redshift MCP Server, which enable agentic workflows across AWS Glue, Amazon EMR, Amazon Athena, and Amazon Redshift. You will learn how these open-source tools simplify complex analytics operations through natural language interactions with AI agents. We'll cover MCP server implementation strategies, real-world use cases, architectural patterns for deployment, and production best practices for building intelligent data engineering workflows that understand and orchestrate your analytics environment.
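As a rough illustration of what an agent sends to an MCP server, the sketch below builds a `tools/call` request in the JSON-RPC 2.0 shape that the Model Context Protocol uses. The tool name `run_athena_query` and its `sql` argument are hypothetical placeholders, not the actual tool names exposed by the AWS Analytics MCP Servers; check each server's tool listing for the real ones.

```python
import json
from itertools import count

# Monotonically increasing request ids, as JSON-RPC expects a unique id per request.
_request_ids = count(1)


def build_tool_call(tool_name: str, arguments: dict) -> dict:
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical tool name and arguments, used only for illustration.
request = build_tool_call("run_athena_query", {"sql": "SELECT 1"})
print(json.dumps(request, indent=2))
```

In practice an MCP client library would construct and transport this message for you; the sketch only shows the payload shape an agent's natural-language request is eventually translated into.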

AWS re:Invent 2025 - iTTi's Cross-Company Data Mesh Blueprint with Amazon SageMaker (ANT342)

2025-12-07 · AWS re:Invent 2024 Watch

video

Agile/Scrum AWS Cloud Computing Data Lake Amazon SageMaker

This session shares the journey of implementing a hybrid Data Mesh architecture within a multi-company holding, balancing centralization and decentralization needs. We will cover how our hybrid approach leverages Amazon EMR on EKS for data lake ingestion and Amazon SageMaker to enable self-service data discovery, data product subscription, and consumption, allowing companies within the group to autonomously explore, access, and utilize data products while maintaining centralized governance. Through ITTI - Grupo Vázquez's real-world experience, attendees will learn how this hybrid data mesh architecture successfully addresses diverse data domains, varying governance requirements, and rapid value delivery needs.

AWS re:Invent 2025 - Best practices for building Apache Iceberg based lakehouse architectures on AWS

2025-12-06 · AWS re:Invent 2024 Watch

video

Agile/Scrum Athena AWS AWS Glue Cloud Computing Data Lakehouse ETL/ELT Iceberg Redshift S3 Amazon SageMaker Spark +1 more

Discover advanced strategies for implementing Apache Iceberg on AWS, focusing on Amazon S3 Tables and integration of the Iceberg REST Catalog with the lakehouse in Amazon SageMaker. We'll cover performance optimization techniques for Amazon Athena and Amazon Redshift queries, real-time processing using Apache Spark, and integration with Amazon EMR, AWS Glue, and Trino. Explore practical implementations of zero-ETL, change data capture (CDC) patterns, and medallion architecture. Gain hands-on expertise in implementing enterprise-grade lakehouse solutions with Iceberg on AWS.
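To make the CDC pattern mentioned above concrete, here is a minimal pure-Python sketch of applying ordered change events to a keyed table. It is not the Iceberg API and not the session's implementation; the event fields (`seq`, `op`, `key`, `row`) are assumptions chosen for illustration. On an Iceberg table the same logic would typically be expressed as a `MERGE INTO` statement in the query engine.

```python
from typing import Any

Row = dict[str, Any]


def apply_cdc(table: dict[int, Row], events: list[Row]) -> dict[int, Row]:
    """Apply insert/update/delete change events to a table keyed by primary key.

    Events are applied in `seq` order so that late-arriving records do not
    overwrite newer state. The input table is left unmodified.
    """
    result = dict(table)
    for event in sorted(events, key=lambda e: e["seq"]):
        key = event["key"]
        if event["op"] == "delete":
            result.pop(key, None)
        elif event["op"] in ("insert", "update"):
            # Merge changed columns over the existing row (upsert semantics).
            result[key] = {**result.get(key, {}), **event["row"]}
        else:
            raise ValueError(f"unknown CDC operation: {event['op']!r}")
    return result


# Hypothetical "silver" table and a batch of out-of-order change events.
silver = {
    1: {"name": "ada", "city": "London"},
    2: {"name": "bob", "city": "Paris"},
}
events = [
    {"seq": 3, "op": "delete", "key": 2},
    {"seq": 1, "op": "update", "key": 1, "row": {"city": "Cambridge"}},
    {"seq": 2, "op": "insert", "key": 3, "row": {"name": "cy", "city": "Oslo"}},
]
merged = apply_cdc(silver, events)
print(merged)
```

In a medallion layout, a step like this usually sits between the raw (bronze) change log and the cleaned (silver) table.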

AWS re:Invent 2025 - Innovations in AWS analytics: Data processing (ANT305)

2025-12-04 · AWS re:Invent 2024 Watch

video

Agile/Scrum Airflow Analytics Athena AWS AWS Glue Big Data Cloud Computing

Explore the latest advancements in AWS Analytics designed to transform your data processing landscape. This session unveils powerful new capabilities across key services, including Amazon EMR for scalable big data processing, AWS Glue for seamless data integration, Amazon Athena for optimized querying, and Amazon Managed Workflows for Apache Airflow (MWAA) for workflow orchestration. Discover how these innovations can supercharge performance, optimize costs, and streamline your data ecosystem. Whether you're looking to enhance scalability, improve data integration, accelerate queries, or refine workflow management, join us to gain actionable insights that will position your organization at the forefront of data processing innovation.

AWS re:Invent 2025 - Enterprise-scale ETL optimization for Apache Spark (ANT336)

2025-12-02 · AWS re:Invent 2024 Watch

video

Agile/Scrum AWS AWS Glue Cloud Computing ETL/ELT S3 Amazon SageMaker Spark

Recent enhancements to Apache Spark on AWS Glue, Amazon EMR, and Amazon SageMaker optimize large-scale data processing workloads. These include faster read and write throughput, accelerated processing of common file formats, and expanded Amazon S3 support through the S3A protocol for greater flexibility in write operations. In this session, we'll explore recent enhancements in Spark for distributed computation and in-memory storage to enable efficient data aggregation and job optimization. We'll also demonstrate how these innovations, combined with Spark's native capabilities, strengthen governance and encryption to help you optimize performance while maintaining control and compliance. Join us to learn how to build unified, secure, and high-performance ETL pipelines on AWS using Spark.

Healthcare Interoperability: End-to-End Streaming FHIR Pipelines With Databricks & Redox

2025-06-12 · Data + AI Summit 2025 Watch

talk

by Tim Kessler (Redox, Inc.) , Matthew Giglia (Databricks)

AI/ML API BI Data Lakehouse Databricks Delta ETL/ELT SQL Data Streaming

Direct integration between Redox and Databricks can streamline your interoperability workflows, from responding in record time to preauthorization requests to letting attending physicians know, in near real time from ADTs, about a change in risk for sepsis and readmission. Data engineers will learn how to create fully streaming ETL pipelines for ingesting, parsing, and acting on insights from Redox FHIR bundles delivered directly to Unity Catalog volumes. Once the data is available in the Lakehouse, AI/BI Dashboards and Agentic Frameworks help write FHIR messages back to Redox for direct push down to EMR systems. Parsing FHIR bundle resources has never been easier with SQL combined with the new VARIANT data type in Delta and streaming table creation against Serverless DBSQL Warehouses. We'll also use the Databricks accelerators dbignite and redoxwrite for writing and posting FHIR bundles back to Redox-integrated EMRs, and we'll extend AI/BI with Unity Catalog SQL UDFs and the Redox API for use in Genie.

Iceberg Table Format Adoption and Unified Metadata Catalog Implementation in Lakehouse Platform

2025-06-11 · Data + AI Summit 2025 Watch

talk

by Ruotian Wang (Doordash) , Sergey Zavgorodni (DoorDash)

Data Lake Data Lakehouse Databricks DWH Iceberg Snowflake Trino

The DoorDash Data organization is actively adopting the lakehouse paradigm. This presentation describes a methodology for migrating classic data warehouse and data lake platforms to a unified lakehouse solution. The objectives of this effort include:

Elimination of excessive data movement.
Seamless integration and consolidation of the query engine layers, including Snowflake, Databricks, EMR, and Trino.
Query performance optimization.
Abstracting away the complexity of underlying storage layers and table formats.
A strategic and justified decision on the unified metadata catalog used across various compute platforms.

Toyota: Maximizing Business Value and Ensuring Data Privacy with Databricks in Connected Vehicles

2025-06-10 · Data + AI Summit 2025

talk

by Yoshihiro Oe (TOYOTA MOTOR CORPORATION) , Satoshi Kuramitsu (Databricks)

Databricks Delta

As global data privacy regulations tighten, balancing user data protection with maximizing its business value is crucial. This presentation explores how integrating Databricks into our connected-vehicle data platform enhances both governance and business outcomes. We’ll highlight a case where migrating from EMR to Databricks improved deletion performance and cut costs by 99% with Delta Lake. This shift not only ensures compliance with data-privacy regulations but also maximizes the potential of connected-vehicle data. We are developing a platform that balances compliance with business value and sets a global standard for data usage, inviting partners to join us in building a secure, efficient mobility ecosystem.

AWS re:Invent 2024 - Customer Keynote Autodesk

2025-01-06 · AWS re:Invent 2024 Watch

video

by Raji Arasu (Autodesk)

Agile/Scrum AI/ML AWS Cloud Computing DynamoDB ELK GenAI Fabric Amazon SageMaker

Design software pioneer Autodesk is transforming computer-aided design (CAD) by harnessing generative AI and Amazon Web Services (AWS). The company is developing advanced AI foundation models, like "Project Bernini," which can generate precise 2D and 3D geometric designs based on physical principles.

By utilizing AWS technologies such as Amazon DynamoDB, Elastic MapReduce (EMR), Amazon SageMaker, and Elastic Fabric Adapter, Autodesk has significantly enhanced its AI development process. These innovations have halved foundation model development time and increased AI productivity by 30%.

Learn more about AWS events: https://go.aws/events

AWS AI and Data Conference Ireland 2024 | AWS Events

2024-12-23 · AWS re:Invent 2024 Watch

video

by Martin Holste (Trellix) , Rick Sears (AWS) , Barry Morris (AWS) , Eddie Wilson (Ryanair)

Agile/Scrum AI/ML Analytics Athena AWS Cloud Computing Data Analytics GenAI

The AWS AI and Data Conference 2024 delivered practical insights on Generative AI, Machine Learning, and Data Analytics. Attendees learned how organizations are using these technologies to scale operations and meet customer needs. AWS experts and customers shared real-world applications across industries. The event covered the latest trends and best practices, including hands-on experience with AWS tools like Amazon Bedrock for AI development. Keynote speakers included Eddie Wilson (CEO, Ryanair), Martin Holste (CTO for Cloud and AI, Trellix), Rick Sears (GM, Amazon Athena, EMR, and Lake Formation, AWS), and Barry Morris (GM, Purpose Built Databases, AWS). Whether new to AI or seasoned professionals, participants gained actionable knowledge to drive innovation in their organizations.

Sign up now for the AWS AI and Data Conference 2025 and stay at the forefront of AI and data innovation: https://go.aws/4gNtNa6

AWS re:Invent 2024 - Cost-effective data processing with Amazon EMR (ANT344)

2024-12-08 · AWS re:Invent 2024 Watch

video

by Aaron Feng (Roblox) , Matthew Liem (AWS)

Agile/Scrum AWS Amazon EC2 Big Data Cloud Computing

Unlock the full potential of your big data environment during this in-depth session on Amazon EMR and cost optimization strategies, tailored for data engineers, data architects, and cloud architects. Gain a comprehensive understanding of various cost optimization strategies, including cluster rightsizing, using Amazon EC2 Spot Instances, and implementing managed scaling. Learn about the key differences between Amazon EMR deployment models and how to choose the best option that aligns with your organization’s specific requirements, constraints, and technical capabilities. Leave with actionable insights and practical strategies to enhance your big data workflows and achieve significant cost savings.
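As a back-of-the-envelope illustration of why mixing Spot capacity into a cluster matters, the sketch below estimates a blended hourly compute cost. All numbers (node count, hourly price, Spot share, Spot discount) are hypothetical and not taken from the session or from AWS pricing, and the model ignores the per-instance Amazon EMR charge, storage, and the cost of Spot interruptions.

```python
def blended_hourly_cost(
    nodes: int,
    on_demand_price: float,
    spot_fraction: float,
    spot_discount: float,
) -> float:
    """Estimate hourly EC2 cost for a cluster that mixes On-Demand and Spot nodes.

    spot_fraction: share of nodes run on Spot (0.0 to 1.0).
    spot_discount: assumed Spot discount versus On-Demand (0.0 to 1.0).
    """
    if not 0.0 <= spot_fraction <= 1.0 or not 0.0 <= spot_discount <= 1.0:
        raise ValueError("spot_fraction and spot_discount must be between 0 and 1")
    spot_nodes = nodes * spot_fraction
    on_demand_nodes = nodes - spot_nodes
    spot_price = on_demand_price * (1.0 - spot_discount)
    return on_demand_nodes * on_demand_price + spot_nodes * spot_price


# Hypothetical inputs: 10 nodes at $0.50/hour, 60% on Spot at a 70% discount.
baseline = blended_hourly_cost(10, 0.50, spot_fraction=0.0, spot_discount=0.0)
mixed = blended_hourly_cost(10, 0.50, spot_fraction=0.6, spot_discount=0.7)
savings = 1.0 - mixed / baseline

print(f"all On-Demand: ${baseline:.2f}/hour")
print(f"with Spot mix: ${mixed:.2f}/hour")
print(f"estimated savings: {savings:.0%}")
```

Rightsizing and managed scaling act on the `nodes` term instead of the price term, so the two kinds of savings multiply rather than overlap.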

Learn more: AWS re:Invent: https://go.aws/reinvent. More AWS events: https://go.aws/3kss9CP

AWS re:Invent 2024 - Innovations in AWS analytics: Data processing (ANT346)

2024-12-05 · AWS re:Invent 2024 Watch

video

by William Vambenepe (AWS) , Kinshuk Pahare (AWS) , Craig Suchanec (Bridgewater Associates)

Agile/Scrum Analytics Athena AWS AWS Glue Big Data Cloud Computing

Join this session for an in-depth look at new capabilities to optimize data processing with AWS analytics services. Learn more about using Amazon EMR for scalable big data processing, AWS Glue for seamless data integration, Amazon Athena for powerful querying, and Amazon MWAA for supporting complex workflows. Whether you’re looking to improve performance, reduce costs, or streamline your data pipeline, this session provides valuable insights into the latest functionality and tools needed to enhance your data processing capabilities.

Data Engineering with AWS Cookbook

2024-11-29 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Viquar Khan , Trâm Ngọc Phạm , Gonzalo Herreros González , Huda Nofal

Analytics Athena AWS AWS Glue Big Data Cloud Computing Data Engineering Data Lake ETL/ELT QuickSight Redshift data +1 more

Data Engineering with AWS Cookbook serves as a comprehensive practical guide for building scalable and efficient data engineering solutions using AWS. With this book, you will master implementing data lakes, orchestrating data pipelines, and creating serving layers using AWS's robust services, such as Glue, EMR, Redshift, and Athena. With hands-on exercises and practical recipes, you will enhance your AWS-based data engineering projects.

What this Book will help me do

Gain the skills to design centralized data lake solutions and manage them securely at scale.
Develop expertise in crafting data pipelines with AWS's ETL technologies like Glue and EMR.
Learn to implement and automate governance, orchestration, and monitoring for data platforms.
Build high-performance data serving layers using AWS analytics tools like Redshift and QuickSight.
Effectively plan and execute data migrations to AWS from on-premises infrastructure.

Author(s)

Trâm Ngọc Phạm, Gonzalo Herreros González, Viquar Khan, and Huda Nofal bring together years of collective experience in data engineering and AWS cloud solutions. Each author's deep knowledge and passion for cloud technology have shaped this book into a valuable resource, geared towards practical learning and real-world application. Their approach ensures readers are not just learning but building tangible, impactful solutions.

Who is it for?

This book is geared towards data engineers and big data professionals engaged in or transitioning to cloud-based environments, specifically on AWS. Ideal readers are those looking to optimize workflows and master AWS tools to create scalable, efficient solutions. The content assumes a basic familiarity with AWS concepts like IAM roles and a command-line interface, ensuring all examples are accessible yet meaningful for those seeking advancement in AWS data engineering.

Learn how Trellix scales security operations with Amazon Bedrock, AWS EMR | AWS Events

2024-07-29 · AWS re:Invent 2024 Watch

video

by Martin Holste (Trellix) , Rick Sears (AWS)

Agile/Scrum AI/ML Athena AWS Cloud Computing GenAI Cyber Security

Dive into this AWS session with Rick Sears, General Manager of Amazon Athena, EMR, and Lake Formation at AWS, and Martin Holste, CTO Cloud and AI at Trellix. Explore how Trellix uses Amazon Bedrock and AWS EMR to revolutionize security operations. Learn how generative AI and comprehensive data strategies enhance threat detection and automate security processes, driving a new era of efficiency and protection. Discover practical AI applications and real-world examples, and get ready to accelerate your AI journey with AWS.

Speakers: Martin Holste, CTO Cloud and AI, Trellix Rick Sears, General Manager of Amazon Athena, EMR, and Lake Formation, Amazon Web Services

Learn more: https://go.aws/3x2mha0 Learn more about AWS events: https://go.aws/3kss9CP

Building an end to end data strategy for analytics and generative AI | AWS Events

2024-07-22 · AWS re:Invent 2024 Watch

video

by Mark Greville (Workhuman) , Kamal Sampathkumar (Workhuman) , Rick Sears (AWS)

Agile/Scrum AI/ML Analytics API Athena AWS Cloud Computing Data Management GenAI

In this session, Rick Sears, General Manager of Amazon Athena, EMR, and Lake Formation at AWS, explores how generative AI is revolutionizing businesses and the critical role data plays in this transformation. He discusses the evolution of AI models and the importance of a comprehensive data management strategy encompassing availability, quality, and protection of data.

Mark Greville, Vice President of Architecture at Workhuman, shares insights from Workhuman's journey in building a robust cloud-based data strategy, emphasizing the significance of storytelling, demonstrating value, and gaining executive support.

Kamal Sampathkumar, Senior Manager of Data Architecture at Workhuman, delves into the technical aspects, detailing the architecture of Workhuman's data platform and showcasing solutions like Data API and self-service reporting that deliver substantial value to customers.

Learn more at: https://go.aws/3x2mha0

Strategies For A Successful Data Platform Migration

2023-07-31 · Data Engineering Podcast Listen

podcast_episode

by Rob Goretsky , Gleb Mezhanskiy (Datafold) , Tobias Macey

AI/ML Airflow Analytics BigQuery Dagster Data Engineering Data Management Data Science Datafold dbt ELK GitHub +9 more

Summary

All software systems are in a constant state of evolution. This makes it impossible to select a truly future-proof technology stack for your data platform, making an eventual migration inevitable. In this episode Gleb Mezhanskiy and Rob Goretsky share their experiences leading various data platform migrations, and the hard-won lessons that they learned so that you don't have to.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack

Modern data teams are using Hex to 10x their data impact. Hex combines a notebook style UI with an interactive report builder. This allows data teams to both dive deep to find insights and then share their work in an easy-to-read format to the whole org. In Hex you can use SQL, Python, R, and no-code visualization together to explore, transform, and model data. Hex also has AI built directly into the workflow to help you generate, edit, explain and document your code. The best data teams in the world such as the ones at Notion, AngelList, and Anthropic use Hex for ad hoc investigations, creating machine learning models, and building operational dashboards for the rest of their company. Hex makes it easy for data analysts and data scientists to collaborate together and produce work that has an impact. Make your data team unstoppable with Hex. Sign up today at dataengineeringpodcast.com/hex to get a 30-day free trial for your team!

Your host is Tobias Macey and today I'm interviewing Gleb Mezhanskiy and Rob Goretsky about when and how to think about migrating your data stack.

Interview

Introduction
How did you get involved in the area of data management?
A migration can be anything from a minor task to a major undertaking. Can you start by describing what constitutes a migration for the purposes of this conversation?
Is it possible to completely avoid having to invest in a migration?
What are the signals that point to the need for a migration?
What are some of the sources of cost that need to be accounted for when considering a migration? (both in terms of doing one, and the costs of not doing one)
What are some signals that a migration is not the right solution for a perceived problem?
Once the decision has been made that a migration is necessary, what are the questions that the team should be asking to determine the technologies to move to and the sequencing of execution?
What are the preceding tasks that should be completed before starting the migration to ensure there is no breakage downstream of the changing component(s)?
What are some of the ways that a migration effort might fail?
What are the major pitfalls that teams need to be aware of as they work through a data platform migration?
What are the opportunities for automation during the migration process?
What are the most interesting, innovative, or unexpected ways that you have seen teams approach a platform migration?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data platform migrations?
What are some ways that the technologies and patterns that we use can be evolved to reduce the cost/impact/need for migrations?

Contact Info

Gleb

LinkedIn @glebmm on Twitter

Rob

LinkedIn RobGoretsky on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.

Links

Datafold

Podcast Episode

Informatica Airflow Snowflake

Podcast Episode

Redshift Eventbrite Teradata BigQuery Trino EMR == Elastic Map-Reduce Shadow IT

Podcast Episode

Mode Analytics Looker Sunk Cost Fallacy data-diff

Podcast Episode

SQLGlot Dagster dbt

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Processing Delta Lake Tables on AWS Using AWS Glue, Amazon Athena, and Amazon Redshift

2023-07-26 · Databricks DATA + AI Summit 2023 Watch

video

by Noritaka Sekiyama (Amazon Web Services (AWS)) , Akira Ajisaka

Athena AWS AWS Glue Amazon RDS Cloud Computing Data Lake Data Lakehouse Databricks Delta DWH DynamoDB MongoDB +3 more

Delta Lake is an open source project that helps implement modern data lake architectures commonly built on cloud storage. With Delta Lake, you can achieve ACID transactions, time travel queries, CDC, and other common use cases on the cloud.

There are many use cases for Delta tables on AWS. AWS has invested heavily in this technology, and Delta Lake is now available with multiple AWS services, such as AWS Glue Spark jobs, Amazon EMR, Amazon Athena, and Amazon Redshift Spectrum. AWS Glue is a serverless, scalable data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources. With AWS Glue, you can easily ingest data from multiple data sources such as on-prem databases, Amazon RDS, DynamoDB, and MongoDB into Delta Lake on Amazon S3, even without expertise in coding.

This session will demonstrate how to get started with processing Delta Lake tables on Amazon S3 using AWS Glue, and querying them from Amazon Athena and Amazon Redshift. The session also covers recent AWS service updates related to Delta Lake.

Talk by: Noritaka Sekiyama and Akira Ajisaka

Here’s more to explore: Why the Data Lakehouse Is Your next Data Warehouse: https://dbricks.co/3Pt5unq Lakehouse Fundamentals Training: https://dbricks.co/44ancQs

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Deep Dive Into Grammarly's Data Platform

2023-07-25 · Databricks DATA + AI Summit 2023 Watch

video

by Christopher Locklin , Faraz Yasrobi

AWS Data Lakehouse Databricks Marketing

Grammarly helps 30 million people and 50,000 teams to communicate more effectively. Using the Databricks Lakehouse Platform, we can rapidly ingest, transform, aggregate, and query complex data sets from an ecosystem of sources, all governed by Unity Catalog. This session will overview Grammarly’s data platform and the decisions that shaped the implementation. We will dive deep into some architectural challenges the Grammarly Data Platform team overcame as we developed a self-service framework for incremental event processing.

Our investment in the lakehouse and Unity Catalog has dramatically improved the speed of our data value chain: making 5 billion events (ingested, aggregated, de-identified, and governed) available to stakeholders (data scientists, business analysts, sales, marketing) and downstream services (feature store, reporting/dashboards, customer support, operations) within 15. As a result, we have improved our query cost performance (110% faster at 10% of the cost) compared to our legacy system on AWS EMR.

I will share architecture diagrams, their implications at scale, code samples, and problems solved and to be solved in a technology-focused discussion about Grammarly’s iterative lakehouse data platform.

Talk by: Faraz Yasrobi and Christopher Locklin


Migrate Apache Oozie Workflows to Airflow and Run with Amazon EMR

2023-07-01 · Airflow Summit 2023

session

by Dipankar Ghosal

Airflow GitHub

Learn how to convert Oozie workflows into Airflow DAGs and run them on Amazon EMR. The utility supports Airflow 2.4.3 and is built on top of https://github.com/GoogleCloudPlatform/oozie-to-airflow
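As a rough illustration of what such a conversion involves, the sketch below reads a tiny Oozie workflow definition and derives the task ordering an Airflow DAG would need. It is not the utility's implementation: the sample workflow, the `dependency_edges` helper, and the choice to follow only `<ok to="...">` transitions between actions are assumptions made for this example, and it uses only the Python standard library.

```python
import xml.etree.ElementTree as ET

# A made-up two-step Oozie workflow, used only for illustration.
WORKFLOW_XML = """
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
  <start to="extract"/>
  <action name="extract">
    <shell xmlns="uri:oozie:shell-action:0.2"><exec>extract.sh</exec></shell>
    <ok to="transform"/>
    <error to="fail"/>
  </action>
  <action name="transform">
    <shell xmlns="uri:oozie:shell-action:0.2"><exec>transform.sh</exec></shell>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail"><message>Workflow failed</message></kill>
  <end name="end"/>
</workflow-app>
"""

NS = {"wf": "uri:oozie:workflow:0.5"}


def dependency_edges(xml_text):
    """Return (upstream, downstream) pairs for action-to-action <ok> transitions."""
    root = ET.fromstring(xml_text.strip())
    actions = root.findall("wf:action", NS)
    action_names = {a.get("name") for a in actions}
    edges = []
    for action in actions:
        ok = action.find("wf:ok", NS)
        # Skip transitions to control nodes such as <end> or <kill>.
        if ok is not None and ok.get("to") in action_names:
            edges.append((action.get("name"), ok.get("to")))
    return edges


if __name__ == "__main__":
    # Print the ordering in Airflow's "upstream >> downstream" notation.
    for upstream, downstream in dependency_edges(WORKFLOW_XML):
        print(f"{upstream} >> {downstream}")
```

Each pair corresponds to one dependency between tasks in the resulting DAG. A real converter also has to map action types to operators and handle forks, joins, and decision nodes, which this sketch leaves out.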

