talk-data.com

Topic: AWS Glue

Tags: etl, data_catalog, aws

43 tagged activities, newest first

Activity trend, 2020-Q1 to 2026-Q1: peak of 10 per quarter

AWS re:Invent 2025 - Data Processing architectures for building AI solutions (ANT328)

Prepare to revolutionize your data infrastructure for the AI era with Amazon EMR, AWS Glue, and Amazon Athena. This session will guide you through leveraging these powerful AWS services to construct robust, scalable data architectures that empower AI solutions at scale. Gain insights into effective architectural strategies for data processing to build AI applications, optimizing for cost-efficiency and security. Explore architectural frameworks that underpin successful AI-driven data initiatives, and learn from field lessons on how to navigate modernization projects. Whether you’re starting your modernization journey or refining current setups, this session offers practical strategies to fast-track your organization towards achieving excellence in AI-powered data management.

Learn more: More AWS events: https://go.aws/3kss9CP

Subscribe: More AWS videos: http://bit.ly/2O3zS75 More AWS events videos: http://bit.ly/316g9t4

ABOUT AWS: Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts. AWS is the world's most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.

#AWSreInvent #AWSreInvent2025 #AWS

AWS re:Invent 2025 - Agentic data engineering with AWS Analytics MCP Servers (ANT335)

In this session, we will introduce AWS Analytics Model Context Protocol (MCP) Servers, including the Data Processing MCP Server and Amazon Redshift MCP Server, which enable agentic workflows across AWS Glue, Amazon EMR, Amazon Athena, and Amazon Redshift. You will learn how these open-source tools simplify complex analytics operations through natural language interactions with AI agents. We'll cover MCP server implementation strategies, real-world use cases, architectural patterns for deployment, and production best practices for building intelligent data engineering workflows that understand and orchestrate your analytics environment.

AWS re:Invent 2025 - Best practices for building Apache Iceberg based lakehouse architectures on AWS

Discover advanced strategies for implementing Apache Iceberg on AWS, focusing on Amazon S3 Tables and integration of the Iceberg REST Catalog with the lakehouse in Amazon SageMaker. We'll cover performance optimization techniques for Amazon Athena and Amazon Redshift queries, real-time processing using Apache Spark, and integration with Amazon EMR, AWS Glue, and Trino. Explore practical implementations of zero-ETL, change data capture (CDC) patterns, and medallion architecture. Gain hands-on expertise in implementing enterprise-grade lakehouse solutions with Iceberg on AWS.
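The abstract does not say how the session implements CDC, so the following is only a minimal sketch of the general pattern: replaying ordered insert, update, and delete events onto a keyed target table. The `ChangeEvent` type, its fields, and `apply_changes` are hypothetical names chosen for illustration, and an in-memory dict stands in for the target table. On an Iceberg table the same idea would more likely be expressed as a `MERGE INTO` statement in the query engine.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One captured change (hypothetical shape, for illustration only)."""
    op: str                               # "insert", "update", or "delete"
    key: int                              # primary key of the affected row
    row: Optional[Dict[str, Any]] = None  # new row image; None for deletes
    seq: int = 0                          # change sequence number from the source


def apply_changes(table: Dict[int, Dict[str, Any]],
                  events: Iterable[ChangeEvent]) -> Dict[int, Dict[str, Any]]:
    """Apply change events to a keyed table in sequence order."""
    # Sort by sequence number so that the latest change to a key wins.
    for event in sorted(events, key=lambda e: e.seq):
        if event.op == "delete":
            table.pop(event.key, None)
        elif event.op in ("insert", "update"):
            # Treat insert and update alike (an upsert).
            table[event.key] = dict(event.row or {})
        else:
            raise ValueError(f"unknown operation: {event.op}")
    return table


target = {1: {"name": "a"}}
changes = [
    ChangeEvent("update", 1, {"name": "b"}, seq=2),
    ChangeEvent("insert", 2, {"name": "c"}, seq=1),
    ChangeEvent("delete", 2, seq=3),
]
merged = apply_changes(dict(target), changes)
```

Treating inserts and updates as one upsert keeps the replay idempotent: applying the same batch twice yields the same table.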

AWS re:Invent 2025 - Innovations in AWS analytics: Data processing (ANT305)

Explore the latest advancements in AWS Analytics designed to transform your data processing landscape. This session unveils powerful new capabilities across key services, including Amazon EMR for scalable big data processing, AWS Glue for seamless data integration, Amazon Athena for optimized querying, and Amazon Managed Workflows for Apache Airflow (MWAA) for workflow orchestration. Discover how these innovations can supercharge performance, optimize costs, and streamline your data ecosystem. Whether you're looking to enhance scalability, improve data integration, accelerate queries, or refine workflow management, join us to gain actionable insights that will position your organization at the forefront of data processing innovation.

AWS re:Invent 2025 - Enterprise-scale ETL optimization for Apache Spark (ANT336)

Recent enhancements to Apache Spark on AWS Glue, Amazon EMR, and Amazon SageMaker improve the performance of large-scale data processing workloads. They include faster read and write throughput, accelerated processing of common file formats, and expanded Amazon S3 support through the S3A protocol for greater flexibility in write operations. In this session, we'll explore recent enhancements in Spark for distributed computation and in-memory storage to enable efficient data aggregation and job optimization. We'll also demonstrate how these innovations, combined with Spark's native capabilities, strengthen governance and encryption to help you optimize performance while maintaining control and compliance. Join us to learn how to build unified, secure, and high-performance ETL pipelines on AWS using Spark.

Best practice for leveraging Amazon Analytic Services + dbt

As organizations increasingly adopt modern data stacks, the combination of dbt and AWS Analytics services has emerged as a powerful pairing for analytics engineering at scale. This session will explore proven strategies and hard-learned lessons for optimizing this technology stack, using dbt-athena, dbt-redshift, and dbt-glue to deliver reliable, performant data transformations. We will also cover case studies, best practices, and modern lakehouse scenarios with Apache Iceberg and Amazon S3 Tables.

Operationalizing ML isn’t just about models — it’s about moving and engineering data. At Hopsworks, we built a composable AI pipeline builder (Brewer) based on two principles: Tasks and Data Sources. This lets users define workflows that automatically analyse, clean, create and update feature groups, without glue code or brittle scheduling logic.

In this talk, we’ll show how Brewer drives the automation of feature engineering, enabling reproducible, declarative pipelines that respond to changes in upstream data. We’ll explore how this fits into broader ML workflows, from ingestion to feature materialization, and how it integrates with warehouses, streams, and file-based systems.

This session explores the building blocks of next-generation data platforms, with a focus on framing the right questions to unlock innovation. We’ll showcase how AWS Glue, ETL pipelines, crawlers, and data catalogs can transform raw data into analytics-ready insights. Drawing on hands-on experience, we’ll share forward-thinking strategies, lessons learned, and emerging best practices to help you architect a data foundation that is intelligent, adaptable, and future-proof.

This is part two of the framework; if you missed part one, head to episode 175 and start there so you're all caught up. 

In this episode of Experiencing Data, I continue my deep dive into the MIRRR UX Framework for designing trustworthy agentic AI applications. Building on Part 1’s “Monitor” and “Interrupt,” I unpack the three R’s: Redirect, Rerun, and Rollback—and share practical strategies for data product managers and leaders tasked with creating AI systems people will actually trust and use. I explain human-centered approaches to thinking about automation and how to handle unexpected outcomes in agentic AI applications without losing user confidence. I am hoping this control framework will help you get more value out of your data while simultaneously creating value for the human stakeholders, users, and customers.

Highlights / Skip to:

Introducing the MIRRR UX Framework (1:08)
Designing for trust and user adoption, plus perspectives you should include when designing systems (2:31)
Monitor and interrupt controls let humans pause anything from a single AI task to the entire agent (3:17)
Explaining “redirection” using the example of claims adjusters working on insurance claims, so adjusters (users) can focus on important decisions (4:35)
Rerun controls let humans redo an agentic task after unexpected results, preventing errors and building trust in early AI rollouts (11:12)
Rerun vs. Redirect: the difference in the context of AI, using additional use cases from the insurance claim processing domain (12:07)
Empathy and user experience in AI adoption, and how the most useful insights come from directly observing users, not from analytics (18:28)
Thinking about agentic AI as glue for existing applications and workflows, or as a worker (27:35)

Quotes from Today’s Episode

"The value of AI isn’t just about technical capability, it’s based in large part on whether the end-users will actually trust and adopt it. If we don’t design for trust from the start, even the most advanced AI can fail to deliver value."

"In agentic AI, knowing when to automate is just as important as knowing what to automate. Smart product and design decisions mean sometimes holding back on full automation until the people, processes, and culture are ready for it."

"Sometimes the most valuable thing you can do is slow down, create checkpoints, and give people a chance to course-correct before the work goes too far in the wrong direction."

"Reruns and rollbacks shouldn’t be seen as failures, they’re essential safety mechanisms that protect both the integrity of the work and the trust of the humans in the loop. They give people the confidence to keep using the system, even when mistakes happen."

"You can’t measure trust in an AI system by counting logins or tracking clicks. True adoption comes from understanding the people using it, listening to them, observing their workflows, and learning what really builds or breaks their confidence."

"You’ll never learn the real reasons behind a team’s choices by only looking at analytics, you have to actually talk to them and watch them work."

"Labels matter, what you call a button or an action can shape how people interpret and trust what will happen when they click it."

Part 1: The MIRRR UX Framework for Designing Trustworthy Agentic AI Applications 

This session explores how to bring unit testing to SQL pipelines using Airflow. I’ll walk through the development of a SQL testing library that allows isolated testing of SQL logic by injecting mock data into base tables. To support this, we built a type system for AWS Glue tables using Pydantic, enabling schema validation and mock data generation. Over time, this type system also powered production data quality checks via a custom Airflow operator. Learn how this approach improves reliability, accelerates development, and scales testing across data workflows.
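The speaker's library is not shown here, so the following is only a rough sketch of the idea of testing SQL in isolation by injecting typed mock rows into base tables. It swaps in standard-library pieces: `dataclasses` stands in for the Pydantic type system, and an in-memory SQLite database stands in for the real query engine. `OrderRow`, `run_query_with_mocks`, and the `orders` table are hypothetical names.

```python
import sqlite3
from dataclasses import dataclass, fields


@dataclass
class OrderRow:
    """Typed schema for a hypothetical 'orders' base table."""
    order_id: int
    customer: str
    amount: float


# Mapping from Python field types to SQLite column types.
SQL_TYPES = {int: "INTEGER", str: "TEXT", float: "REAL"}


def create_and_load(conn, table, row_type, rows):
    """Create a table from the dataclass schema and insert validated rows."""
    cols = fields(row_type)
    ddl = ", ".join(f"{c.name} {SQL_TYPES[c.type]}" for c in cols)
    conn.execute(f"CREATE TABLE {table} ({ddl})")
    placeholders = ", ".join("?" for _ in cols)
    for row in rows:
        # Schema validation: reject mock rows whose values do not match the declared types.
        for c in cols:
            value = getattr(row, c.name)
            if not isinstance(value, c.type):
                raise TypeError(f"{table}.{c.name}: expected {c.type.__name__}, got {value!r}")
        conn.execute(
            f"INSERT INTO {table} VALUES ({placeholders})",
            tuple(getattr(row, c.name) for c in cols),
        )


def run_query_with_mocks(sql, mocks):
    """Run the SQL under test against mock base tables only."""
    conn = sqlite3.connect(":memory:")
    try:
        for table, (row_type, rows) in mocks.items():
            create_and_load(conn, table, row_type, rows)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


QUERY_UNDER_TEST = """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY customer
"""

mock_orders = [
    OrderRow(1, "ann", 10.0),
    OrderRow(2, "bob", 5.5),
    OrderRow(3, "ann", 2.5),
]
result = run_query_with_mocks(QUERY_UNDER_TEST, {"orders": (OrderRow, mock_orders)})
```

Because the schema lives in one typed class, the same definition can drive both mock data checks in tests and validation of real rows, which is the reuse the abstract describes for production data quality checks.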

Databricks + Apache Iceberg™: Managed and Foreign Tables in Unity Catalog

Unity Catalog support for Apache Iceberg™ brings open, interoperable table formats to the heart of the Databricks Lakehouse. In this session, we’ll introduce new capabilities that allow you to write Iceberg tables from any REST-compatible engine, apply fine-grained governance across all data, and unify access to external Iceberg catalogs like AWS Glue, Hive Metastore, and Snowflake Horizon. Learn how Databricks is eliminating data silos, simplifying performance with Predictive Optimization, and advancing a truly open lakehouse architecture with Delta and Iceberg side by side.

Cross-Cloud Data Mesh with Delta Sharing and UniForm in Mercedes-Benz

In this presentation, we'll show how we achieved a unified development experience for teams working on Mercedes-Benz Data Platforms in AWS and Azure. We will demonstrate how we implemented Azure to AWS and AWS to Azure data product sharing (using Delta Sharing and Cloud Tokens), integration with AWS Glue Iceberg tables through UniForm and automation to drive everything using Azure DevOps Pipelines and DABs. We will also show how to monitor and track cloud egress costs and how we present a consolidated view of all the data products and relevant cost information. The end goal is to show how customers can offer the same user experience to their engineers and not have to worry about which cloud or region the Data Product lives in. Instead, they can enroll in the data product through self-service and have it available to them in minutes, regardless of where it originates.

You shouldn’t have to sacrifice data governance just to leverage the tools your business needs. In this session, we will give practical tips on how you can cut through the data sprawl and get a unified view of your data estate in Unity Catalog without disrupting existing workloads. We will walk through how to set up federation with Glue, Hive Metastore, and other catalogs like Snowflake, and show you how powerful new tools help you adopt Databricks at your own pace with no downtime and full interoperability.

AWS re:Invent 2024 - Monitor and manage data quality (ANT343)

The quality of data powers business decisions that drive outcomes. Successful businesses run on trusted data that is reliable and accurate. Join this session to learn how to apply Amazon DataZone and AWS Glue to deliver data integrity and consistency through precise data transformation, data cataloging, data governance, and data lineage, as well as to set up data quality checks, automate validation processes, and manage metadata.

Learn more: AWS re:Invent: https://go.aws/reinvent. More AWS events: https://go.aws/3kss9CP

Subscribe: More AWS videos: http://bit.ly/2O3zS75 More AWS events videos: http://bit.ly/316g9t4

About AWS: Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts. AWS is the world's most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.

#AWSreInvent #AWSreInvent2024

AWS re:Invent 2024 - Empower your data journey with Amazon DataZone’s data lineage (ANT207-NEW)

Worried about using the right data for analysis? With the new OpenLineage-compatible data lineage feature in Amazon DataZone, you can now trace the origin, transformations, and usage of data in one easy view. Automate lineage capture from AWS Glue, Amazon Redshift, and more to gain deep insights into your data’s journey. Join this session to explore how this powerful feature helps data teams confidently understand and use data to drive business value.
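For readers unfamiliar with OpenLineage, the sketch below builds a run event with the core fields the OpenLineage specification defines (`eventType`, `eventTime`, `run`, `job`, `inputs`, `outputs`, `producer`, `schemaURL`). The job name, namespaces, dataset names, and producer URL are all hypothetical, and the exact `schemaURL` version is an assumption that should be checked against the current specification. How such an event is delivered to Amazon DataZone is not covered by the abstract and is left out.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(event_type, job_name, inputs, outputs, run_id=None):
    """Build an OpenLineage-style run event as a plain dict."""
    return {
        "eventType": event_type,  # e.g. "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id or str(uuid.uuid4())},
        # Hypothetical job namespace, for illustration only.
        "job": {"namespace": "example-glue-jobs", "name": job_name},
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
        # Hypothetical producer URI identifying the emitting code.
        "producer": "https://example.com/hypothetical-lineage-producer",
        # Assumed spec version; verify against the OpenLineage specification.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
    }


event = build_run_event(
    "COMPLETE",
    "orders_daily_load",
    inputs=[("s3://example-raw-bucket", "orders/")],
    outputs=[("example_catalog", "analytics.orders")],
)
print(json.dumps(event, indent=2))
```

Each event links one job run to the datasets it read and wrote; a lineage view is assembled by joining many such events on dataset identity.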

AWS re:Invent 2024 - AI-powered data integration and governance with Amazon Q Developer (ANT352-NEW)

Discover how the AI-driven capabilities of Amazon Q Developer streamline data integration across AWS services, such as AWS Glue, Amazon SageMaker Catalog, Amazon Redshift, Amazon SageMaker AI, and more. Learn how data engineers and ETL developers can build complex jobs, troubleshoot, and explore data using natural language through an intuitive chat interface in Amazon SageMaker Unified Studio. Join this session to see how Amazon Q Developer enhances productivity and accelerates workflows, transforming the way you handle data integration.

AWS re:Invent 2024 - Zero-ETL replication to Amazon SageMaker Lakehouse & Amazon Redshift (ANT353-NEW)

In today’s data-driven landscape, organizations rely on enterprise applications to manage critical business processes. However, extracting and integrating this data into data warehouses and data lakes can be complex. This session explores a new zero-ETL capability that simplifies ingesting data to Amazon SageMaker Lakehouse and Amazon Redshift via AWS Glue from enterprise applications such as Salesforce, ServiceNow, and Zendesk. See how zero-ETL automates the extract and load process, expanding your analytics and machine learning solutions with valuable SaaS data.

AWS re:Invent 2024 - Explore what’s new in analytics and governance (ANT303)

Join this session to explore the latest data governance innovations and features in AWS analytics. Our experts guide you through the latest innovations in Amazon DataZone, AWS Lake Formation, and AWS Glue that are helping organizations establish robust data governance frameworks and maintain compliance standards.

AWS re:Invent 2024 - Solving different data ingestion use cases with AWS (ANT330)

Ingesting data is typically the first step in building your data pipelines. The growing landscape of data types like unstructured data, incremental data, and open table formats such as Apache Iceberg makes it all the more critical to build durable data pipelines, land the data immediately, apply the desired schema structure, and provide quality outputs for different types of use cases. Join this session to explore specific solutions that can help solve for different data ingestion challenges. Learn about the robust architectures and key strategies for efficiently ingesting and processing data with services like AWS Glue, Amazon Kinesis, Amazon Redshift, and Amazon OpenSearch Service.

AWS re:Invent 2024 - Demystify and democratize access to your data with a business catalog (ANT202)

Understanding your data in context means that all users can discover and comprehend the meaning of their data so they can use it confidently to drive business value. With a centralized data catalog, data can be found easily, data quality can be quantified and tracked with lineage, access permissions can be requested and provisioned, and data can be used to make business decisions. In this session, learn how Amazon DataZone, AWS Glue Data Catalog, and AWS Lake Formation help you build a catalog accessible to all of your data marketplace users.
