talk-data.com talk-data.com

Topic

ETL/ELT

ETL/ELT

data_integration data_transformation data_loading

108

tagged

Activity Trend

40 peak/qtr
2020-Q1 2026-Q1

Activities

108 activities · Newest first

AWS re:Invent 2025 - Best practices for building Apache Iceberg based lakehouse architectures on AWS

Discover advanced strategies for implementing Apache Iceberg on AWS, focusing on Amazon S3 Tables and integration of Iceberg Rest Catalog with the lakehouse in Amazon SageMaker. We'll cover performance optimization techniques for Amazon Athena and Amazon Redshift queries, real-time processing using Apache Spark, and integration with Amazon EMR, AWS Glue, and Trino. Explore practical implementations of zero-ETL, change data capture (CDC) patterns, and medallion architecture. Gain hands-on expertise in implementing enterprise-grade lakehouse solutions with Iceberg on AWS.

Learn more: More AWS events: https://go.aws/3kss9CP

Subscribe: More AWS videos: http://bit.ly/2O3zS75 More AWS events videos: http://bit.ly/316g9t4

ABOUT AWS: Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts. AWS is the world's most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.

AWSreInvent #AWSreInvent2025 #AWS

AWS re:Invent 2025 - Deep dive into databases zero-ETL integrations (DAT445)

In this session, learn how AWS zero-ETL integrations remove the need to manage complex data movement pipelines across multiple source database engines and targets so data engineers, architects, & DBAs can eliminate maintenance overhead while ensuring near real-time data availability for analytics & ML workloads. Examine the underlying architecture and how it works for the supported zero-ETL integrations between Amazon Aurora, Amazon DynamoDB, and Amazon RDS sources to Amazon Redshift, Amazon SageMaker, and Amazon OpenSearch Service targets - all without traditional ETL complexity. Dive into the data movement options, tunable settings, and how to monitor ongoing data movement.

Learn more: More AWS events: https://go.aws/3kss9CP

Subscribe: More AWS videos: http://bit.ly/2O3zS75 More AWS events videos: http://bit.ly/316g9t4

ABOUT AWS: Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts. AWS is the world's most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.

AWSreInvent #AWSreInvent2025 #AWS

AWS re:Invent 2025 - Fast-track to insights: AWS-SAP data strategy (ANT333)

This lightning talk showcases AWS and SAP's innovative solution for enterprise data integration challenges. Learn how to access data between SAP and AWS environments, eliminating complex ETL pipelines while maintaining business context. In this talk, we will demonstrate how to enable zero-ETL integration between SAP and Amazon SageMaker so that you can reduce time spent building data pipelines and focus on running unified analytics and AI/ML on all your data.

Learn more: More AWS events: https://go.aws/3kss9CP

Subscribe: More AWS videos: http://bit.ly/2O3zS75 More AWS events videos: http://bit.ly/316g9t4

ABOUT AWS: Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts. AWS is the world's most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.

AWSreInvent #AWSreInvent2025 #AWS

AWS re:Invent 2025 - Deep dive into Amazon Aurora and its innovations (DAT441)

With an innovative architecture that decouples compute from storage and advanced features like Amazon Aurora Global Database and low-latency read replicas, Aurora reimagines what it means to be a relational database. Aurora is a built-for-the-cloud, serverless relational database service that delivers unparalleled performance and availability at global scale for MySQL, PostgreSQL, and DSQL. In this session, dive deep into new features – and Aurora’s most popular offerings including serverless, I/O-Optimized, zero-ETL integrations, MCP integration, and generative AI support for vector search and storage. Also learn about the groundbreaking Aurora DSQL engine.

Learn more: More AWS events: https://go.aws/3kss9CP

Subscribe: More AWS videos: http://bit.ly/2O3zS75 More AWS events videos: http://bit.ly/316g9t4

ABOUT AWS: Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts. AWS is the world's most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.

AWSreInvent #AWSreInvent2025 #AWS

AWS re:Invent 2025 - Universal data connectivity with ETL and SQL queries (ANT209)

Learn how AWS can help you with data integration and preparing data for analytics, machine learning (ML) and generative AI workloads. Explore new capabilities that enable your users to have controlled access to all relevant data, easily build and maintain scalable and resilient data pipelines, and enhance decision-making quality,all with exceptional price performance. See how zero-ETL and query federation can complement ETL and ELT data pipelines.

Learn more: More AWS events: https://go.aws/3kss9CP

Subscribe: More AWS videos: http://bit.ly/2O3zS75 More AWS events videos: http://bit.ly/316g9t4

ABOUT AWS: Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts. AWS is the world's most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.

AWSreInvent #AWSreInvent2025 #AWS

AWS re:Invent 2025 - Enterprise-scale ETL optimization for Apache Spark (ANT336)

Apache Spark on AWS Glue, Amazon EMR, and Amazon SageMaker enhances the optimization of large-scale data processing workloads. These include faster read and write throughput, accelerated processing of common file formats, and expanded Amazon S3 support through the S3A protocol for greater flexibility in write operations. In this session, we'll explore recent enhancements in Spark for distributed computation and in-memory storage to enable efficient data aggregation and job optimization. We'll also demonstrate how these innovations, combined with Spark's native capabilities, strengthen governance and encryption to help you optimize performance while maintaining control and compliance. Join us to learn how to build unified, secure, and high-performance ETL pipelines on AWS using Spark.

Learn more: More AWS events: https://go.aws/3kss9CP

Subscribe: More AWS videos: http://bit.ly/2O3zS75 More AWS events videos: http://bit.ly/316g9t4

ABOUT AWS: Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts. AWS is the world's most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.

AWSreInvent #AWSreInvent2025 #AWS

Connect to and manage any data, anywhere in Microsoft OneLake

Frontier firms aiming to build custom agents and foster data rich cultures need a single, unified access point for all their data that everyone can use. This session will explore how Microsoft OneLake can help you connect to any data, anywhere without ETL or data duplication. We'll also show how Fabric users and admins alike can manage and secure that data using tools like the OneLake catalog and OneLake security.

More than DataFrames: Data Pipelines with the Swiss Army Knife DuckDB

Most Python developers reach for Pandas or Polars when working with tabular data—but DuckDB offers a powerful alternative that’s more than just another DataFrame library. In this tutorial, you’ll learn how to use DuckDB as an in-process analytical database: building data pipelines, caching datasets, and running complex queries with SQL—all without leaving Python. We’ll cover common use cases like ETL, lightweight data orchestration, and interactive analytics workflows. You’ll leave with a solid mental model for using DuckDB effectively as the “SQLite for analytics.”

From Apache Airflow to Lakeflow Jobs: A Guide for Workflow Modernization

This is an overview of migrating from Apache Airflow to Lakeflow Jobs for modern data orchestration. It covers key differences, best practices and practical examples of transitioning from traditional Airflow DAGs orchestrating legacy systems to declarative, incremental ETL pipelines with Lakeflow. Attendees will gain actionable tips on how to improve efficiency, scalability and maintainability in their workflows.

Health Data, Delivered: How Lakeflow Declarative Pipelines Powers the HealthVerity Marketplace

Building scalable, reliable ETL pipelines is a challenge for organizations managing large, diverse data sources. Theseus, our custom ETL framework, streamlines data ingestion and transformation by fully leveraging Databricks-native capabilities, including Lakeflow Declarative Pipelines, auto loader and event-driven orchestration. By decoupling supplier logic and implementing structured bronze, silver, and gold layers, Theseus ensures high-performance, fault-tolerant data processing with minimal operational overhead. The result? Faster time-to-value, simplified governance and improved data quality — all within a declarative framework that reduces engineering effort. In this session, we’ll explore how Theseus automates complex data workflows, optimizes cost efficiency and enhances scalability, showcasing how Databricks-native tools drive real business outcomes.

Sponsored by: Soda Data Inc. | Clean Energy, Clean Data: How Data Quality Powers Decarbonization

Drawing on BDO Canada’s deep expertise in the electricity sector, this session explores how clean energy innovation can be accelerated through a holistic approach to data quality. Discover BDO’s practical framework for implementing data quality and rebuilding trust in data through a structured, scalable approach. BDO will share a real-world example of monitoring data at scale—from high-level executive dashboards to the details of daily ETL and ELT pipelines. Learn how they leveraged Soda’s data observability platform to unlock near-instant insights, and how they moved beyond legacy validation pipelines with built-in checks across their production Lakehouse. Whether you're a business leader defining data strategy or a data engineer building robust data products, this talk connects the strategic value of clean data with actionable techniques to make it a reality.

Embracing Unity Catalog and Empowering Innovation With Genie Room

Bagelcode, a leader in the social casino industry, has utilized Databricks since 2018 and manages over 10,000 tables via Hive Metastore. In 2024, we embarked on a transformative journey to resolve inefficiencies and unlock new capabilities. Over five months, we redesigned ETL pipelines with Delta Lake, optimized partitioned table logs and executed a seamless migration with minimal disruption. This effort improved governance, simplified management and unlocked Unity Catalog’s advanced features. Post-migration, we integrated the Genie Room with Slack to enable natural language queries, accelerating decision-making and operational efficiency. Additionally, a lineage-powered internal tool allowed us to quickly identify and resolve issues like backfill needs or data contamination. Unity Catalog has revolutionized our data ecosystem, elevating governance and innovation. Join us to learn how Bagelcode unlocked its data’s full potential and discover strategies for your own transformation.

Healthcare Interoperability: End-to-End Streaming FHIR Pipelines With Databricks & Redox

Redox & Databricks direct integration can streamline your interoperability workflows from responding in record time to preauthorization requests to letting attending physicians know about a change in risk for sepsis and readmission in near real time from ADTs. Data engineers will learn how to create fully-streaming ETL pipelines for ingesting, parsing and acting on insights from Redox FHIR bundles delivered directly to Unity Catalog volumes. Once available in the Lakehouse, AI/BI Dashboards and Agentic Frameworks help write FHIR messages back to Redox for direct push down to EMR systems. Parsing FHIR bundle resources has never been easier with SQL combined with the new VARIANT data type in Delta and streaming table creation against Serverless DBSQL Warehouses. We'll also use Databricks accelerators dbignite and redoxwrite for writing and posting FHIR bundles back to Redox integrated EMRs and we'll extend AI/BI with Unity Catalog SQL UDFs and the Redox API for use in Genie.

How to Migrate From Snowflake to Databricks SQL

Migrating your Snowflake data warehouse to the Databricks Data Intelligence Platform can accelerate your data modernization journey. Though a cloud platform-to-cloud platform migration should be relatively easy, the breadth of the Databricks Platform provides flexibility and hence requires careful planning and execution. In this session, we present the migration methodology, technical approaches, automation tools, product/feature mapping, a technical demo and best practices using real-world case studies for migrating data, ELT pipelines and warehouses from Snowflake to Databricks.

Optimizing Smart Meter IIoT Data in Databricks for At-Scale Interactive Electrical Load Analytics

Octave is a Plotly Dash application used daily by about 1,000 Hydro-Québec technicians and engineers to analyze smart meter load and voltage data from 4.5M meters across the province. As adoption grew, Octave’s back end was migrated to Databricks to address increasingly massive scale (>1T data points), governance and security requirements. This talk will summarize how Databricks was optimized to support performant at-scale interactive Dash application experiences while in parallel managing complex back-end ETL processes. The talk will outline optimizations targeted to further optimize query latency and user concurrency, along with plans to increase data update frequency. Non-technology related success factors to be reviewed will include the value of: subject matter expertise, operational autonomy, code quality for long-term maintainability and proactive vendor technical support.

HP's Data Platform Migration Journey: Redshift to Lakehouse

HP Print's data platform team took on a migration from a monolithic, shared resource of AWS Redshift, to a modular and scalable data ecosystem on Databricks lakehouse.​ The result was 30–40% cost savings, scalable and isolated resources for different data consumers and ETL workloads, and performance optimization for a variety of query types.​ Through this migration, there were technical challenges and learnings relating to the ETL migrations with DBT, new Databricks features like Liquid Clustering, predictive optimization, Photon, SQL serverless warehouses, managing multiple teams on Unity Catalog, and others.​ This presentation dives into both the business and technical sides of this migration. Come along as we share our key takeaways from this journey.​

Race to Real-Time: Low-Latency Streaming ETL Meets Next-Gen Databricks OLTP-DB

In today’s digital economy, real-time insights and rapid responsiveness are paramount to delivering exceptional user experiences and lowering TCO. In this session, discover a pioneering approach that leverages a low-latency streaming ETL pipeline built with Spark Structured Streaming and Databricks’ new OLTP-DB—a serverless, managed Postgres offering designed for transactional workloads. Validated in a live customer scenario, this architecture achieves sub-2 second end-to-end latency by seamlessly ingesting streaming data from Kinesis and merging it into OLTP-DB. This breakthrough not only enhances performance and scalability but also provides a replicable blueprint for transforming data pipelines across various verticals. Join us as we delve into the advanced optimization techniques and best practices that underpin this innovation, demonstrating how Databricks’ next-generation solutions can revolutionize real-time data processing and unlock a myriad of new use cases in data landscape.

Master Schema Translations in the Era of Open Data Lake

Unity Catalog puts variety of schemas into a centralized repository, now the developer community wants more productivity and automation for schema inference, translation, evolution and optimization especially for the scenarios of ingestion and reverse-ETL with more code generations.Coinbase Data Platform attempts to pave a path with "Schemaster" to interact with data catalog with the (proposed) metadata model to make schema translation and evolution more manageable across some of the popular systems, such as Delta, Iceberg, Snowflake, Kafka, MongoDB, DynamoDB, Postgres...This Lighting Talk covers 4 areas: The complexity and caveats of schema differences among The proposed field-level metadata model, and 2 translation patterns: point-to-point vs hub-and-spoke Why Data Profiling be augmented to enhance schema understanding and translation Integrate it with Ingestion & Reverse-ETL in a Databricks-oriented eco system Takeaway: standardize schema lineage & translation

Somebody Set Up Us the Bomb: Identifying List Bombing of End Users in an Email Anti-Spam Context

Traditionally, spam emails are messages a user does not want, containing some kind of threat like phishing. Because of this, detection systems can focus on malicious content or sender behavior. List bombing upends this paradigm. By abusing public forms such as marketing signups, attackers can fill a user's inbox with high volumes of legitimate mail. These emails don't contain threats, and each sender is following best practices to confirm the recipient wants to be subscribed, but the net effect for an end user is their inbox being flooded with dozens of emails per minute. This talk covers the the exploration and implementation for identifying this attack in our company's anti-spam telemetry: from reading and writing to Kafka, Delta table streaming for ETL workflows, multi-table liquid clustering design for efficient table joins, curating gold tables to speed up critical queries and using Delta tables as an auditable integration point for interacting with external services.