When Rivers Speak: Analyzing Massive Water Quality Datasets using USGS API and Remote SSH in Positron

2025-12-10 · PyData Boston 2025 Watch

talk

by Rodrigo Silva Ferreira

Analytics API DuckDB HTML Parquet Python

Rivers have long been storytellers of human history. From the Nile to the Yangtze, they have shaped trade, migration, settlement, and the rise of civilizations. They reveal the traces of human ambition... and the costs of it. Today, from the Charles to the Golden Gate, US rivers continue to tell stories, especially through data.

Over the past decades, extensive water quality monitoring efforts have generated vast public datasets: millions of measurements of pH, dissolved oxygen, temperature, and conductivity collected across the country. These records are more than environmental snapshots; they are archives of political priorities, regulatory choices, and ecological disruptions. Ultimately, they are evidence of how societies interact with their environments, often unevenly.

In this talk, I’ll explore how Python and modern data workflows can help us "listen" to these stories at scale. Using the United States Geological Survey (USGS) Water Data APIs and Remote SSH in Positron, I’ll process terabytes of sensor data spanning several years and regions. I’ll demonstrate that, while Parquet and DuckDB enable scalable exploration of historical records, using Remote SSH is paramount in order to enable large-scale data analysis. By doing so, I hope to answer some analytical questions that can surface patterns linked to industrial growth, regulatory shifts, and climate change.

By treating rivers as both ecological systems and social mirrors, we can begin to see how environmental data encodes histories of inequality, resilience, and transformation.

Whether your interest lies in data engineering, environmental analytics, or the human dimensions of climate and infrastructure, this talk will explore topics at the intersection of environmental science, will offer both technical methods and sociological lenses to understand the stories rivers continue to tell.

From Notebook to Pipeline: Hands-On Data Engineering with Python

2025-12-08 · PyData Boston 2025 Watch

talk

by Gilberto Hernandez

Cloud Computing Python SQL

In this hands-on tutorial, you'll go from a blank notebook to a fully orchestrated data pipeline built entirely in Python, all in under 90 minutes. You'll learn how to design and deploy end-to-end data pipelines using familiar notebook environments, using Python for your data loading, data transformations, and insights delivery.

We'll dive into the Ingestion-Tranformation-Delivery (ITD) framework for building data pipelines: ingest raw data from cloud object storage, transform the data using Python DataFrames, and deliver insights via a Streamlit application.

Basic familiarity with Python (and/or SQL) is helpful, but not required. By the end of the session, you'll understand practical data engineering patterns and leave with reusable code templates to help you build, orchestrate, and deploy data pipelines from notebook environments.

AWS re:Invent 2025 - Agentic data engineering with AWS Analytics MCP Servers (ANT335)

2025-12-07 · AWS re:Invent 2024 Watch

video

Agile/Scrum AI/ML Analytics Athena AWS Amazon EMR AWS Glue Cloud Computing Redshift

In this session, we will introduce AWS Analytics Model Context Protocol (MCP) Servers, including the Data Processing MCP Server and Amazon Redshift MCP Server, which enable agentic workflows across AWS Glue, Amazon EMR, Amazon Athena, and Amazon Redshift. You will learn how these open-source tools simplify complex analytics operations through natural language interactions with AI agents. We'll cover MCP server implementation strategies, real-world use cases, architectural patterns for deployment, and production best practices for building intelligent data engineering workflows that understand and orchestrate your analytics environment.

Learn more: More AWS events: https://go.aws/3kss9CP

Subscribe: More AWS videos: http://bit.ly/2O3zS75 More AWS events videos: http://bit.ly/316g9t4

ABOUT AWS: Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts. AWS is the world's most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.

AWSreInvent #AWSreInvent2025 #AWS

AWS re:Invent 2025 - Accelerating data engineering with AI Agents for AWS Analytics (ANT215)

2025-12-05 · AWS re:Invent 2024 Watch

video

Agile/Scrum AI/ML Analytics AWS Cloud Computing Python Amazon SageMaker Spark SQL

Data engineers face critical time sinks: writing code to build analytics pipelines from scratch and upgrading Apache Spark versions. In this lightning talk, discover how AWS is addressing both challenges with AI agents that accelerate development cycles. Learn how the Amazon SageMaker Data Agent transforms natural language instructions into executable SQL and Python code within SageMaker notebooks, maintaining full context awareness of your data sources and schemas. Then explore the Apache Spark upgrade agent, which accelerates complex multi-month upgrade projects into week-long initiatives through automated code analysis and transformation. Walk away understanding how these agents work to automate manual work from your data engineering workflows, whether you're building new applications or modernizing existing ones.

Learn more: More AWS events: https://go.aws/3kss9CP

Subscribe: More AWS videos: http://bit.ly/2O3zS75 More AWS events videos: http://bit.ly/316g9t4

ABOUT AWS: Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts. AWS is the world's most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.

AWSreInvent #AWSreInvent2025 #AWS

Minus Three Tier: Data Architecture Turned Upside Down

2025-09-26 · PyData Amsterdam 2025 Watch

talk

by Hannes Mühleisen (DuckDB Labs)

API Data Lakehouse DWH

Every data architecture diagram out there makes it abundantly clear who's in charge: At the bottom sits the analyst, above that is an API server, and on the very top sits the mighty data warehouse. This pattern is so ingrained we never ever question its necessity, despite its various issues like slow data response time, multi-level scaling issues, and massive cost.

But there is another way: Disconnect of storage and compute enables localization of query processing closer to people, leading to much snappier responses, natural scaling with client-side query processing, and much reduced cost.

In this talk, it will be discussed how modern data engineering paradigms like decomposition of storage, single-node query processing, and lakehouse formats enable a radical departure from the tired three-tier architecture. By inverting the architecture we can put user's needs first. We can rely on commoditised components like object store to enable fast, scalable, and cost-effective solutions.

What's New: Scaling Data Pipelines with SQL, dbt Projects, and Python

2025-06-17 · Summit 2025 - On Demand Watch

session

dbt Python Snowflake SQL

Learn how to efficiently scale and manage data engineering pipelines with Snowflake's latest capabilities for SQL- and Python-based transformations. Join us for new product and feature overviews, best practices and live demos.

Kill Bill-ing? Revenge is a Dish Best Served Optimized with GenAI

2025-06-12 · Data + AI Summit 2025 Watch

lightning_talk

by Abdul Furkhan (Sportsbet)

AI/ML Cloud Computing Databricks GenAI Spark

In an era where cloud costs can spiral out of control, Sportsbet achieved a remarkable 49% reduction in Total Cost of Ownership (TCO) through an innovative AI-powered solution called 'Kill Bill.' This presentation reveals how we transformed Databricks' consumption-based pricing model from a challenge into a strategic advantage through an intelligent automation and optimization. Understand how to use GenAI to reduce Databricks TCO Leverage generative AI within Databricks solutions enables automated analysis of cluster logs, resource consumption, configurations, and codebases to provide Spark optimization suggestions Create AI agentic workflows by integrating Databricks' AI tools and Databricks Data Engineering tools Review a case study demonstrating how Total Cost of Ownership was reduced in practice. Attendees will leave with a clear understanding of how to implement AI within Databricks solutions to address similar cost challenges in their environments.

Sponsored by: Dagster Labs | The Age of AI is Changing Data Engineering for Good

Databricks Lakeflow: the Foundation of Data + AI Innovation for Your Industry

2025-06-12 · Data + AI Summit 2025 Watch

talk

by Sam Sawyer (Databricks) , Ori Zohar (Databricks)

AI/ML Analytics BI Databricks

Every analytics, BI and AI project relies on high-quality data. This is why data engineering, the practice of building reliable data pipelines that ingest and transform data, is consequential to the success of these projects. In this session, we'll show how you can use Lakeflow to accelerate innovation in multiple parts of the organization. We'll review real-world examples of Databricks customers using Lakeflow in different industries such as automotive, healthcare and retail. We'll touch on how the foundational data engineering capabilities Lakeflow provides help power initiatives that improve customer experiences, make real-time decisions and drive business results.

Getting the Most Out of Lakeflow Declarative Pipelines: A Deep Dive on What’s New and Best Practices

2025-06-12 · Data + AI Summit 2025 Watch

talk

by Michael Armbrust (Databricks)

This deep dive covers advanced usage patterns, tips and best practices for maximizing the potential of Lakeflow Declarative Pipelines. Attendees will explore new features, enhanced workflows and cost-optimization strategies through a demo-heavy presentation. The session will also address complex use cases, showcasing how Lakeflow Declarative Pipelines simplifies the management of robust data pipelines while maintaining scalability and efficiency across diverse data engineering challenges.

IQVIA’s Serverless Journey: Enabling Data and AI in a Regulated World

2025-06-12 · Data + AI Summit 2025 Watch

talk

by Alex Esibov (Databricks) , Matthew Schwartz (IQVIA)

AI/ML Analytics Data Analytics Databricks Cyber Security

Your data and AI use-cases are multiplying. At the same time, there is increased focus and scrutiny to meet sophisticated security and regulatory requirements. IQVIA utilizes serverless use-cases across data engineering, data analytics, and ML and AI, to empower their customers to make informed decisions, support their R&D processes and improve patient outcomes. By leveraging native controls on the platform, serverless enables them to streamline their use cases while maintaining a strong security posture, top performance and optimized costs. This session will go over IQVIA’s journey to serverless, how they met their security and regulatory requirements, and the latest and upcoming enhancements to the Databricks Platform.

A Practical Roadmap to Becoming an Expert Databricks Data Engineer

2025-06-12 · Data + AI Summit 2025 Watch

lightning_talk

by Derar Alhussein (Acadford)

Databricks

The demand for skilled Databricks data engineers continues to rise as enterprises accelerate their adoption of the Databricks platform. However, navigating the complex ecosystem of data engineering tools, frameworks and best practices can be overwhelming. This session provides a structured roadmap to becoming an expert Databricks data engineer, offering a clear progression from foundational skills to advanced capabilities. Acadford, a leading training provider, has successfully trained thousands of data engineers on Databricks, equipping them with the skills needed to excel in their careers and obtain professional certifications. Drawing on this experience, we will guide attendees through the most in-demand skills and knowledge areas through a combination of structured learning and practical insights. Key takeaways: Understand the core tech stack in Databricks Explore real-world code examples and live demonstrations Receive an actionable learning path with recommended resources

Automating Engineering with AI - LLMs in Metadata Driven Frameworks

2025-06-12 · Data + AI Summit 2025 Watch

lightning_talk

by Simon Whiteley (Advancing Analytics)

AI/ML Data Quality LLM

The demand for data engineering keeps growing, but data teams are bored by repetitive tasks, stumped by growing complexity and endlessly harassed by an unrelenting need for speed. What if AI could take the heavy lifting off your hands? What if we make the move away from code-generation and into config-generation — how much more could we achieve? In this session, we’ll explore how AI is revolutionizing data engineering, turning pain points into innovation. Whether you’re grappling with manual schema generation or struggling to ensure data quality, this session offers practical solutions to help you work smarter, not harder. You’ll walk away with a good idea of where AI is going to disrupt the data engineering workload, some good tips around how to accelerate your own workflows and an impending sense of doom around the future of the industry!

Democratizing Data Engineering with Databricks and dbt at Ludia

2025-06-12 · Data + AI Summit 2025 Watch

talk

by Jean-Christophe Rodrigue (Ludia) , Huntting Buckley (Databricks)

Databricks dbt

Ludia, a leading mobile gaming company, is empowering its analysts and domain experts by democratizing data engineering with Databricks and dbt. This talk explores how Ludia enabled cross-functional teams to build and maintain production-grade data pipelines without relying solely on centralized data engineering resources—accelerating time to insight, improving data reliability, and fostering a culture of data ownership across the organization.

How Navy Federal's Enterprise Data Ecosystem Leverages Unity Catalog for Data + AI Governance

2025-06-12 · Data + AI Summit 2025 Watch

talk

by Krishnakumar Sivasubramanian (NFCU) , Ricardo Portilla (Databricks)

AI/ML Cloud Computing Data Lake DWH

Navy Federal Credit Union has 200+ enterprise data sources in the enterprise data lake. These data assets are used for training 100+ machine learning models and hydrating a semantic layer for serving, at an average 4,000 business users daily across the credit union. The only option for extracting data from analytic semantic layer was to allow consuming application to access it via an already-overloaded cloud data warehouse. Visualizing data lineage for 1,000 + data pipelines and associated metadata is impossible and understanding the granular cost for running data pipelines is a challenge. Implementing Unity Catalog opened alternate path for accessing analytic semantic data from lake. It also opened the doors to remove duplicate data assets stored across multiple lakes which will save hundred thousands of dollars in data engineering efforts, compute and storage costs.

Lakeflow in Production: CI/CD, Testing and Monitoring at Scale

2025-06-12 · Data + AI Summit 2025 Watch

talk

by Adriana Ispas (Databricks) , Lennart Kats (Databricks)

CI/CD Data Quality Databricks Git

Building robust, production-grade data pipelines goes beyond writing transformation logic — it requires rigorous testing, version control, automated CI/CD workflows and a clear separation between development and production. In this talk, we’ll demonstrate how Lakeflow, paired with Databricks Asset Bundles (DABs), enables Git-based workflows, automated deployments and comprehensive testing for data engineering projects. We’ll share best practices for unit testing, CI/CD automation, data quality monitoring and environment-specific configurations. Additionally, we’ll explore observability techniques and performance tuning to ensure your pipelines are scalable, maintainable and production-ready.

Sponsored by: Acceldata | Agentic Data Management: Trusted Data for Enterprise AI on Databricks

Harnessing Databricks Asset Bundles: Transforming Pipeline Management at Scale at Stack Overflow

2025-06-11 · Data + AI Summit 2025 Watch

talk

by Chelsea Zhang (Stack Overflow)

Databricks Cyber Security

Discover how Stack Overflow optimized its data engineering workflows using Databricks Asset Bundles (DABs) for scalable and efficient pipeline deployments. This session explores the structured pipeline architecture, emphasizing code reusability, modular design and bundle variables to ensure clarity and data isolation across projects. Learn how the data team leverages enterprise infrastructure to streamline deployment across multiple environments. Key topics include DRY-principled modular design, essential DAB features for automation and data security strategies using Unity Catalog. Designed for data engineers and teams managing multi-project workflows, this talk offers actionable insights on optimizing pipelines with Databricks evolving toolset.

talk-data.com

Data Engineering

Activity Trend

Top Events

Top Speakers

When Rivers Speak: Analyzing Massive Water Quality Datasets using USGS API and Remote SSH in Positron

From Notebook to Pipeline: Hands-On Data Engineering with Python

AWS re:Invent 2025 - Agentic data engineering with AWS Analytics MCP Servers (ANT335)

AWSreInvent #AWSreInvent2025 #AWS

AWS re:Invent 2025 - Accelerating data engineering with AI Agents for AWS Analytics (ANT215)

AWSreInvent #AWSreInvent2025 #AWS

Minus Three Tier: Data Architecture Turned Upside Down

What's New: Scaling Data Pipelines with SQL, dbt Projects, and Python

Kill Bill-ing? Revenge is a Dish Best Served Optimized with GenAI

Sponsored by: Dagster Labs | The Age of AI is Changing Data Engineering for Good

Databricks Lakeflow: the Foundation of Data + AI Innovation for Your Industry

Getting the Most Out of Lakeflow Declarative Pipelines: A Deep Dive on What’s New and Best Practices

IQVIA’s Serverless Journey: Enabling Data and AI in a Regulated World

A Practical Roadmap to Becoming an Expert Databricks Data Engineer

Automating Engineering with AI - LLMs in Metadata Driven Frameworks

Democratizing Data Engineering with Databricks and dbt at Ludia

How Navy Federal's Enterprise Data Ecosystem Leverages Unity Catalog for Data + AI Governance

Lakeflow in Production: CI/CD, Testing and Monitoring at Scale

Sponsored by: Acceldata | Agentic Data Management: Trusted Data for Enterprise AI on Databricks

Sponsored by: dbt Labs | Leveling Up Data Engineering at Riot: How We Rolled Out dbt and Transformed the Developer Experience

Sponsored by: West Monroe | Disruptive Forces: LLMs and the New Age of Data Engineering

Harnessing Databricks Asset Bundles: Transforming Pipeline Management at Scale at Stack Overflow