talk-data.com

Topic: DuckDB
Tags: embedded_database, analytics, olap
98 tagged activities

Activity Trend: peak of 13 activities per quarter (2020-Q1 to 2026-Q1)

Activities: 98 activities · Newest first

When Rivers Speak: Analyzing Massive Water Quality Datasets using USGS API and Remote SSH in Positron

Rivers have long been storytellers of human history. From the Nile to the Yangtze, they have shaped trade, migration, settlement, and the rise of civilizations. They reveal the traces of human ambition... and the costs of it. Today, from the Charles to the Golden Gate, US rivers continue to tell stories, especially through data.

Over the past decades, extensive water quality monitoring efforts have generated vast public datasets: millions of measurements of pH, dissolved oxygen, temperature, and conductivity collected across the country. These records are more than environmental snapshots; they are archives of political priorities, regulatory choices, and ecological disruptions. Ultimately, they are evidence of how societies interact with their environments, often unevenly.

In this talk, I’ll explore how Python and modern data workflows can help us "listen" to these stories at scale. Using the United States Geological Survey (USGS) Water Data APIs and Remote SSH in Positron, I’ll process terabytes of sensor data spanning several years and regions. I’ll show that, while Parquet and DuckDB enable scalable exploration of the historical records, Remote SSH is what makes analysis at this scale practical. Along the way, I hope to answer analytical questions that surface patterns linked to industrial growth, regulatory shifts, and climate change.
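As a rough illustration of the kind of query this workflow enables (the file paths and column names below are hypothetical, not the talk's actual dataset), DuckDB can aggregate years of Parquet-formatted measurements directly from Python:

```python
# Hypothetical layout: USGS measurements already landed as Parquet files.
import duckdb

con = duckdb.connect()  # in-memory; nothing to install or configure

# Only the referenced columns and matching row groups are read from disk.
yearly = con.sql("""
    SELECT
        site_id,
        date_trunc('year', measured_at) AS year,
        avg(ph)               AS mean_ph,
        avg(dissolved_oxygen) AS mean_do
    FROM read_parquet('water_quality/*.parquet')
    WHERE measured_at >= DATE '2000-01-01'
    GROUP BY site_id, year
    ORDER BY site_id, year
""").df()  # materialise as a pandas DataFrame for plotting

print(yearly.head())
```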

By treating rivers as both ecological systems and social mirrors, we can begin to see how environmental data encodes histories of inequality, resilience, and transformation.

Whether your interest lies in data engineering, environmental analytics, or the human dimensions of climate and infrastructure, this talk sits at the intersection of environmental science and data engineering, offering both technical methods and sociological lenses to understand the stories rivers continue to tell.

Extending SQL Databases with Python

What if your database could run Python code inside SQL? In this talk, we’ll explore how to extend popular databases using Python, without needing to write a line of C.

We’ll cover three systems—SQLite, DuckDB, and PostgreSQL—and show how Python can be used in each to build custom SQL functions, accelerate data workflows, and prototype analytical logic. Each database offers its own integration path:
- SQLite lets you register Python functions directly into SQL via sqlite3.create_function, and DuckDB offers the equivalent through create_function on its Python connection, making it easy to inject business logic or custom transformations.
- PostgreSQL offers PL/Python, a full-featured procedural language for writing SQL functions in Python. We’ll also touch on advanced use cases, including embedding the Python interpreter directly into a PostgreSQL extension for deeper integration.
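As a minimal sketch of the first two integration paths (the function name and logic are invented for illustration; the PL/Python path is server-side and not shown here):

```python
# Illustrative only: a toy scalar function registered in both engines.
import sqlite3

import duckdb
from duckdb.typing import VARCHAR

def domain_of(email: str) -> str:
    """Toy business logic: extract the domain part of an e-mail address."""
    return email.split("@")[-1].lower()

# SQLite: register the Python callable, then call it like any SQL function.
sq = sqlite3.connect(":memory:")
sq.create_function("domain_of", 1, domain_of)
print(sq.execute("SELECT domain_of('Jane.Doe@Example.COM')").fetchone())

# DuckDB: the equivalent registration goes through the connection's create_function.
dk = duckdb.connect()
dk.create_function("domain_of", domain_of, [VARCHAR], VARCHAR)
print(dk.execute("SELECT domain_of('Jane.Doe@Example.COM')").fetchone())
```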

By the end of this talk, you’ll understand the capabilities, limitations, and gotchas of Python-powered extensions in each system—and how to choose the right tool depending on your use case, whether you’re analyzing data, building pipelines, or hacking on your own database.

In this 2-hour hands-on workshop, you'll build an end-to-end streaming analytics pipeline that captures live cryptocurrency prices, processes them in real time, and uses AI to forecast the future. You'll ingest live crypto data into Apache Kafka using Kafka Connect, tame that chaos with Apache Flink's stream processing, freeze streams into queryable Apache Iceberg tables using Tableflow, and forecast price trends with Flink AI.

DuckDB is the best way to execute SQL on a single node, and its embedding-friendly nature also makes it an excellent foundation for building distributed systems. George Fraser, CEO of Fivetran, will tell us how Fivetran used DuckDB to power its Iceberg data lake writer—coordinating thousands of small, parallel tasks across a fleet of workers, each running DuckDB queries on bounded datasets. The result is a high-throughput, dual-format (Iceberg + Delta) data lake architecture in which every write scales linearly, snapshots stay perfectly in sync, and performance rivals a commercial database while remaining open and portable.

We introduce Explore23, a privacy-forward, secure, extensible, easy-to-use web application for browsing the multimodal data collected as part of the 23andMe, Inc. Research cohort, built heavily on the DuckDB ecosystem. While the 23andMe Research program has collected a large number of data types from the >11M customers who have consented to participate, there has not yet been a comprehensive tool for exploring and visualizing the cohort, which is invaluable for genomics-driven target discovery and validation. Any such exploration also needed to support extensibility to future data types and applications, scalability to large participant and variant cohorts, comprehension by non-experts and external parties, and, most importantly, protection of research participant privacy.

Explore23 uses DuckDB and the DuckDB extension ecosystem extensively throughout the lifecycle of the data it serves. A combination of pre-processing, backend result generation, and WASM-powered Mosaic integrations enables rapid search and visualization across the wide range of datasets collected, integrating data from the various stages of the 23andMe research "pipeline": raw survey questions, curated condition-based cohorts, genetic variants, and GWAS results. Of particular interest are the variant browser, which enables rapid, in-browser visualization of the over 170 million imputed and genotyped genetic variants in the 23andMe genetic panels, and the phenotypic pedigree summaries, which merge columnar datasets and graph queries (via DuckPGQ) to rapidly identify related participants in the research cohort who share specific conditions.

Each feature brought challenges, both internal and external, in finding and contextualizing specific datasets for groups not already well acquainted with the data (e.g., even browsing surveys) and in managing data scale. The front end serves data that has been pre-processed through rigorous masking logic to protect participant privacy.

In sum, Explore23 is an invaluable tool for research scientists exploring the immense complexity and diverse data of the 23andMe Research cohort. It highlights the incredible versatility of the DuckDB ecosystem to unify data access, from raw result processing up through in-browser visualizations.

We were told to scale compute. But what if the real problem was never big data, but bad data access? In this talk, we’ll unpack two powerful, often misunderstood techniques—projection pushdown and predicate pushdown—and why they matter more than ever in a world where we want lightweight, fast queries over large datasets. These optimizations aren’t just academic—they’re the difference between querying a terabyte in seconds vs. minutes. We’ll show how systems like Flink and DuckDB leverage these techniques, what limits them (hello, Protobuf), and how smart schema and storage design, especially in formats like Iceberg and Arrow, can unlock dramatic speed gains. Along the way, we’ll highlight the importance of landing data in queryable formats, and why indexing and query engines matter just as much as compute. This talk is for anyone who wants to stop fully scanning their data lakes just to read one field.
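A small sketch of what pushdown looks like in practice with DuckDB over Parquet; the file and column names are made up, and EXPLAIN is used only to inspect what gets pushed into the scan:

```python
# Hypothetical dataset: a directory of Parquet files with `event_date` and `amount` columns.
import duckdb

con = duckdb.connect()

# Projection pushdown: only `amount` and `event_date` are decoded from the columnar files.
# Predicate pushdown: row groups whose min/max statistics exclude 2024 are skipped entirely.
query = """
    SELECT sum(amount) AS total_2024
    FROM read_parquet('events/*.parquet')
    WHERE event_date >= DATE '2024-01-01'
"""
print(con.execute(query).fetchone())

# The plan shows the filter and column projection pushed into the Parquet scan.
con.sql("EXPLAIN " + query).show()
```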

How to make public data more accessible with "baked" data and DuckDB

Publicly available data is rarely analysis-ready, which makes it hard for researchers, organizations, and the public to access the information these datasets contain. One way to address this shortcoming is to "bake" the data into a structured format and ship it alongside code that can be used for analysis. For analytical work in particular, DuckDB provides a performant way to query the structured data in a variety of contexts.
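A minimal sketch of the pattern (not scipeds' actual API; the package, file, and column names are hypothetical): the package ships a pre-processed Parquet file and exposes helpers that query it with DuckDB:

```python
# Hypothetical package layout: mypackage/data/enrollment.parquet is the "baked",
# pre-processed dataset shipped with the wheel; query helpers wrap DuckDB.
from importlib import resources

import duckdb

def enrollment_by_year(state: str):
    """Illustrative helper: aggregate the packaged dataset for one state."""
    data_path = resources.files("mypackage") / "data" / "enrollment.parquet"
    con = duckdb.connect()  # in-memory; the baked data itself is read-only
    query = f"""
        SELECT year, sum(students) AS students
        FROM read_parquet('{data_path}')
        WHERE state = ?
        GROUP BY year
        ORDER BY year
    """
    return con.execute(query, [state]).df()
```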

This talk will explore the benefits and tradeoffs of this architectural pattern using the design of scipeds, an open source Python package for analyzing higher-education data in the US, as a case study.

No DuckDB experience required, beginner Python and programming experience recommended. This talk is aimed at data practitioners, especially those who work with public datasets.

At PyData Berlin, community members and industry voices highlighted how AI and data tooling are evolving across knowledge graphs, MLOps, small-model fine-tuning, explainability, and developer advocacy.

  • Igor Kvachenok (Leuphana University / ProKube) combined knowledge graphs with LLMs for structured data extraction in the polymer industry, and noted how MLOps is shifting toward LLM-focused workflows.
  • Selim Nowicki (Distill Labs) introduced a platform that uses knowledge distillation to fine-tune smaller models efficiently, making model specialization faster and more accessible.
  • Gülsah Durmaz (Architect & Developer) shared her transition from architecture to coding, creating Python tools for design automation and volunteering with PyData through PyLadies.
  • Yashasvi Misra (Pure Storage) spoke on explainable AI, stressing accountability and compliance, and shared her perspective as both a data engineer and active Python community organizer.
  • Mehdi Ouazza (MotherDuck) reflected on developer advocacy through video, workshops, and branding, showing how creative communication boosts adoption of open-source tools like DuckDB.

Igor Kvachenok Master’s student in Data Science at Leuphana University of Lüneburg, writing a thesis on LLM-enhanced data extraction for the polymer industry. Builds RDF knowledge graphs from semi-structured documents and works at ProKube on MLOps platforms powered by Kubeflow and Kubernetes.

Connect: https://www.linkedin.com/in/igor-kvachenok/

Selim Nowicki Founder of Distill Labs, a startup making small-model fine-tuning simple and fast with knowledge distillation. Previously led data teams at Berlin startups like Delivery Hero, Trade Republic, and Tier Mobility. Sees parallels between today’s ML tooling and dbt’s impact on analytics.

Connect: https://www.linkedin.com/in/selim-nowicki/

Gülsah Durmaz Architect turned developer, creating Python-based tools for architectural design automation with Rhino and Grasshopper. Active in PyLadies and a volunteer at PyData Berlin, she values the community for networking and learning, and aims to bring ML into architecture workflows.

Connect: https://www.linkedin.com/in/gulsah-durmaz/

Yashasvi (Yashi) Misra Data Engineer at Pure Storage, community organizer with PyLadies India, PyCon India, and Women Techmakers. Advocates for inclusive spaces in tech and speaks on explainable AI, bridging her day-to-day in data engineering with her passion for ethical ML.

Connect: https://www.linkedin.com/in/misrayashasvi/

Mehdi Ouazza Developer Advocate at MotherDuck, formerly a data engineer, now focused on building community and education around DuckDB. Runs popular YouTube channels ("mehdio DataTV" and "MotherDuck") and delivered a hands-on workshop at PyData Berlin. Blends technical clarity with creative storytelling.

Connect: https://www.linkedin.com/in/mehd-io/

DuckDB is well-loved by SQL-ophiles for handling their small data workloads. How do you make it scale? What happens when you feed it Big Data? What is this DuckLake thing I've been hearing about? This talk will help answer these questions, drawing on real-world experience running a DuckDB service in the cloud.

Summary: In this episode of the Data Engineering Podcast, Hannes Mühleisen and Mark Raasveldt, the creators of DuckDB, share their work on DuckLake, a new entrant in the open lakehouse ecosystem. They discuss how DuckLake focuses on simplicity and flexibility, offering a unified catalog and table format compared to other lakehouse formats like Iceberg and Delta. Hannes and Mark share insights into how DuckLake revolutionizes data architecture by enabling local-first data processing and simplifying the deployment of lakehouse solutions, and into benefits such as encryption features, data inlining, and integration with existing ecosystems.

Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed: flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI engineering, streaming: Prefect runs it all, from ingestion to activation, in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect. Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.

Your host is Tobias Macey, and today I'm interviewing Hannes Mühleisen and Mark Raasveldt about DuckLake, the latest entrant into the open lakehouse ecosystem.

Interview:
  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what DuckLake is and the story behind it?
  • What are the particular problems that DuckLake is solving for?
  • How does this compare to the capabilities of MotherDuck?
  • Iceberg and Delta already have a well-established ecosystem, but so does DuckDB. Who are the primary personas that you are trying to focus on in these early days of DuckLake?
  • One of the major factors driving the adoption of formats like Iceberg is cost efficiency for large volumes of data. That brings with it challenges of large batch processing of data. How does DuckLake account for these axes of scale?
  • There is also a substantial investment in the ecosystem of technologies that support Iceberg. The most notable ecosystem challenge for DuckDB and DuckLake is in the query layer. How are you thinking about the evolution and growth of that capability beyond DuckDB (e.g. support in Trino/Spark/Flink)?
  • What are your opinions on the viability of a future where DuckLake and Iceberg become a unified standard and implementation? (Why can't Iceberg REST catalog implementations just use DuckLake under the hood?)
  • Digging into the specifics of the specification and implementation, what are some of the capabilities that it offers above and beyond Iceberg?
  • Is it now possible to enforce PK/FK constraints, or indexing on underlying data?
  • Given that DuckDB has a vector type, how do you think about the support for vector storage/indexing?
  • How do the capabilities of DuckLake and the integration with DuckDB change the ways that data teams design their data architecture and access patterns?
  • What are your thoughts on the impact of "data gravity" in today's data ecosystem, with engines like DuckDB, KuzuDB, LanceDB, etc. available for embedded and edge use cases?
  • What are the most interesting, innovative, or unexpected ways that you have seen DuckLake used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on DuckLake?
  • When is DuckLake the wrong choice?
  • What do you have planned for the future of DuckLake?

Contact Info: Hannes (website), Mark (website)

Parting Question: From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements: Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it! Email [email protected] with your story.

Links: DuckDB (podcast episode), DuckLake, DuckDB Labs, MySQL, CWI, MonetDB, Iceberg, Iceberg REST Catalog, Delta, Hudi, Lance, DuckDB Iceberg Connector, ACID (Atomicity, Consistency, Isolation, Durability), MotherDuck, MotherDuck Managed DuckLake, Trino, Spark, Presto, Spark DuckLake Demo, Delta Kernel, Arrow, dlt, S3 Tables, Attribute Based Access Control (ABAC), Parquet, Arrow Flight, Hadoop, HDFS, DuckLake Roadmap

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
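For readers who want to try DuckLake hands-on, here is a minimal sketch along the lines of the extension's announced ATTACH workflow; the catalog, path, and table names are placeholders, and the exact options may differ by version:

```python
# Placeholders throughout: 'metadata.ducklake' is the local catalog database and
# 'lake_data/' the directory where Parquet data files will be written.
import duckdb

con = duckdb.connect()
con.sql("INSTALL ducklake")
con.sql("LOAD ducklake")

# The catalog (table metadata, snapshots) lives in a SQL database; data lands as Parquet.
con.sql("ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 'lake_data/')")

con.sql("CREATE TABLE lake.events AS SELECT 42 AS id, 'hello' AS payload")
con.sql("SELECT * FROM lake.events").show()
```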

Forget the Cloud: Building Lean Batch Pipelines from TCP Streams with Python and DuckDB

Many industrial and legacy systems still push critical data over TCP streams. Instead of reaching for heavyweight cloud platforms, you can build fast, lean batch pipelines on-prem using Python and DuckDB.

In this talk, you'll learn how to turn raw TCP streams into structured data sets, ready for analysis, all running on-premise. We'll cover key patterns for batch processing, practical architecture examples, and real-world lessons from industrial projects.
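A minimal sketch of this pattern, assuming line-based, comma-separated sensor readings on the wire (the host, port, and schema are hypothetical): read from the socket, buffer a batch, and append it to a local DuckDB file:

```python
# Sketch under the assumptions above: newline-delimited "sensor_id,timestamp,value"
# records arriving over TCP, batched and appended to a local DuckDB file.
import socket
from datetime import datetime

import duckdb

BATCH_SIZE = 1_000

con = duckdb.connect("telemetry.duckdb")
con.sql("CREATE TABLE IF NOT EXISTS readings (sensor_id TEXT, ts TIMESTAMP, value DOUBLE)")

# Host and port are placeholders for whatever device or gateway emits the stream.
with socket.create_connection(("192.0.2.10", 5020)) as sock, sock.makefile("r") as stream:
    batch = []
    for line in stream:
        sensor_id, ts, value = line.strip().split(",")
        batch.append((sensor_id, datetime.fromisoformat(ts), float(value)))
        if len(batch) >= BATCH_SIZE:
            # One insert batch per transaction keeps overhead low.
            con.executemany("INSERT INTO readings VALUES (?, ?, ?)", batch)
            batch.clear()
```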

If you work with sensor data, logs, or telemetry, and you value simplicity, speed, and control, this talk is for you.

Narwhals: enabling universal dataframe support

Ever tried passing a Polars DataFrame to a data science library and found that it...just works? No errors, no panics, no noticeable overhead, just...results? This is becoming increasingly common in 2025, yet only two years ago it was mostly unheard of. So, what changed? A large part of the answer is: Narwhals.

Narwhals is a lightweight compatibility layer between dataframe libraries which lets your code work seamlessly across Polars, pandas, PySpark, DuckDB, and more! And it's not just a theoretical possibility: with ~30 million monthly downloads and set as a required dependency of Altair, Bokeh, Marimo, Plotly, Shiny, and more, it's clear that it's reshaping the data science landscape. By the end of the talk, you'll understand why writing generic dataframe code was such a headache (and why it isn't anymore), how Narwhals works and how its community operates, and how you can use it in your projects today. The talk will be technical yet accessible and light-hearted.
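A small sketch of what dataframe-agnostic code looks like with Narwhals; the column names are invented, but the wrap/compute/unwrap pattern is the library's core idea:

```python
# The same function works unchanged on pandas, Polars, and other supported backends.
import narwhals as nw
from narwhals.typing import IntoFrameT

def add_fahrenheit(df: IntoFrameT) -> IntoFrameT:
    return (
        nw.from_native(df)  # wrap whatever dataframe came in
        .with_columns((nw.col("celsius") * 1.8 + 32).alias("fahrenheit"))
        .to_native()        # hand back the caller's native type
    )

import pandas as pd
import polars as pl

print(add_fahrenheit(pd.DataFrame({"celsius": [0.0, 100.0]})))
print(add_fahrenheit(pl.DataFrame({"celsius": [0.0, 100.0]})))
```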

More than DataFrames: Data Pipelines with the Swiss Army Knife DuckDB

Most Python developers reach for Pandas or Polars when working with tabular data—but DuckDB offers a powerful alternative that’s more than just another DataFrame library. In this tutorial, you’ll learn how to use DuckDB as an in-process analytical database: building data pipelines, caching datasets, and running complex queries with SQL—all without leaving Python. We’ll cover common use cases like ETL, lightweight data orchestration, and interactive analytics workflows. You’ll leave with a solid mental model for using DuckDB effectively as the “SQLite for analytics.”
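A compact sketch of the kind of in-process pipeline the tutorial covers (file and column names are illustrative): ingest raw CSVs, transform with SQL, and cache an analysis-ready Parquet file:

```python
# Everything runs in-process against a local database file; no server required.
import duckdb

con = duckdb.connect("pipeline.duckdb")  # persistent local database file

# Extract + Load: DuckDB infers the schema straight from the CSV files.
con.sql("""
    CREATE OR REPLACE TABLE raw_orders AS
    SELECT * FROM read_csv_auto('raw/orders_*.csv')
""")

# Transform: plain SQL instead of dataframe method chains.
con.sql("""
    CREATE OR REPLACE TABLE daily_revenue AS
    SELECT order_date, sum(quantity * unit_price) AS revenue
    FROM raw_orders
    GROUP BY order_date
""")

# Cache / hand off: write an analysis-ready Parquet file for downstream tools.
con.sql("COPY daily_revenue TO 'marts/daily_revenue.parquet' (FORMAT PARQUET)")
```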

In this episode, we talk with Orell about his journey from electrical engineering to freelancing in data engineering, exploring lessons from startup life, working with messy industrial data, the realities of freelancing, and how to stay up to date with new tools.

Topics covered:
  • Why Orell left a PhD and a simulation-focused start-up after Covid hit
  • What he learned trying (and failing) to commercialise medical-imaging simulations
  • The first freelance project and the long, quiet months that followed
  • How he now finds clients, keeps projects small, and delivers value quickly
  • Typical work he does for industrial companies: parsing messy machine logs, building simple pipelines, adding structure later
  • Favorite everyday tools (Python, DuckDB, a bit of C++) and the habit of blocking time for learning
  • Advice for anyone thinking about freelancing: cash runway, networking, and focusing on problems rather than “perfect” tech choices

A practical conversation for listeners who are curious about moving from research or permanent roles into freelance data engineering.

🕒 TIMECODES
0:00 Orell’s career and move to freelancing
9:04 Startup experience and data engineering lessons
16:05 Academia vs. startups and starting freelancing
25:33 Early freelancing challenges and networking
34:22 Freelance data engineering and messy industrial data
43:27 Staying practical, learning tools, and growth
50:33 Freelancing challenges and client acquisition
58:37 Tools, problem-solving, and manual work

🔗 CONNECT WITH ORELL Twitter - https://bsky.app/profile/orgarten.bsk... LinkedIn - / ogarten
Github - https://github.com/orgarten Website - https://orellgarten.com

🔗 CONNECT WITH DataTalksClub Join the community - https://datatalks.club/slack.html Subscribe to our Google calendar to have all our events in your calendar - https://calendar.google.com/calendar/... Check other upcoming events - https://lu.ma/dtc-events GitHub: https://github.com/DataTalksClub LinkedIn - / datatalks-club
Twitter - / datatalksclub
Website - https://datatalks.club/

🔗 CONNECT WITH ALEXEY Connect with Alexey Twitter - / al_grigor
Linkedin - / agrigorev

This is a free preview of a paid episode. To hear more, visit dataengineeringcentral.substack.com

Hello! A new episode of the Data Engineering Central Podcast is dropping today; we will be covering a few hot topics!
  • Apache Iceberg Catalogs
  • new Boring Catalog
  • new full Iceberg support from Databricks/Unity Catalog
  • Databricks SQL Scripting
  • DuckDB coming to a Lake House near you
  • Lakebase from Databricks
Going to be a great show, come along for the ride! Thanks …

User guides are the piece you often hit right after clicking the "Learn" or "Get Started" button in a package's documentation. They're responsible for onboarding new users and providing a learning path through a package. Surprisingly, while pieces of documentation like the API Reference tend to be the same, the design of user guides tends to differ across packages.

In this talk, I'll discuss how to design an effective user guide for open source software. I'll explain how the guides for Polars, DuckDB, and FastAPI balance working end-to-end like a course, with being browsable like a reference.

Structured Query Language (or SQL for short) is a programming language to manage data in a database system and an essential part of any data engineer’s tool kit. In this tutorial, you will learn how to use SQL to create databases, tables, insert data into them and extract, filter, join data or make calculations using queries. We will use DuckDB, a new open source embedded in-process database system that combines cutting edge database research with dataframe-inspired ease of use. DuckDB is only a pip install away (with zero dependencies), and runs right on your laptop. You will learn how to use DuckDB with your existing Python tools like Pandas, Polars, and Ibis to simplify and speed up your pipelines. Lastly, you will learn how to use SQL to create fast, interactive data visualizations, and how to teach your data how to fly and share it via the Cloud.