DuckDB is well loved by SQL-ophiles for handling their small-data workloads. How do you make it scale? What happens when you feed it Big Data? What is this DuckLake thing I've been hearing about? This talk answers these questions from real-world experience running a DuckDB service in the cloud.
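For readers who haven't met DuckLake yet: it is DuckDB's lakehouse format, exposed as an extension. A minimal sketch of what usage looks like, assuming the ducklake extension is installed and with placeholder file and table names:

```python
# Minimal DuckLake sketch: the catalog lives in a small metadata database,
# while table data is stored as Parquet files alongside it.
# 'metadata.ducklake' and the table name are placeholders.
import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")
con.execute("ATTACH 'ducklake:metadata.ducklake' AS lake")
con.execute("USE lake")
con.execute("CREATE TABLE events AS SELECT 1 AS id, 'click' AS kind")
print(con.sql("SELECT * FROM events").fetchall())
```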
Over the last decade, Big Data was everywhere. Let's set the record straight on what is and isn't Big Data. We have been consumed by a conversation about data volumes when we should focus more on the immediate task at hand: simplifying our work.
Some of us may have Big Data, but our quest to derive insights from it is measured in small slices of work that fit on your laptop or in your hand. Easy data is here; let's make the most of it.
📓 Resources
Big Data is Dead: https://motherduck.com/blog/big-data-is-dead/
Small Data Manifesto: https://motherduck.com/blog/small-data-manifesto/
Small Data SF: https://www.smalldatasf.com/
Explore the "Small Data" movement, a counter-narrative to the prevailing big data conference hype. This talk challenges the assumption that data scale is the most important feature of every workload, defining big data as any dataset too large for a single machine. We'll unpack why this distinction is crucial for modern data engineering and analytics, setting the stage for a new perspective on data architecture.
Delve into the history of big data systems, starting with the non-linear hardware costs that plagued early data practitioners. Discover how Google's foundational papers on GFS, MapReduce, and Bigtable led to the creation of Hadoop, fundamentally changing how we scale data processing. We'll break down the "big data tax"—the inherent latency and system complexity overhead required for distributed systems to function, a critical concept for anyone evaluating data platforms.
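To make the "big data tax" concrete, here is a toy cost model, with illustrative numbers of my own rather than figures from the talk: a distributed query pays a fixed coordination overhead (scheduling, shuffle setup, stragglers) that a single node avoids, so for small scans the cluster can actually be slower.

```python
# Toy model of the "big data tax": fixed distributed-systems overhead
# plus scan time. All numbers are illustrative assumptions, not measurements.
def query_seconds(scan_gb, nodes, per_node_gb_per_s=1.0, coordination_s=2.0):
    overhead = coordination_s if nodes > 1 else 0.0  # the "tax" a cluster pays
    return overhead + scan_gb / (nodes * per_node_gb_per_s)

for scan_gb in (1, 10, 1000):
    single = query_seconds(scan_gb, nodes=1)
    cluster = query_seconds(scan_gb, nodes=32)
    print(f"{scan_gb:>5} GB  single node: {single:7.2f}s   32 nodes: {cluster:7.2f}s")
# For small scans the fixed tax dominates and one node wins;
# only at large scan sizes does the cluster pay off.
```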
Learn about the architectural cornerstone of the modern cloud data warehouse: the separation of storage and compute. This design, popularized by systems like Snowflake and Google BigQuery, allows storage to scale almost infinitely while compute resources are provisioned on-demand. Understand how this model paved the way for massive data lakes but also introduced new complexities and cost considerations that are often overlooked.
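The same separation can be seen in miniature with DuckDB itself: compute is whatever machine runs the process, while the data sits in object storage. A sketch using DuckDB's httpfs extension; the bucket and path are hypothetical, and AWS credentials are assumed to be available in the environment:

```python
# Separated storage and compute in miniature: a local DuckDB process (compute)
# scans Parquet held in S3 (storage). Bucket and path are placeholders.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.sql("""
    SELECT date_trunc('day', ts) AS day, count(*) AS n
    FROM read_parquet('s3://example-bucket/events/*.parquet')
    GROUP BY day
    ORDER BY day
""").show()
```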
We examine the cracks appearing in the big data paradigm, especially for OLAP workloads. While systems like Snowflake are still dominant, the rise of powerful alternatives like DuckDB signals a shift. We reveal the hidden costs of big data analytics, exemplified by a petabyte-scale query costing nearly $6,000, and argue that for most use cases, it's too expensive to run computations over massive datasets.
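The arithmetic behind that price tag is short. Assuming a published on-demand rate in the typical $5 to $6.25 per TB range (rates vary by vendor and year), a full petabyte scan lands right around that figure:

```python
# Back-of-the-envelope cost of a petabyte-scale scan at assumed
# on-demand rates (~$5-6.25 per TB scanned; exact pricing varies by vendor).
tb_scanned = 1024  # 1 PB expressed in TB (binary units)
for usd_per_tb in (5.00, 5.75, 6.25):
    print(f"${usd_per_tb:.2f}/TB -> ${tb_scanned * usd_per_tb:,.0f} per query")
# At ~$5.75/TB a single full scan is close to the ~$6,000 figure above.
```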
The key to efficient data processing isn't your total data size, but the size of your "hot data" or working set. This talk argues that the revenge of the single node is here, as modern hardware can often handle the actual data queried without the overhead of the big data tax. This is a crucial optimization technique for reducing cost and improving performance in any data warehouse.
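One way the working-set idea shows up in practice is partition pruning: with Hive-style partitioned Parquet, DuckDB only reads the files a query's filter touches. A sketch with hypothetical paths and column names:

```python
# The working set, not total size: with hive_partitioning enabled, DuckDB
# prunes to the partitions the filter selects. Paths and names are placeholders.
import duckdb

con = duckdb.connect()
con.sql("""
    SELECT user_id, count(*) AS sessions
    FROM read_parquet('data/events/year=*/month=*/*.parquet',
                      hive_partitioning = true)
    WHERE year = 2024 AND month = 6   -- only this month's files are read
    GROUP BY user_id
""").show()
# Even if data/events holds years of history, the bytes scanned are
# bounded by the hot slice the filter selects.
```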
Discover the core principles for designing systems in a post-big data world. We'll show that since only 1 in 500 users run true big data queries, prioritizing simplicity over premature scaling is key. For low latency, process data close to the user with tools like DuckDB and SQLite. This local-first approach offers a compelling alternative to cloud-centric models, enabling faster, more cost-effective, and innovative data architectures.
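As a small illustration of the local-first idea, DuckDB can query an application's SQLite database in place on the user's machine, with no warehouse round trip. The file and table names here are placeholders:

```python
# Local-first sketch: analytics directly over a local SQLite file via
# DuckDB's sqlite extension. 'app.db' and 'orders' are placeholders.
import duckdb

con = duckdb.connect()
con.execute("INSTALL sqlite")
con.execute("LOAD sqlite")
con.execute("ATTACH 'app.db' AS app (TYPE sqlite)")
con.sql("""
    SELECT status, count(*) AS n
    FROM app.orders
    GROUP BY status
""").show()
```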
Jordan Tigani is back to chat about why small data is awesome, data lakehouses, DuckDB, AI, and much more.
MotherDuck: https://motherduck.com/
LinkedIn: https://www.linkedin.com/in/jordantigani/
Twitter: https://twitter.com/jrdntgn?lang=en
Running a full-fledged analytical database inside the client opens up new ways of executing a query: you can run part of it locally and part remotely. Once you can split the query plan into two pieces, the same mechanism generalizes to N stages, arranged in series or as a tree. This talk covers the hybrid execution system built on DuckDB at MotherDuck, and explores some further query topologies enabled by this pattern.
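From the client's point of view, hybrid execution looks like one connection and one SQL statement, with the planner free to place pieces locally or in the cloud. A sketch assuming a MotherDuck account with MOTHERDUCK_TOKEN set; the database, file, and table names are placeholders, and where the split happens is decided by the optimizer, not by the user:

```python
# Hybrid execution sketch: join a local Parquet file with a cloud-resident
# table over a single MotherDuck connection. Names are placeholders.
import duckdb

con = duckdb.connect("md:my_db")  # 'md:' routes the session through MotherDuck
con.sql("""
    SELECT c.region, sum(o.amount) AS revenue
    FROM 'local_orders.parquet' AS o               -- can be scanned on this machine
    JOIN my_db.customers AS c USING (customer_id)  -- can be resolved in the cloud
    GROUP BY c.region
""").show()
```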
Speaker: Jordan Tigani, Co-Founder & Chief Duck-Herder, MotherDuck
Register for Coalesce at https://coalesce.getdbt.com
This talk will make the case that the era of Big Data is over. Now we can stop worrying about data size and focus on how we’re going to use it to make better decisions.
The data behind the graphs shown in this talk comes from Jordan Tigani's analysis of query logs, deal post-mortems, benchmark results (published and unpublished), customer support tickets, customer conversations, service logs, and published blog posts, plus a bit of intuition.
ABOUT THE SPEAKER: Jordan Tigani is co-founder and chief duck-herder at MotherDuck, a startup building a serverless analytics platform based on DuckDB. He helped create Google BigQuery, wrote two books on it, and led first its engineering team and then its product team through its first $1B or so in revenue.
ABOUT DATA COUNCIL: Data Council (https://www.datacouncil.ai/) is a community and conference series that provides data professionals with the learning and networking opportunities they need to grow their careers.
Jordan Tigani is an expert in large-scale data processing, having spent more than a decade on the development and growth of BigQuery, and later SingleStore. Today, Jordan and his team at MotherDuck are in the early days of working on commercial applications for the open-source DuckDB OLAP database. In this conversation with Tristan and Julia, Jordan dives into the origin story of BigQuery, why he thinks we should do away with the concept of working in files, and how truly performant "data apps" will require bringing data to an end user's machine (rather than requiring them to query a warehouse directly).