Easy, fast, and scalable: pick 3. MotherDuck’s managed DuckLake data lakehouse blends the cost efficiency, scale, and openness of a lakehouse with the speed of a warehouse for truly joyful dbt pipelines. We will show you how!
By introducing a range of AI-enhanced products that amplify creativity and interactivity across its platforms, BuzzFeed has been able to connect with the largest global audience of young people online and cement its role as the defining digital media company of the AI era. Notably, some of BuzzFeed's most successful tools and content experiences thrive on the power of small, focused datasets. Still wondering how Shrek fits into the picture? You'll have to watch!
Video from: https://smalldatasf.com/
📓 Resources Big Data is Dead: https://motherduck.com/blog/big-data-is-dead/ Small Data Manifesto: https://motherduck.com/blog/small-data-manifesto/ Is Excel Immortal?: https://benn.substack.com/p/is-excel-immortal Small Data SF: https://www.smalldatasf.com/
➡️ Follow Us
LinkedIn: https://linkedin.com/company/motherduck
X/Twitter: https://twitter.com/motherduck
Bluesky: motherduck.com
Blog: https://motherduck.com/blog/
Discover how BuzzFeed's Data team, led by Gilad Cohen, harnesses AI for creative purposes, leveraging large language models (LLMs) and generative image capabilities to enhance content creation. This video explores how machine learning teams build tools to create new interactive media experiences, focusing on augmenting creative workflows rather than replacing jobs, allowing readers to participate more deeply in the content they consume.
We dive into the core data science problem of understanding what a piece of content is about, a crucial step for improving content recommendation systems. Learn why traditional methods fall short and how the team is constantly seeking smaller, faster, and more performant models. This exploration covers the evolution from earlier architectures like DistilBERT to modern, more efficient approaches for better content representation, clustering, and user personalization.
A key technique explored is the use of text embeddings, which are dense, low-dimensional vector representations of data. This video provides an accessible explanation of embeddings as a form of compressed knowledge, showing how BuzzFeed creates a unique vector for each article. This allows for simple vector math to find semantically similar content, forming a foundational infrastructure for powerful ranking and recommender systems.
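To make the embedding idea concrete, here is a minimal sketch, not BuzzFeed's production code: it encodes a few invented article titles with an off-the-shelf sentence-transformers model (the model name and titles are assumptions for illustration) and uses cosine similarity to find the closest match to a query.

```python
# Minimal embedding-similarity sketch (illustrative only, not BuzzFeed's pipeline).
# Assumes the sentence-transformers package; model name and titles are placeholders.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, fast general-purpose encoder

articles = [
    "21 Dinner Recipes You Can Make In Under 30 Minutes",
    "Quick Weeknight Meals For Busy People",
    "The Best Moments From Last Night's Awards Show",
]

# One dense vector per article: the "compressed knowledge" described above.
article_vecs = model.encode(articles, normalize_embeddings=True)

# With unit-normalized vectors, cosine similarity reduces to a dot product.
query_vec = model.encode(["easy weeknight dinner ideas"], normalize_embeddings=True)[0]
scores = article_vecs @ query_vec
print(articles[int(np.argmax(scores))])  # the most semantically similar article
```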
Explore how BuzzFeed leverages generative image capabilities to create new interactive formats. The journey began with Midjourney experiments and evolved to building custom tools by fine-tuning a Stable Diffusion XL model using LoRA (Low-Rank Adaptation). This advanced technique provides greater control over image output, enabling the rapid creation of viral AI generators that respond to trending topics and allow for massive user engagement.
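As a rough illustration of the technique (the adapter repo id below is a placeholder, not BuzzFeed's actual fine-tune), Hugging Face's diffusers library can load LoRA adapter weights on top of an SDXL base model:

```python
# Illustrative sketch: applying LoRA adapter weights to a Stable Diffusion XL base
# model with diffusers. The LoRA repo id is hypothetical; requires a CUDA GPU.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Low-rank adapter produced by a fine-tuning run on brand- or format-specific imagery.
pipe.load_lora_weights("your-org/your-sdxl-lora")  # hypothetical repo id

image = pipe("a friendly green ogre reimagined as a glossy magazine cover star").images[0]
image.save("generated.png")
```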
Finally, see a practical application of machine learning for content optimization. BuzzFeed uses its vast historical dataset from Bayesian A/B testing to train a model that predicts headline performance. By generating multiple headline candidates with an LLM like Claude and running them through this predictive model, they can identify the winning headline. This showcases how to use unique, in-house data to build powerful tools that improve click-through rates and drive engagement, pointing to a significant transformation in how media is created and consumed.
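A hedged sketch of that loop, with every name hypothetical: assume an LLM has already produced a handful of candidate headlines, and a click-through model trained on historical A/B test outcomes scores them.

```python
# Hypothetical sketch of headline selection: score LLM-generated candidates with a
# CTR model trained on historical A/B test data. Model and file names are invented.
import joblib

ctr_model = joblib.load("headline_ctr_model.joblib")    # hypothetical trained classifier
featurizer = joblib.load("headline_featurizer.joblib")  # hypothetical text featurizer

candidates = [  # imagine these came back from an LLM prompted for headline variants
    "You Won't Believe These 17 Kitchen Hacks",
    "17 Kitchen Hacks Professional Chefs Swear By",
    "These Kitchen Hacks Will Change How You Cook",
]

# Predicted probability of a click for each candidate; pick the winner.
scores = ctr_model.predict_proba(featurizer.transform(candidates))[:, 1]
best_score, best_headline = max(zip(scores, candidates))
print(f"Predicted winner ({best_score:.1%} CTR): {best_headline}")
```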
As a Head of Data or a one-person data team, keeping the lights on for the business while running all things data-related as efficiently as possible is no small feat. This talk will focus on tactics and strategies to manage within and around constraints, including monetary costs, time and resources, and data volumes.
📓 Resources Big Data is Dead: https://motherduck.com/blog/big-data-is-dead/ Small Data Manifesto: https://motherduck.com/blog/small-data-manifesto/ Is Excel Immortal?: https://benn.substack.com/p/is-excel-immortal Small Data SF: https://www.smalldatasf.com/
➡️ Follow Us
LinkedIn: https://linkedin.com/company/motherduck
X/Twitter: https://twitter.com/motherduck
Blog: https://motherduck.com/blog/
Learn how your data team can drive innovation and maximize ROI by embracing constraints, drawing inspiration from SpaceX's revolutionary cost-effective approach. This video challenges the "abundance mindset" prevalent in the modern data stack, where easily scalable cloud data warehouses and a surplus of tools often lead to unmanageable data models and underutilized dashboards. We explore a focused data strategy for extracting maximum value from small data, shifting the paradigm from "more data" to more impact.
To maximize value, data teams must move beyond being order-takers and practice strategic stakeholder management. Discover how to use frameworks like the stakeholder engagement matrix to prioritize high-impact business leaders and align your work with core business goals. This involves speaking the language of business growth models, not technical jargon about data pipelines or orchestration, ensuring your data engineering efforts resonate with key decision-makers and directly contribute to revenue-generating activities.
Embracing constraints is key to innovation and effective data project management. We introduce the Iron Triangle—a fundamental engineering concept balancing scope, cost, and time—as a powerful tool for planning data projects and having transparent conversations with the business. By treating constraints not as limitations but as opportunities, data engineers and analysts can deliver higher-quality data products without succumbing to scope creep or uncontrolled costs.
A critical component of this strategy is understanding the Total Cost of Ownership (TCO), which goes far beyond initial compute costs to include ongoing maintenance, downtime, and the risk of vendor pricing changes. Learn how modern, efficient tools like DuckDB and MotherDuck are designed for cost containment from the ground up, enabling teams to build scalable, cost-effective data platforms. By making the true cost of data requests visible, you can foster accountability and make smarter architectural choices. Ultimately, this guide provides a blueprint for resisting data stack bloat and turning cost and constraints into your greatest assets for innovation.
This is a talk about how we thought we had Big Data, and we built everything planning for Big Data, but then it turns out we didn't have Big Data, and while that's nice and fun and seems more chill, it's actually ruining everything, and I am here asking you to please help us figure out what we are supposed to do now.
📓 Resources Big Data is Dead: https://motherduck.com/blog/big-data-is-dead/ Small Data Manifesto: https://motherduck.com/blog/small-data-manifesto/ Is Excel Immortal?: https://benn.substack.com/p/is-excel-immortal Small Data SF: https://www.smalldatasf.com/
➡️ Follow Us
LinkedIn: https://linkedin.com/company/motherduck
X/Twitter: https://twitter.com/motherduck
Blog: https://motherduck.com/blog/
Mode founder David Wheeler challenges the data industry's obsession with "big data," arguing that most companies are actually working with "small data," and our tools are failing us. This talk deconstructs the common sales narrative for BI tools, exposing why the promise of finding game-changing insights through data exploration often falls flat. If you've ever built dashboards nobody uses or wondered why your analytics platform doesn't deliver on its promises, this is a must-watch reality check on the modern data stack.
We explore the standard BI demo, where an analyst uncovers a critical insight by drilling into event data. This story sells tools like Tableau and Power BI, but it rarely reflects reality, leading to a "revolving door of BI" as companies swap tools every few years. Discover why the narrative of the intrepid analyst finding a needle in the haystack only works in movies and how this disconnect creates a cycle of failed data initiatives and unused "trashboards."
The presentation traces our belief that "data is the new oil" back to the early 2010s, with examples from Target's predictive analytics and Facebook's growth hacking. However, these successes were built on truly massive datasets. For most businesses, analyzing small data results in noisy charts that offer vague "directional vibes" rather than clear, actionable insights. We contrast the promise of big data analytics with the practical challenges of small data interpretation.
Finally, learn actionable strategies for extracting real value from the data you actually have. We argue that BI tools should shift focus from data exploration to data interpretation, helping users understand what their charts actually mean. Learn why "doing things that don't scale," like manually analyzing individual customer journeys, can be more effective than complex models for small datasets. This talk offers a new perspective for data scientists, analysts, and developers looking for better data analysis techniques beyond the big data hype.
Discover how to cut the complexity of your dbt data pipelines with serverless DuckDB while improving performance and drastically reducing costs. This session covers practical strategies for trimming complexity and expense in your data flows while enjoying a more ergonomic, frictionless workflow. Learn how adopting a DuckDB-based architecture can streamline your operations, enhance developer experience, and boost efficiency.
Speaker: Alex Monahan, Forward Deployed Software Engineer, MotherDuck
Read the blog to learn about the latest dbt Cloud features announced at Coalesce, designed to help organizations embrace analytics best practices at scale https://www.getdbt.com/blog/coalesce-2024-product-announcements
As Joe Reis recently opined, if you want to know what's next in data engineering, just look at the software engineer. The MDS-in-a-box pattern has been a game changer for applying software engineering principles to local data development, improving our ability to share data and collaborate on modeling work and data analysis the same way we build and share open source tooling.
This panel brings together experts in data engineering, data analytics and software engineering to explore the current state of the pattern, pieces that remain missing today and how emerging tools and data engineering testing capabilities can refine the transition from local development to production workflows.
Speakers: Matt Housley, CTO, Halfpipe Systems; Mehdi Ouazza, Developer Advocate, MotherDuck; Sung Won Chung, Solutions Engineer, Datafold; Louise de Leyritz, Host, The Data Couch podcast
Register for Coalesce at https://coalesce.getdbt.com
Running a full-fledged analytical database inside the client opens up new ways of executing your query: you can run part of your query locally and part remotely. Once you can split the query plan into two pieces, the same mechanism works with N stages, which can be arranged in series or as a tree. This talk discusses the hybrid execution system based on DuckDB built at MotherDuck, and also covers some further query topologies enabled by this pattern.
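For a sense of what this looks like from the client side, here is a hedged sketch (it assumes a MotherDuck account with MOTHERDUCK_TOKEN set; the database, table, and file names are placeholders): a local DuckDB connection attached to MotherDuck can scan a local Parquet file and a cloud-resident table in the same query, leaving the planner to decide which pieces run where.

```python
# Hedged sketch of hybrid execution from a DuckDB client connected to MotherDuck.
# Assumes MOTHERDUCK_TOKEN is set in the environment; all names are placeholders.
import duckdb

con = duckdb.connect("md:my_db")  # local DuckDB process attached to a MotherDuck database

# One query spanning a local Parquet file and a cloud-resident table; the hybrid
# planner splits the plan between the laptop and the service.
con.sql("""
    SELECT c.region, count(*) AS orders
    FROM 'local_orders.parquet' AS o
    JOIN my_db.main.customers AS c ON o.customer_id = c.id
    GROUP BY c.region
    ORDER BY orders DESC
""").show()
```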
Speaker: Jordan Tigani, Co-Founder & Chief Duck-Herder, MotherDuck
Register for Coalesce at https://coalesce.getdbt.com
ABOUT THE TALK: If you squint, the data warehouse-centric analytics stack can look a lot like mainframe-era centralized control: shared distributed compute, some sandbox space if you are lucky, and no equivalent way to work locally. And it's been getting expensive! Yet the major vendors appear to be converging on features. Is this "The End of History" for OLAP? Your laptop is incredibly powerful and tragically underutilized. It has better I/O than servers from just a few years back, and with a solid analytic engine like DuckDB, it can handle surprisingly large workloads.
In this talk Nicholas Ursa will share what makes DuckDB fast and show how far single-node scale-up can take you. He will also show some ideas MotherDuck is testing that blur the line between local and remote systems and shift some of that centralized control back to users.
ABOUT THIS SPEAKER: Nicholas Ursa is an Engineer at MotherDuck. He led data engineering teams at The New York Times and better.com after cutting his teeth in ad tech.
ABOUT DATA COUNCIL: Data Council (https://www.datacouncil.ai/) is a community and conference series that provides data professionals with the learning and networking opportunities they need to grow their careers.
Make sure to subscribe to our channel for the most up-to-date talks from technical professionals on data related topics including data infrastructure, data engineering, ML systems, analytics and AI from top startups and tech companies.
FOLLOW DATA COUNCIL: Twitter: https://twitter.com/DataCouncilAI LinkedIn: https://www.linkedin.com/company/datacouncil-ai/
This talk will make the case that the era of Big Data is over. Now we can stop worrying about data size and focus on how we’re going to use it to make better decisions.
The data behind the graphs shown in this talk comes from Jordan Tigani's analysis of query logs, deal post-mortems, benchmark results (published and unpublished), customer support tickets, customer conversations, service logs, and published blog posts, plus a bit of intuition.
ABOUT THE SPEAKER: Jordan Tigani is co-founder and chief duck-herder at MotherDuck, a startup building a serverless analytics platform based on DuckDB. He helped create Google BigQuery, wrote two books on it, and led first its engineering team and then its product team through its first $1B or so in revenue.
👉 Sign up for our “No BS” Newsletter to get the latest technical data & AI content: https://datacouncil.ai/newsletter
ABOUT DATA COUNCIL: Data Council (https://www.datacouncil.ai/) is a community and conference series that provides data professionals with the learning and networking opportunities they need to grow their careers.
Make sure to subscribe to our channel for the most up-to-date talks from technical professionals on data related topics including data infrastructure, data engineering, ML systems, analytics and AI from top startups and tech companies.
FOLLOW DATA COUNCIL: Twitter: https://twitter.com/DataCouncilAI LinkedIn: https://www.linkedin.com/company/datacouncil-ai/
Over the last decade, Big Data was everywhere. Let's set the record straight on what is and isn't Big Data. We have been consumed by a conversation about data volumes when we should focus more on the immediate task at hand: Simplifying our work.
Some of us may have Big Data, but our quest to derive insights from it is measured in small slices of work that fit on your laptop or in your hand. Easy data is here— let's make the most of it.
📓 Resources Big Data is Dead: https://motherduck.com/blog/big-data-is-dead/ Small Data Manifesto: https://motherduck.com/blog/small-data-manifesto/ Small Data SF: https://www.smalldatasf.com/
➡️ Follow Us LinkedIn: https://linkedin.com/company/motherduck X/Twitter : https://twitter.com/motherduck Blog: https://motherduck.com/blog/
Explore the "Small Data" movement, a counter-narrative to the prevailing big data conference hype. This talk challenges the assumption that data scale is the most important feature of every workload, defining big data as any dataset too large for a single machine. We'll unpack why this distinction is crucial for modern data engineering and analytics, setting the stage for a new perspective on data architecture.
Delve into the history of big data systems, starting with the non-linear hardware costs that plagued early data practitioners. Discover how Google's foundational papers on GFS, MapReduce, and Bigtable led to the creation of Hadoop, fundamentally changing how we scale data processing. We'll break down the "big data tax"—the inherent latency and system complexity overhead required for distributed systems to function, a critical concept for anyone evaluating data platforms.
Learn about the architectural cornerstone of the modern cloud data warehouse: the separation of storage and compute. This design, popularized by systems like Snowflake and Google BigQuery, allows storage to scale almost infinitely while compute resources are provisioned on-demand. Understand how this model paved the way for massive data lakes but also introduced new complexities and cost considerations that are often overlooked.
We examine the cracks appearing in the big data paradigm, especially for OLAP workloads. While systems like Snowflake are still dominant, the rise of powerful alternatives like DuckDB signals a shift. We reveal the hidden costs of big data analytics, exemplified by a petabyte-scale query costing nearly $6,000, and argue that for most use cases, it's too expensive to run computations over massive datasets.
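The arithmetic behind that headline number is simple; here is a back-of-the-envelope sketch, assuming an on-demand rate of roughly $6 per TB scanned (rates vary by vendor and have changed over time):

```python
# Back-of-the-envelope cost of scanning a full petabyte at on-demand warehouse pricing.
# The per-TB rate is an assumption; actual vendor pricing varies and changes over time.
price_per_tb_usd = 6.0    # assumed on-demand rate per TB scanned
data_scanned_tb = 1_000   # one petabyte, expressed in terabytes
print(f"${price_per_tb_usd * data_scanned_tb:,.0f} for a single full-scan query")
```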
The key to efficient data processing isn't your total data size, but the size of your "hot data" or working set. This talk argues that the revenge of the single node is here, as modern hardware can often handle the actual data queried without the overhead of the big data tax. This is a crucial optimization technique for reducing cost and improving performance in any data warehouse.
Discover the core principles for designing systems in a post-big data world. We'll show that since only 1 in 500 users run true big data queries, prioritizing simplicity over premature scaling is key. For low latency, process data close to the user with tools like DuckDB and SQLite. This local-first approach offers a compelling alternative to cloud-centric models, enabling faster, more cost-effective, and innovative data architectures.
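As a taste of that local-first approach, a minimal sketch (the Parquet file name is a placeholder): DuckDB runs in-process on a laptop and queries the file directly, no cluster required.

```python
# Minimal local-first analytics sketch: DuckDB querying a Parquet file in-process.
# "events.parquet" is a placeholder file name.
import duckdb

duckdb.sql("""
    SELECT date_trunc('day', event_time) AS day, count(*) AS events
    FROM 'events.parquet'
    GROUP BY day
    ORDER BY day
""").show()
```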
Slides: https://drive.google.com/file/d/1dSUh_IQgcZSZGOVl6iJ8bdKCmPyMjB41/view?usp=sharing
Speaker: Frances Perry (MotherDuck) Slides: https://blobs.duckdb.org/events/duckcon5/frances-perry-taking-flight-with-interactive-analytics.pdf