
Event

Small Data SF 2024

2024-01-01 – 2024-09-25 · Small Data SF

Activities tracked

3

Main-stage program for Small Data SF featuring 15 talks, a fireside chat, and a closing panel on data minimalism.

Filtering by: Data Engineering

Sessions & talks


Big Data is Dead: Long Live Hot Data 🔥

Over the last decade, Big Data has been everywhere. Let's set the record straight on what is and isn't Big Data. We have been consumed by a conversation about data volumes when we should focus more on the immediate task at hand: simplifying our work.

Some of us may have Big Data, but our quest to derive insights from it is measured in small slices of work that fit on your laptop or in your hand. Easy data is here; let's make the most of it.

📓 Resources
Big Data is Dead: https://motherduck.com/blog/big-data-is-dead/
Small Data Manifesto: https://motherduck.com/blog/small-data-manifesto/
Small Data SF: https://www.smalldatasf.com/



Explore the "Small Data" movement, a counter-narrative to the prevailing big data conference hype. This talk challenges the assumption that data scale is the most important feature of every workload, defining big data as any dataset too large for a single machine. We'll unpack why this distinction is crucial for modern data engineering and analytics, setting the stage for a new perspective on data architecture.

Delve into the history of big data systems, starting with the non-linear hardware costs that plagued early data practitioners. Discover how Google's foundational papers on GFS, MapReduce, and Bigtable led to the creation of Hadoop, fundamentally changing how we scale data processing. We'll break down the "big data tax"—the inherent latency and system complexity overhead required for distributed systems to function, a critical concept for anyone evaluating data platforms.

Learn about the architectural cornerstone of the modern cloud data warehouse: the separation of storage and compute. This design, popularized by systems like Snowflake and Google BigQuery, allows storage to scale almost infinitely while compute resources are provisioned on-demand. Understand how this model paved the way for massive data lakes but also introduced new complexities and cost considerations that are often overlooked.
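
As a rough, hypothetical sketch of that split (not code from the talk), the snippet below runs a local DuckDB process as the compute side against Parquet files held in remote object storage; the URL is a placeholder, not a real dataset.

```python
# Minimal sketch of separated storage and compute: a local DuckDB process
# (compute) scans Parquet files it does not own (storage). The URL below is a
# hypothetical placeholder; point it at any Parquet file reachable over HTTPS/S3.
import duckdb

con = duckdb.connect()               # in-process, provisioned-on-demand compute
con.execute("INSTALL httpfs;")       # extension for reading remote object storage
con.execute("LOAD httpfs;")

rows = con.execute("""
    SELECT count(*) AS row_count
    FROM read_parquet('https://example.com/warehouse/events/2024/*.parquet')
""").fetchone()
print(rows)
```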

We examine the cracks appearing in the big data paradigm, especially for OLAP workloads. While systems like Snowflake are still dominant, the rise of powerful alternatives like DuckDB signals a shift. We reveal the hidden costs of big data analytics, exemplified by a petabyte-scale query costing nearly $6,000, and argue that for most use cases, it's too expensive to run computations over massive datasets.
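
The arithmetic behind a price tag like that is simple enough to show. The sketch below assumes an on-demand rate of roughly $5–6 per terabyte scanned, an illustrative figure rather than any vendor's actual quote.

```python
# Back-of-the-envelope scan cost, assuming an on-demand rate of about $5-6 per TB
# scanned. The rate is an assumption for illustration; real contracts vary widely.
petabytes_scanned = 1
terabytes_scanned = petabytes_scanned * 1024
rate_per_tb = 5.75                                  # assumed $/TB scanned
cost = terabytes_scanned * rate_per_tb
print(f"Scanning {petabytes_scanned} PB at ${rate_per_tb}/TB ≈ ${cost:,.0f}")  # ≈ $5,888
```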

The key to efficient data processing isn't your total data size, but the size of your "hot data" or working set. This talk argues that the revenge of the single node is here, as modern hardware can often handle the actual data queried without the overhead of the big data tax. This is a crucial optimization technique for reducing cost and improving performance in any data warehouse.
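
A small, hypothetical illustration of querying only the hot slice: the file path, table shape, and column names below are invented, but the pattern is the point.

```python
# Sketch of the working-set idea: the event history on disk may be huge, but this
# query reads only three columns, and Parquet row-group statistics let DuckDB skip
# data outside the date range. Paths and column names are hypothetical.
import duckdb

con = duckdb.connect()
top_spenders = con.execute("""
    SELECT user_id, sum(amount) AS spend
    FROM read_parquet('events/*.parquet')              -- full history on disk
    WHERE event_date >= current_date - INTERVAL 7 DAY  -- only the hot slice matters
    GROUP BY user_id
    ORDER BY spend DESC
    LIMIT 20
""").fetchall()
print(top_spenders)
```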

Discover the core principles for designing systems in a post-big data world. We'll show that since only 1 in 500 users run true big data queries, prioritizing simplicity over premature scaling is key. For low latency, process data close to the user with tools like DuckDB and SQLite. This local-first approach offers a compelling alternative to cloud-centric models, enabling faster, more cost-effective, and innovative data architectures.
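
As one hedged example of the local-first idea, the sketch below keeps an embedded SQLite database next to the application, so queries never cross the network; the table and rows are purely illustrative.

```python
# Local-first sketch: an embedded SQLite file beside the application means queries
# never leave the machine. Table and rows are illustrative only.
import sqlite3

con = sqlite3.connect("app.db")      # a single file next to your app
con.execute("CREATE TABLE IF NOT EXISTS events (ts TEXT, action TEXT)")
con.execute("INSERT INTO events VALUES (datetime('now'), 'page_view')")
con.commit()

for action, count in con.execute("SELECT action, count(*) FROM events GROUP BY action"):
    print(action, count)
con.close()
```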

Build Bigger With Small AI: Running Small Models Locally

It's finally possible to bring the awesome power of Large Language Models (LLMs) to your laptop. This talk will explore how to run and leverage small, openly available LLMs to power common tasks involving data, including selecting the right models, practical use cases for running small models, and best practices for deploying small models effectively alongside databases.

Bio: Jeffrey Morgan is the founder of Ollama, an open-source tool for getting up and running with large language models. Prior to founding Ollama, Jeffrey founded Kitematic, which was acquired by Docker and evolved into Docker Desktop. He has previously worked at companies including Docker, Twitter, and Google.


Discover how to run large language models (LLMs) locally using Ollama, the easiest way to get started with small AI models on your Mac, Windows, or Linux machine. Unlike massive cloud-based systems, small open source models are only a few gigabytes, allowing them to run incredibly fast on consumer hardware without network latency. This video explains why these local LLMs are not just scaled-down versions of larger models but powerful tools for developers, offering significant advantages in speed, data privacy, and cost-effectiveness by eliminating hidden cloud provider fees and risks.
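
For readers who want to try this themselves, here is a minimal sketch of calling a locally running model over Ollama's HTTP API. It assumes the Ollama server is running on its default port and that a small model such as gemma2:2b has already been pulled; the model name and prompt are just examples.

```python
# Minimal sketch of querying a local model through Ollama's HTTP API. Assumes
# `ollama serve` is running on the default port and the model has been pulled
# (e.g. `ollama pull gemma2:2b`). Model name and prompt are illustrative.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma2:2b",
        "prompt": "In two sentences, why can small local models beat cloud APIs on latency?",
        "stream": False,
    },
    timeout=120,
)
print(resp.json()["response"])
```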

Learn the most common use case for small models: combining them with your existing factual data to prevent hallucinations. We dive into retrieval augmented generation (RAG), a powerful technique where you augment a model's prompt with information from a local data source. See a practical demo of how to build a vector store from simple text files and connect it to a model like Gemma 2B, enabling you to query your own data using natural language for fast, accurate, and context-aware responses.
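
The live demo isn't reproduced here, but the toy sketch below follows the same RAG shape under stated assumptions: a local embedding model (e.g. nomic-embed-text) and a small chat model served by Ollama, with two hard-coded text chunks standing in for your own files.

```python
# Toy retrieval-augmented generation loop against a local Ollama server. Assumes
# an embedding model (`ollama pull nomic-embed-text`) and a chat model
# (`ollama pull gemma2:2b`) are available; the documents are illustrative.
import math
import requests

OLLAMA = "http://localhost:11434"

def embed(text: str) -> list[float]:
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    return r.json()["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# 1. Build a tiny in-memory "vector store" from plain text chunks.
docs = [
    "Refunds are processed within 5 business days.",
    "Support is available Monday through Friday, 9am-5pm PT.",
]
store = [(doc, embed(doc)) for doc in docs]

# 2. Retrieve the chunk most similar to the question.
question = "How long do refunds take?"
q_vec = embed(question)
context, _ = max(store, key=lambda item: cosine(q_vec, item[1]))

# 3. Augment the prompt with the retrieved context and ask a small chat model.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
answer = requests.post(f"{OLLAMA}/api/generate",
                       json={"model": "gemma2:2b", "prompt": prompt, "stream": False})
print(answer.json()["response"])
```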

Explore the next frontier of local AI with small agents and tool calling, a new feature that empowers models to interact with external tools. This guide demonstrates how an LLM can autonomously decide to query a DuckDB database, write the correct SQL, and use the retrieved data to answer your questions. This advanced tutorial shows you how to connect small models directly to your data engineering workflows, moving beyond simple chat to create intelligent, data-driven applications.
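
Ollama's full tool-calling interface isn't shown here; instead, this stripped-down, hypothetical sketch captures the core loop described above: the model drafts SQL, the program runs it against DuckDB, and the result is handed back to the model to answer with. Table contents, model name, and prompts are all illustrative.

```python
# Simplified "small agent" loop: a local model writes SQL, we execute it against
# DuckDB, and the query result flows back into the model's answer. Everything
# here (table, model name, prompts) is illustrative.
import duckdb
import requests

OLLAMA = "http://localhost:11434/api/generate"
MODEL = "gemma2:2b"                      # any small local model will do

# A tiny in-memory table standing in for a real warehouse.
con = duckdb.connect()
con.execute("""
    CREATE TABLE orders AS
    SELECT * FROM (VALUES
        (1, 'widget', 19.99),
        (2, 'gadget', 24.50),
        (3, 'widget', 19.99)
    ) AS t(order_id, product, amount)
""")

def ask(prompt: str) -> str:
    r = requests.post(OLLAMA, json={"model": MODEL, "prompt": prompt, "stream": False})
    return r.json()["response"]

question = "What is total revenue per product?"

# Step 1: the model writes the SQL (crude cleanup in case it wraps it in backticks).
sql = ask(
    "Write a single DuckDB SQL query, with no commentary, against a table "
    "orders(order_id, product, amount) that answers: " + question
).strip().strip("`")

# Step 2: execute the "tool call" and feed the result back to the model.
rows = con.execute(sql).fetchall()
print(ask(f"Question: {question}\nQuery result: {rows}\nAnswer in one sentence."))
```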

Get started with practical applications for small models today, from building internal help desks to streamlining engineering tasks like code review. This video highlights how small and large models can work together effectively and shows that open-source models are rapidly catching up to their cloud-scale counterparts. There has never been a better time for developers and data analysts to harness the power of local AI.

Think Inside the Box: Constraints Drive Data Warehousing Innovation

For a Head of Data or a one-person data team, keeping the lights on for the business while running all things data-related as efficiently as possible is no small feat. This talk will focus on tactics and strategies to manage within and around constraints, including monetary costs, time and resources, and data volumes.

📓 Resources
Big Data is Dead: https://motherduck.com/blog/big-data-is-dead/
Small Data Manifesto: https://motherduck.com/blog/small-data-manifesto/
Why Small Data?: https://benn.substack.com/p/is-excel-...
Small Data SF: https://www.smalldatasf.com/



Learn how your data team can drive innovation and maximize ROI by embracing constraints, drawing inspiration from SpaceX's revolutionary cost-effective approach. This video challenges the "abundance mindset" prevalent in the modern data stack, where easily scalable cloud data warehouses and a surplus of tools often lead to unmanageable data models and underutilized dashboards. We explore a focused data strategy for extracting maximum value from small data, shifting the paradigm from "more data" to more impact.

To maximize value, data teams must move beyond being order-takers and practice strategic stakeholder management. Discover how to use frameworks like the stakeholder engagement matrix to prioritize high-impact business leaders and align your work with core business goals. This involves speaking the language of business growth models, not technical jargon about data pipelines or orchestration, ensuring your data engineering efforts resonate with key decision-makers and directly contribute to revenue-generating activities.

Embracing constraints is key to innovation and effective data project management. We introduce the Iron Triangle—a fundamental engineering concept balancing scope, cost, and time—as a powerful tool for planning data projects and having transparent conversations with the business. By treating constraints not as limitations but as opportunities, data engineers and analysts can deliver higher-quality data products without succumbing to scope creep or uncontrolled costs.

A critical component of this strategy is understanding the Total Cost of Ownership (TCO), which goes far beyond initial compute costs to include ongoing maintenance, downtime, and the risk of vendor pricing changes. Learn how modern, efficient tools like DuckDB and MotherDuck are designed for cost containment from the ground up, enabling teams to build scalable, cost-effective data platforms. By making the true cost of data requests visible, you can foster accountability and make smarter architectural choices. Ultimately, this guide provides a blueprint for resisting data stack bloat and turning cost and constraints into your greatest assets for innovation.
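
To make the TCO framing concrete, the sketch below adds up a set of purely hypothetical annual cost lines; the numbers are placeholders, and the point is only that the visible compute bill is one slice of a larger total.

```python
# Every figure below is a made-up placeholder chosen to show the shape of the
# calculation, not a benchmark of any vendor or team.
annual_costs = {
    "compute_and_storage": 60_000,   # the line item most budgets stop at
    "maintenance_effort":  40_000,   # engineer time keeping pipelines alive
    "downtime_impact":     15_000,   # productivity and trust lost to outages
    "pricing_risk_buffer": 10_000,   # cushion for vendor price changes
}
tco = sum(annual_costs.values())
visible_share = annual_costs["compute_and_storage"] / tco
print(f"Total cost of ownership: ${tco:,} per year "
      f"(the visible compute bill is only {visible_share:.0%} of it)")
```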