talk-data.com

Event

Small Data SF 2025

2025-11-04 – 2025-11-06 Small Data SF

Activities tracked

31

Sessions & talks

Showing 26–31 of 31 · Newest first

From Parsing Nightmares to Production: Any Unstructured Input → JSON → MotherDuck in Seconds

2025-11-04
workshop
Upal Saha (bem)

Every sprint consumed by fixing parsers is a sprint spent not shipping product; brittle parsing kills velocity. This workshop is about retiring that cycle so you can move from messy, unstructured inputs to production-ready data in seconds. bem ingests and transforms any unstructured input at any volume — PDFs, emails, Excel, Word, CSV, text, JSON, images (PNG, JPEG, HEIC, HEIF, WebP), HTML, and audio (WAV, MP3, M4A) — into clean JSON instantly via API. With primitives like Transform, Join, Split, Route, and Analyze, you define the exact workflow your product needs. Built-in Evals measure and enforce accuracy automatically so quality doesn’t drop as you scale. Flow outputs straight into MotherDuck so you can go from chaos to query without manual cleanup — and your team can focus on shipping, not scraping.
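As a rough sketch of the end-to-end shape described above (bem's actual API is not reproduced here; the endpoint, response shape, and table names are placeholders), parsed JSON can be landed in MotherDuck with the DuckDB Python client:

```python
import duckdb
import pandas as pd
import requests

# Hypothetical transform call -- this is NOT bem's real API; the endpoint, auth,
# and response shape are placeholders for "unstructured input in, clean JSON out".
with open("invoice.pdf", "rb") as f:
    resp = requests.post(
        "https://api.example.invalid/v1/transform",
        headers={"Authorization": "Bearer <API_KEY>"},
        files={"file": f},
    )
records = resp.json()["records"]  # assumed: a list of clean, structured rows

# Land the structured rows in MotherDuck; "md:" is DuckDB's MotherDuck DSN
# (requires a motherduck_token in the environment).
df = pd.DataFrame(records)
con = duckdb.connect("md:my_db")
con.execute("CREATE OR REPLACE TABLE invoices AS SELECT * FROM df")
print(con.sql("SELECT count(*) FROM invoices"))
```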

From Zero to Query: Building Your First Serverless Lakehouse with DuckLake

2025-11-04
workshop
Jacob Matson (MotherDuck)

The lakehouse promised to unify our data, but popular formats can feel bloated and hard to use for most real-world workloads. If you've ever felt that the complexity and operational overhead of "Big Data" tools are overkill, you're not alone. What if your lakehouse could be simple, fast, and maybe even a little fun? Enter DuckLake, the native lakehouse format, managed on MotherDuck. It delivers the powerful features you need, like ACID transactions, time travel, and schema evolution, without the heavyweight baggage. This approach truly makes massive data sets feel like Small Data. This workshop is a practical, step-by-step walkthrough for the data practitioner. We'll get straight to the point and show you how to build a fully functional, serverless lakehouse from scratch. You will learn:
The Architecture: We’ll explore how DuckLake's design choices make it fundamentally simpler and faster for analytical queries compared to its JVM-based cousins.
The Workflow: Through hands-on examples, you'll create a DuckLake table, perform atomic updates, and use time travel—all with the simple SQL you already know.
The MotherDuck Advantage: Discover how the serverless platform makes it easy to manage, share, and query your DuckLake tables, enabling a seamless hybrid workflow between your laptop and the cloud.
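A minimal sketch of the kind of SQL the workshop walks through, run via the DuckDB Python client; the DuckLake attach string, data path, and snapshot version here are illustrative and may differ from the MotherDuck-managed setup used in the session:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake; LOAD ducklake;")

# Attach a local DuckLake catalog (a metadata file plus a directory of data files).
con.execute("ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 'lake_files/')")

# Plain SQL: create a table, mutate it atomically, then look back in time.
con.execute("CREATE TABLE lake.trips AS SELECT 1 AS id, 'SFO' AS origin")
con.execute("UPDATE lake.trips SET origin = 'OAK' WHERE id = 1")
print(con.sql("SELECT * FROM lake.trips AT (VERSION => 1)"))  # an earlier snapshot
```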

Just-in-Time Insights with Estuary: Real-Time Data Streaming Made Simple

2025-11-04
workshop
Zulfikar Qureshi (Estuary)

Gain a clear understanding of Estuary and its role in real-time data integration. The session will begin with an overview of the platform and how it works, then move into the distinctive advantages that set Estuary apart in today’s data landscape. From there, you’ll explore practical use cases that demonstrate how organizations are leveraging real-time data to drive meaningful outcomes. We’ll close by examining why Estuary has become the leading choice for loading data into MotherDuck, highlighting the speed, reliability, and simplicity it delivers. Gain hands-on experience with Estuary by completing a guided lab exercise:
Setting up a source connection and capturing data in real time.
Configuring a MotherDuck connection and materializing the data.
Moving live, streaming data end-to-end.
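Estuary's own capture and materialization specs aren't reproduced here, but once a MotherDuck materialization is running, checking the continuously updated table is an ordinary query. A small sketch with the DuckDB Python client (the database, table, and column names below are placeholders):

```python
import duckdb

# "md:" is DuckDB's MotherDuck connection shorthand (needs a motherduck_token).
con = duckdb.connect("md:analytics")

# Estuary keeps the destination table up to date, so each run reflects the latest stream.
print(con.sql("""
    SELECT event_type, count(*) AS events, max(captured_at) AS most_recent
    FROM live_events          -- placeholder name for the materialized table
    GROUP BY event_type
    ORDER BY events DESC
"""))
```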

Keep it Simple and Scalable: pythonic Extract, Load, Transform (ELT) using dltHub

2025-11-04
workshop
Elvis Kahoro (Chalk) , Brian Douglas (Continue) , Thierry Jean (dltHub)

Get ready to ingest data and transform it into ready-to-use datasets using Python. We'll share a no-nonsense approach for developing and testing data connectors and transformations locally. Moving to production will be a matter of tweaking your configuration. In the end, you get a simple dataset interface to build dashboards & applications, train predictive models, or create agentic workflows. This workshop includes two guest speakers. Brian will teach how to leverage AI IDEs, MCP servers, and LLM scaffolding to create ingestion pipelines. Elvis will show how to interactively define transformations and data quality checks.
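A minimal sketch of the local-first workflow described above, using dlt's Python API (the resource, sample rows, and dataset names are illustrative):

```python
import dlt

@dlt.resource(table_name="users", write_disposition="merge", primary_key="id")
def users():
    # A real connector would page through an API; static rows keep the sketch local.
    yield [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]

# Develop and test against local DuckDB...
pipeline = dlt.pipeline(
    pipeline_name="demo_elt",
    destination="duckdb",
    dataset_name="raw",
)
print(pipeline.run(users()))

# ...then moving to production is largely a configuration tweak,
# e.g. destination="motherduck" with credentials supplied via config or env vars.
```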

Open Data Science Agent

2025-11-04
workshop
Zain Hasan (Together.AI)

Learn to build an autonomous data science agent from scratch using open-source models and modern AI tools. This hands-on workshop will guide you through implementing a ReAct-based agent that can perform end-to-end data analysis tasks, from data cleaning to model training, using natural language reasoning and Python code generation. We'll explore the CodeAct framework, where the agent "thinks" through problems and then generates executable Python code as actions. You'll discover how to safely execute AI-generated code using Together Code Interpreter, creating a modular and maintainable system that can handle complex analytical workflows. Perfect for data scientists, ML engineers, and developers interested in agentic AI, this workshop combines practical implementation with best practices for building reasoning-driven AI assistants. By the end, you'll have a working data science agent and understand the fundamentals of agent architecture design. What you'll learn:
ReAct framework implementation
Safe code execution in AI systems
Agent evaluation and optimization techniques
Building transparent, "hackable" AI agents
No advanced AI background required, just familiarity with Python and data science concepts.
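A compressed sketch of the reason-act-observe loop this kind of agent runs, using an OpenAI-style chat call against Together's Python SDK (the model name is illustrative, and the sandboxed executor is a placeholder for Together Code Interpreter, whose client API isn't shown here):

```python
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

SYSTEM = (
    "You are a data-science agent. Think step by step, then reply with a single "
    "Python code block as your action. Say DONE when the task is finished."
)

def run_sandboxed(code: str) -> str:
    # Placeholder for Together Code Interpreter; a real agent would extract the
    # code block first and must never exec() model output outside a sandbox.
    return "<observation from sandboxed execution>"

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": "Load data.csv, clean nulls, and report row counts."},
]

for _ in range(5):  # cap the reason -> act -> observe loop
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct-Turbo",  # illustrative open model
        messages=messages,
    )
    reply = resp.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    if "DONE" in reply:
        break
    observation = run_sandboxed(reply)  # execute the proposed action, feed result back
    messages.append({"role": "user", "content": f"Observation:\n{observation}"})
```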

Stop Measuring LLM Accuracy, Start Building Context

2025-11-04
workshop

Everyone's trying to make LLMs "accurate." But the real challenge isn't accuracy — it's context. We'll explore why traditional approaches like eval suites or synthetic question sets fall short, and how successful AI systems are built instead through compounding context over time. Hex enables a new workflow for conversational analytics that grows smarter with every interaction. With Hex's Notebook Agent and Threads, business users define the questions that matter while data teams refine, audit, and operationalize them into durable, trusted workflows. In this model, "tests" aren't written in isolation by data teams — they're defined by the business and operationalized through data workflows. The result is a living system of context — not a static set of prompts or tests — that evolves alongside your organization. Join us for a candid discussion on what's working in production AI systems, and get hands-on building context-aware analytical workflows in Hex!