CSV

The Column's the limit: interactive exploration of larger than memory data sets in a notebook with Polars and Buckaroo

2025-12-10 · PyData Boston 2025

talk

by Paddy Mullen

Parquet Polars Rust Data Streaming

Notebooks struggle when data vastly exceeds RAM: pagination hacks, fragile sampling, and surprise OOMs. Buckaroo is a modern data table for notebooks built to quickly make sense of dataframes by providing search, summary stats, and scrolling with every view. This talk reviews how Buckaroo uses out‑of‑core design patterns, viewport streaming, lazy Polars pipelines, batched background stats, and a series cache to make interactive exploration fast and reliable on commodity laptops. We’ll walk through the lifecycle of opening a large Parquet/CSV file: detecting formats, avoiding full materialization, fetching only requested row/column ranges, and throttling UI updates for smoothness. We’ll show how column‑level hashing (via a lightweight Rust extension) enables stable, cache keys so warm loads render the first viewport and stats in under a second. CSV specifics and a practical CSV→Parquet streaming path round out the approach. The ideas are tool‑agnostic and reproducible with the open‑source PyData stack; Buckaroo serves as a concrete reference implementation. You’ll leave with guidelines and snippets to bring these patterns to your own workflows.

From Parsing Nightmares to "Production": Any Unstructured Input → JSON → MotherDuck in Seconds

2025-11-04 · Small Data SF 2025

workshop

by Upal Saha (bem)

API HTML JSON Motherduck

Every sprint consumed by fixing parsers is a sprint spent not shipping product- brittle parsing kills velocity. This workshop is about retiring that cycle so you can move from messy, unstructured inputs to production-ready data in seconds. bem ingests and transforms any unstructured input at any volume — PDFs, emails, Excel, Word, CSV, text, JSON, images (PNG, JPEG, HEIC, HEIF, WebP), HTML, and audio (WAV, MP3, M4A) — into clean JSON instantly via API. With primitives like Transform, Join, Split, Route, and Analyze, you define the exact workflow your product needs. Built-in Evals measure + enforce accuracy automatically so quality doesn’t drop as you scale. Flow outputs straight into MotherDuck so you can go from chaos to query without manual cleanup — and your team can focus on shipping, not scraping.

The No-Upload AI Analyst: Hash, Mask, Redact—AI Analytics Without CSV File Uploads

2025-09-02 · Data & AI with Mukundan | Learn AI by Building Listen

podcast_episode

by Mukundan Sankar

AI/ML Analytics Data Analytics GitHub

AI, data, numbers—without uploads. Hash, mask, and redact PII, then run data analytics locally for time-saving and privacy. In this episode, we build a No-Upload AI Analyst that keeps your PII safe: HMAC SHA-256 hashing, masking, and redaction using policy presets and client-side transforms. We’ll: • Reframe the problem (insights > risk) • Set four hard constraints (no uploads, local preferred, policy presets, human-readable audit) • Use rules-first privacy + schema semantics • Walk the 5-step workflow (paste headers → pick preset → set secret → transform → analyze) • Show real-world cases (HIPAA/HITECH-aware analytics, FERPA contexts, product analytics) • Share a checklist + quiz + local Streamlit approach Perfect for data teams in healthcare, finance, education, and privacy-sensitive orgs. Key takeaways Stop uploading customer data. Transform it client-side first.Use HMAC hashing to keep joins without exposing raw emails/IDs.Mask for human-readable UI; redact when you don’t need the field.Ship a data-handling report with every analysis.Run the app locally for maximum privacy.Affiliate note: I record with Riverside (affiliate) and host on RSS.com (affiliate). Links in show notes. Links Blog version: (Free): https://mukundansankar.substack.com/p/the-no-upload-ai-analyst-v4-secure Join the Discussion (comments hub): https://mukundansankar.substack.com/notesTools I use for my Podcast and Affiliate PartnersRecording Partner: Riverside → Sign up here (affiliate)Host Your Podcast: RSS.com (affiliate )Research Tools: Sider.ai (affiliate)Sourcetable AI: Join Here(affiliate)🔗 Connect with Me:Free Email NewsletterWebsite: Data & AI with MukundanGitHub: https://github.com/mukund14Twitter/X: @sankarmukund475LinkedIn: Mukundan SankarYouTube: Subscribe

Automated ESRS-Tagging Pipeline for CSRD Compliance

2025-07-10 · Data Science Retreat Demo Day #42

talk

bert csrd compliance esrs tagging ixbrl pdf xbrl

This project delivers a fully automated software pipeline that converts raw sustainability reports into ESRS-tagged, XBRL-ready disclosures for CSRD compliance. The tool ingests diverse file formats (PDF, iXBRL, CSV), classifies content using a fine-tuned BERT model, validates completeness and consistency against ESRS rules, and exports compliant XBRL packages. By automating what is traditionally a 6–12-week manual process, the tool reduces turnaround to 1–2 days and lowers costs by up to €500K.

How to Migrate From Oracle to Databricks SQL

2025-06-11 · Data + AI Summit 2025 Watch

talk

by Laurent Léturgez (Databricks)

Databricks DWH Oracle PySpark SQL

Migrating your legacy Oracle data warehouse to the Databricks Data Intelligence Platform can accelerate your data modernization journey. In this session, learn the top strategies for completing this data migration. We will cover data type conversion, basic to complex code conversions, validation and reconciliation best practices. Discover the pros and cons of using CSV files to PySpark or using pipelines to Databricks tables. See before-and-after architectures of customers who have migrated, and learn about the benefits they realized.

FHIR-ing Up Healthcare Data : Restructing Healthcare data with FHIR

2025-04-11 · Google Cloud Next '25

session

by Ashwin Shetty (Google)

BigQuery RDBMS

This talk offers a solution to accelerate healthcare innovation by streamlining the conversion and integration of various data formats (HL7 v2, CSV, RDBMS, etc.) into the FHIR standard.

This solution reduces the need for manual mapping allowing for quick conversion of various healthcare data formats into FHIR and significantly reduces the workload of healthcare IT teams. FHIR data is then loaded into Google BigQuery providing a scalable and secure platform for data storage and analysis.

Loading data into Cloud SQL

2025-04-10 · Google Cloud Next '25

session

Cloud Computing Data Modelling GCP SQL

This hands-on lab guides you through importing real-world data from CSV files into a Cloud SQL database. Using a flight dataset from the US Bureau of Transport Statistics, you'll gain hands-on experience with data ingestion and basic analysis. You'll learn to create a Cloud SQL instance and database, effectively import your data, and build a foundational data model using SQL queries.

If you register for a Learning Center lab, please ensure that you sign up for a Google Cloud Skills Boost account for both your work domain and personal email address. You will need to authenticate your account as well (be sure to check your spam folder!). This will ensure you can arrive and access your labs quickly onsite. You can follow this link to sign up!

Loading data into Cloud SQL

2025-04-09 · Google Cloud Next '25

session

Cloud Computing Data Modelling GCP SQL

This hands-on lab guides you through importing real-world data from CSV files into a Cloud SQL database. Using a flight dataset from the US Bureau of Transport Statistics, you'll gain hands-on experience with data ingestion and basic analysis. You'll learn to create a Cloud SQL instance and database, effectively import your data, and build a foundational data model using SQL queries.

If you register for a Learning Center lab, please ensure that you sign up for a Google Cloud Skills Boost account for both your work domain and personal email address. You will need to authenticate your account as well (be sure to check your spam folder!). This will ensure you can arrive and access your labs quickly onsite. You can follow this link to sign up!

CSVs Will Never Die And OneSchema Is Counting On It

2025-01-13 · Data Engineering Podcast Listen

podcast_episode

by Andrew Luo (OneSchema) , Tobias Macey

AI/ML CRM Data Engineering Data Management Datafold Python SQL

Summary In this episode of the Data Engineering Podcast Andrew Luo, CEO of OneSchema, talks about handling CSV data in business operations. Andrew shares his background in data engineering and CRM migration, which led to the creation of OneSchema, a platform designed to automate CSV imports and improve data validation processes. He discusses the challenges of working with CSVs, including inconsistent type representation, lack of schema information, and technical complexities, and explains how OneSchema addresses these issues using multiple CSV parsers and AI for data type inference and validation. Andrew highlights the business case for OneSchema, emphasizing efficiency gains for companies dealing with large volumes of CSV data, and shares plans to expand support for other data formats and integrate AI-driven transformation packs for specific industries.

Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementData migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details. Your host is Tobias Macey and today I'm interviewing Andrew Luo about how OneSchema addresses the headaches of dealing with CSV data for your businessInterview IntroductionHow did you get involved in the area of data management?Despite the years of evolution and improvement in data storage and interchange formats, CSVs are just as prevalent as ever. What are your opinions/theories on why they are so ubiquitous?What are some of the major sources of CSV data for teams that rely on them for business and analytical processes?The most obvious challenge with CSVs is their lack of type information, but they are notorious for having numerous other problems. What are some of the other major challenges involved with using CSVs for data interchange/ingestion?Can you describe what you are building at OneSchema and the story behind it?What are the core problems that you are solving, and for whom?Can you describe how you have architected your platform to be able to manage the variety, volume, and multi-tenancy of data that you process?How have the design and goals of the product changed since you first started working on it?What are some of the major performance issues that you have encountered while dealing with CSV data at scale?What are some of the most surprising things that you have learned about CSVs in the process of building OneSchema?What are the most interesting, innovative, or unexpected ways that you have seen OneSchema used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on OneSchema?When is OneSchema the wrong choice?What do you have planned for the future of OneSchema?Contact Info LinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.Links OneSchemaEDI == Electronic Data InterchangeUTF-8 BOM (Byte Order Mark) CharactersSOAPCSV RFCIcebergSSIS == SQL Server Integration ServicesMS AccessDatafusionJSON SchemaSFTP == Secure File Transfer ProtocolThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

DuckDB: Up and Running

2024-12-12 · O'Reilly Data Science Books O'Reilly Amazon

book

by Wei-Meng Lee

Analytics Data Analytics DuckDB JSON Pandas Parquet Polars Python SQL data data-science data-science-tools

DuckDB, an open source in-process database created for OLAP workloads, provides key advantages over more mainstream OLAP solutions: It's embeddable and optimized for analytics. It also integrates well with Python and is compatible with SQL, giving you the performance and flexibility of SQL right within your Python environment. This handy guide shows you how to get started with this versatile and powerful tool. Author Wei-Meng Lee takes developers and data professionals through DuckDB's primary features and functions, best practices, and practical examples of how you can use DuckDB for a variety of data analytics tasks. You'll also dive into specific topics, including how to import data into DuckDB, work with tables, perform exploratory data analysis, visualize data, perform spatial analysis, and use DuckDB with JSON files, Polars, and JupySQL. Understand the purpose of DuckDB and its main functions Conduct data analytics tasks using DuckDB Integrate DuckDB with pandas, Polars, and JupySQL Use DuckDB to query your data Perform spatial analytics using DuckDB's spatial extension Work with a diverse range of data including Parquet, CSV, and JSON

DuckDB in Action

2024-08-21 · O'Reilly Data Science Books O'Reilly Amazon

book

by Michael Simons , Mark Needham , Michael Hunger

Analytics API Big Data Cloud Computing Data Analytics DuckDB DWH Java JSON Motherduck Neo4j Pandas +8 more

Dive into DuckDB and start processing gigabytes of data with ease—all with no data warehouse. DuckDB is a cutting-edge SQL database that makes it incredibly easy to analyze big data sets right from your laptop. In DuckDB in Action you’ll learn everything you need to know to get the most out of this awesome tool, keep your data secure on prem, and save you hundreds on your cloud bill. From data ingestion to advanced data pipelines, you’ll learn everything you need to get the most out of DuckDB—all through hands-on examples. Open up DuckDB in Action and learn how to: Read and process data from CSV, JSON and Parquet sources both locally and remote Write analytical SQL queries, including aggregations, common table expressions, window functions, special types of joins, and pivot tables Use DuckDB from Python, both with SQL and its "Relational"-API, interacting with databases but also data frames Prepare, ingest and query large datasets Build cloud data pipelines Extend DuckDB with custom functionality Pragmatic and comprehensive, DuckDB in Action introduces the DuckDB database and shows you how to use it to solve common data workflow problems. You won’t need to read through pages of documentation—you’ll learn as you work. Get to grips with DuckDB's unique SQL dialect, learning to seamlessly load, prepare, and analyze data using SQL queries. Extend DuckDB with both Python and built-in tools such as MotherDuck, and gain practical insights into building robust and automated data pipelines. About the Technology DuckDB makes data analytics fast and fun! You don’t need to set up a Spark or run a cloud data warehouse just to process a few hundred gigabytes of data. DuckDB is easily embeddable in any data analytics application, runs on a laptop, and processes data from almost any source, including JSON, CSV, Parquet, SQLite and Postgres. About the Book DuckDB in Action guides you example-by-example from setup, through your first SQL query, to advanced topics like building data pipelines and embedding DuckDB as a local data store for a Streamlit web app. You’ll explore DuckDB’s handy SQL extensions, get to grips with aggregation, analysis, and data without persistence, and use Python to customize DuckDB. A hands-on project accompanies each new topic, so you can see DuckDB in action. What's Inside Prepare, ingest and query large datasets Build cloud data pipelines Extend DuckDB with custom functionality Fast-paced SQL recap: From simple queries to advanced analytics About the Reader For data pros comfortable with Python and CLI tools. About the Authors Mark Needham is a blogger and video creator at @‌LearnDataWithMark. Michael Hunger leads product innovation for the Neo4j graph database. Michael Simons is a Java Champion, author, and Engineer at Neo4j. Quotes I use DuckDB every day, and I still learned a lot about how DuckDB makes things that are hard in most databases easy! - Jordan Tigani, Founder, MotherDuck An excellent resource! Unlocks possibilities for storing, processing, analyzing, and summarizing data at the edge using DuckDB. - Pramod Sadalage, Director, Thoughtworks Clear and accessible. A comprehensive resource for harnessing the power of DuckDB for both novices and experienced professionals. - Qiusheng Wu, Associate Professor, University of Tennessee Excellent! The book all we ducklings have been waiting for! - Gunnar Morling, Decodable

#58 Maximizing Productivity: Bookmarklets, Q Command-Line, RouteLLM, and DuckDB Extensions

2024-07-12 · DataTopics: All Things Data, AI & Tech Listen

podcast_episode

AI/ML AWS Glue DuckDB LLM SQL

Send us a text Welcome to the cozy corner of the tech world where ones and zeros mingle with casual chit-chat. Datatopics Unplugged is your go-to spot for relaxed discussions around tech, news, data, and society. Dive into conversations that should flow as smoothly as your morning coffee (but don't), where industry insights meet laid-back banter. Whether you're a data aficionado or just someone curious about the digital age, pull up a chair, relax, and let's get into the heart of data, unplugged style! Bookmarklet Maker: Discover how to automate tasks with the Bookmarklet Maker, a tool for turning scripts into handy browser bookmarks. RouteLLM Framework: Explore the RouteLLM framework by LMSys and Anyscale, designed to optimize the cost-performance ratio of LLM routers. Learn more about this collaboration at LMSys and Anyscale. Q for SQL on CSV/TSV: Meet Q, a command-line tool that lets you run SQL queries directly on CSV or TSV files, simplifying data exploration from your terminal. DuckDB Community Extensions: Check out the latest updates in DuckDB's community extensions and see how this database system is evolving. Apple Intelligence and AI Maximalism: Explore Apple's AI strategy, their avoidance of chat UIs, risk management with OpenAI, and the shift of compute costs to users. Being Glue: Delve into the challenges of being "Glue" at work. Explore why women are more likely to take on non-promotable work and how this affects career progression and workplace dynamics.

Pandas Workout

2024-06-05 · O'Reilly Data Science Books O'Reilly Amazon

book

by Reuven M. Lerner

Data Science JSON Pandas Python data data-science data-science-tools

Practice makes perfect pandas! Work out your pandas skills against dozens of real-world challenges, each carefully designed to build an intuitive knowledge of essential pandas tasks. In Pandas Workout you’ll learn how to: Clean your data for accurate analysis Work with rows and columns for retrieving and assigning data Handle indexes, including hierarchical indexes Read and write data with a number of common formats, such as CSV and JSON Process and manipulate textual data from within pandas Work with dates and times in pandas Perform aggregate calculations on selected subsets of data Produce attractive and useful visualizations that make your data come alive Pandas Workout hones your pandas skills to a professional-level through two hundred exercises, each designed to strengthen your pandas skills. You’ll test your abilities against common pandas challenges such as importing and exporting, data cleaning, visualization, and performance optimization. Each exercise utilizes a real-world scenario based on real-world data, from tracking the parking tickets in New York City, to working out which country makes the best wines. You’ll soon find your pandas skills becoming second nature—no more trips to StackOverflow for what is now a natural part of your skillset. About the Technology Python’s pandas library can massively reduce the time you spend analyzing, cleaning, exploring, and manipulating data. And the only path to pandas mastery is practice, practice, and, you guessed it, more practice. In this book, Python guru Reuven Lerner is your personal trainer and guide through over 200 exercises guaranteed to boost your pandas skills. About the Book Pandas Workout is a thoughtful collection of practice problems, challenges, and mini-projects designed to build your data analysis skills using Python and pandas. The workouts use realistic data from many sources: the New York taxi fleet, Olympic athletes, SAT scores, oil prices, and more. Each can be completed in ten minutes or less. You’ll explore pandas’ rich functionality for string and date/time handling, complex indexing, and visualization, along with practical tips for every stage of a data analysis project. What's Inside Clean data with less manual labor Retrieving and assigning data Process and manipulate text Calculations on selected data subsets About the Reader For Python programmers and data analysts. About the Author Reuven M. Lerner teaches Python and data science around the world and publishes the “Bamboo Weekly” newsletter. He is the author of Manning’s Python Workout (2020). Quotes A carefully crafted tour through the pandas library, jam-packed with wisdom that will help you become a better pandas user and a better data scientist. - Kevin Markham, Founder of Data School, Creator of pandas in 30 days Will help you apply pandas to real problems and push you to the next level. - Michael Driscoll, RFA Engineering, creator of Teach Me Python The explanations, paired with Reuven’s storytelling and personal tone, make the concepts simple. I’ll never get them wrong again! - Rodrigo Girão Serrão, Python developer and educator The definitive source! - Kiran Anantha, Amazon

Data Science Fundamentals with R, Python, and Open Data

2024-04-16 · O'Reilly Data Science Books O'Reilly Amazon

book

by Marco Cremonini

Computer Science Data Science Python programming-languages software-development

Data Science Fundamentals with R, Python, and Open Data Introduction to essential concepts and techniques of the fundamentals of R and Python needed to start data science projects Organized with a strong focus on open data, Data Science Fundamentals with R, Python, and Open Data discusses concepts, techniques, tools, and first steps to carry out data science projects, with a focus on Python and RStudio, reflecting a clear industry trend emerging towards the integration of the two. The text examines intricacies and inconsistencies often found in real data, explaining how to recognize them and guiding readers through possible solutions, and enables readers to handle real data confidently and apply transformations to reorganize, indexing, aggregate, and elaborate. This book is full of reader interactivity, with a companion website hosting supplementary material including datasets used in the examples and complete running code (R scripts and Jupyter notebooks) of all examples. Exam-style questions are implemented and multiple choice questions to support the readers’ active learning. Each chapter presents one or more case studies. Written by a highly qualified academic, Data Science Fundamentals with R, Python, and Open Data discuss sample topics such as: Data organization and operations on data frames, covering reading CSV dataset and common errors, and slicing, creating, and deleting columns in R Logical conditions and row selection, covering selection of rows with logical condition and operations on dates, strings, and missing values Pivoting operations and wide form-long form transformations, indexing by groups with multiple variables, and indexing by group and aggregations Conditional statements and iterations, multicolumn functions and operations, data frame joins, and handling data in list/dictionary format Data Science Fundamentals with R, Python, and Open Data is a highly accessible learning resource for students from heterogeneous disciplines where Data Science and quantitative, computational methods are gaining popularity, along with hard sciences not closely related to computer science, and medical fields using stochastic and quantitative models.

Graph Algorithms for Data Science

2024-02-26 · O'Reilly Data Science Books O'Reilly Amazon

book

by Tomaz Bratanic

AI/ML Data Science NLP SQL data data-science

Practical methods for analyzing your data with graphs, revealing hidden connections and new insights. Graphs are the natural way to represent and understand connected data. This book explores the most important algorithms and techniques for graphs in data science, with concrete advice on implementation and deployment. You don’t need any graph experience to start benefiting from this insightful guide. These powerful graph algorithms are explained in clear, jargon-free text and illustrations that makes them easy to apply to your own projects. In Graph Algorithms for Data Science you will learn: Labeled-property graph modeling Constructing a graph from structured data such as CSV or SQL NLP techniques to construct a graph from unstructured data Cypher query language syntax to manipulate data and extract insights Social network analysis algorithms like PageRank and community detection How to translate graph structure to a ML model input with node embedding models Using graph features in node classification and link prediction workflows Graph Algorithms for Data Science is a hands-on guide to working with graph-based data in applications like machine learning, fraud detection, and business data analysis. It’s filled with fascinating and fun projects, demonstrating the ins-and-outs of graphs. You’ll gain practical skills by analyzing Twitter, building graphs with NLP techniques, and much more. About the Technology A graph, put simply, is a network of connected data. Graphs are an efficient way to identify and explore the significant relationships naturally occurring within a dataset. This book presents the most important algorithms for graph data science with examples from machine learning, business applications, natural language processing, and more. About the Book Graph Algorithms for Data Science shows you how to construct and analyze graphs from structured and unstructured data. In it, you’ll learn to apply graph algorithms like PageRank, community detection/clustering, and knowledge graph models by putting each new algorithm to work in a hands-on data project. This cutting-edge book also demonstrates how you can create graphs that optimize input for AI models using node embedding. What's Inside Creating knowledge graphs Node classification and link prediction workflows NLP techniques for graph construction About the Reader For data scientists who know machine learning basics. Examples use the Cypher query language, which is explained in the book. About the Author Tomaž Bratanič works at the intersection of graphs and machine learning. Arturo Geigel was the technical editor for this book. Quotes Undoubtedly the quickest route to grasping the practical applications of graph algorithms. Enjoyable and informative, with real-world business context and practical problem-solving. - Roger Yu, Feedzai Brilliantly eases you into graph-based applications. - Sumit Pal, Independent Consultant I highly recommend this book to anyone involved in analyzing large network databases. - Ivan Herreros, talentsconnect Insightful and comprehensive. The author’s expertise is evident. Be prepared for a rewarding journey. - Michal Štefaňák, Volke

How Canadian Football League’s data team runs marketing plays with dbt & RudderStack - Coalesce 2023

2023-10-25 · dbt Coalesce 2023 Watch

video

by Eric Dodds (RudderStack) , Dave Musambi (Canadian Football League)

BI dbt Marketing

Leveraging the power of RudderStack and dbt, Canadian Football League’s (CFL) data team abstracts the complexity of data in the warehouse and provides their marketing team with highly targeted audiences across a large variety of platforms and data sources. During this session we’ll hear how CFL went from manually sharing CSV files to modeling targeted segments, directly focused on OKRs, in their warehouse.

Speakers: Eric Dodds, Head of Product Marketing, RudderStack; Dave Musambi, Sr. Director, Business Intelligence, Canadian Football League

Register for Coalesce at https://coalesce.getdbt.com

Powering Vector Search With Real Time And Incremental Vector Indexes

2023-09-25 · Data Engineering Podcast Listen

podcast_episode

by Louis Brandy , Tobias Macey

AI/ML Analytics BI CI/CD Cloud Computing Data Engineering Data Management Data Quality Datafold dbt Git LLM +7 more

Summary

The rapid growth of machine learning, especially large language models, have led to a commensurate growth in the need to store and compare vectors. In this episode Louis Brandy discusses the applications for vector search capabilities both in and outside of AI, as well as the challenges of maintaining real-time indexes of vector data.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free! If you’re a data person, you probably have to jump between different tools to run queries, build visualizations, write Python, and send around a lot of spreadsheets and CSV files. Hex brings everything together. Its powerful notebook UI lets you analyze data in SQL, Python, or no-code, in any combination, and work together with live multiplayer and version control. And now, Hex’s magical AI tools can generate queries and code, create visualizations, and even kickstart a whole analysis for you – all from natural language prompts. It’s like having an analytics co-pilot built right into where you’re already doing your work. Then, when you’re ready to share, you can use Hex’s drag-and-drop app builder to configure beautiful reports or dashboards that anyone can use. Join the hundreds of data teams like Notion, AllTrails, Loom, Mixpanel and Algolia using Hex every day to make their work more impactful. Sign up today at dataengineeringpodcast.com/hex to get a 30-day free trial of the Hex Team plan! Your host is Tobias Macey and today I'm interviewing Louis Brandy about building vector indexes in real-time for analytics and AI applications

Interview

Introduction How did you get involved in the area of data management? Can you describe what vector search is and how it differs from other search technologies?

What are the technical challenges related to providing vector search? What are the applications for vector search that merit the added complexity?

Vector databases have been gaining a lot of attention recently with the proliferation of LLM applicati

Building Linked Data Products With JSON-LD

2023-09-17 · Data Engineering Podcast Listen

podcast_episode

by Brian Platz , Tobias Macey

AI/ML Analytics BI CI/CD Cloud Computing Data Engineering Data Management Data Quality Datafold dbt Git JSON +6 more

Summary

A significant amount of time in data engineering is dedicated to building connections and semantic meaning around pieces of information. Linked data technologies provide a means of tightly coupling metadata with raw information. In this episode Brian Platz explains how JSON-LD can be used as a shared representation of linked data for building semantic data products.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free! If you’re a data person, you probably have to jump between different tools to run queries, build visualizations, write Python, and send around a lot of spreadsheets and CSV files. Hex brings everything together. Its powerful notebook UI lets you analyze data in SQL, Python, or no-code, in any combination, and work together with live multiplayer and version control. And now, Hex’s magical AI tools can generate queries and code, create visualizations, and even kickstart a whole analysis for you – all from natural language prompts. It’s like having an analytics co-pilot built right into where you’re already doing your work. Then, when you’re ready to share, you can use Hex’s drag-and-drop app builder to configure beautiful reports or dashboards that anyone can use. Join the hundreds of data teams like Notion, AllTrails, Loom, Mixpanel and Algolia using Hex every day to make their work more impactful. Sign up today at dataengineeringpodcast.com/hex to get a 30-day free trial of the Hex Team plan! Your host is Tobias Macey and today I'm interviewing Brian Platz about using JSON-LD for building linked-data products

Interview

Introduction How did you get involved in the area of data management? Can you describe what the term "linked data product" means and some examples of when you might build one?

What is the overlap between knowledge graphs and "linked data products"?

What is JSON-LD?

What are the domains in which it is typically used? How does it assist in developing linked data products?

what are the characterist

50: FREE 5-Step SQL Course & Project

2023-03-15 · Data Career Podcast: Helping You Land a Data Analyst Job FAST Listen

podcast_episode

by Avery Smith

AI/ML Analytics Data Analytics SQL

📤 In this episode, Avery’s going to walk you through how you can teach yourself SQL for FREE with this awesome 5-step course.

🌟 Join the data project club!

“25OFF” to get 25% off (first 50 members).

📊 Come to my next free “How to Land Your First Data Job” training 🏫 Check out my 10-week data analytics bootcamp

Timestamps:

(0:24) - What is SQL?

(1:08) - Step 1: Download Datasets

(2:03) - What is CSV files?

(2:44) - Step 2: Setup SQL environment with the dataset

(3:37) - Step 3: Learn SQL for free with W3Schools

(4:50) - Step 4: Come up w/ probing questions for your data

(6:09) - Step 5: Write up your findings

(7:00) - Project Write-up Platform

Mentioned Links:

Kaggle: https://www.kaggle.com/datasets

bit.io: https://bit.io/

W3Schools: https://www.w3schools.com/sql/

Connect with Avery:

📺 Subscribe on YouTube

🎙Listen to My Podcast

👔 Connect with me on LinkedIn

📸 Instagram

🎵 TikTok

Mentioned in this episode: Join the last cohort of 2025! The LAST cohort of The Data Analytics Accelerator for 2025 kicks off on Monday, December 8th and enrollment is officially open!

To celebrate the end of the year, we’re running a special End-of-Year Sale, where you’ll get: ✅ A discount on your enrollment 🎁 6 bonus gifts, including job listings, interview prep, AI tools + more

If your goal is to land a data job in 2026, this is your chance to get ahead of the competition and start strong.

👉 Join the December Cohort & Claim Your Bonuses: https://DataCareerJumpstart.com/daa https://www.datacareerjumpstart.com/daa

So Fresh and So Clean: Learn How to Build Real-Time Warehouses on Lakehouse

2022-07-19 · Databricks DATA + AI Summit 2023 Watch

video

by Dillon , Franco , Shannon

Cloud Computing Data Lakehouse Databricks DWH XML

Warehouses? Where we are going, we won't need warehouses! Join Dillon, Franco, and Shannon as they take an industry-standard Data Warehouse integration benchmark, called TPC-DI, which is a typical 80s style data warehouse, and bring it into the future. We will review how to implement standard data warehousing practices on Lakehouse, and show you how to deliver optimal price/performance in the cloud and keep your data so fresh and so clean. We will take an assortment of structured, semi-structured, and unstructured data in the form of CSV, TXT, XML, and Fixed-Width files, and transform them warehouse-style into Lakehouse with a historical load and incremental CDC loads.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

talk-data.com

Activity Trend

Top Events

Top Speakers

The Column's the limit: interactive exploration of larger than memory data sets in a notebook with Polars and Buckaroo

From Parsing Nightmares to "Production": Any Unstructured Input → JSON → MotherDuck in Seconds

The No-Upload AI Analyst: Hash, Mask, Redact—AI Analytics Without CSV File Uploads

Automated ESRS-Tagging Pipeline for CSRD Compliance

How to Migrate From Oracle to Databricks SQL

FHIR-ing Up Healthcare Data : Restructing Healthcare data with FHIR

Loading data into Cloud SQL

Loading data into Cloud SQL

CSVs Will Never Die And OneSchema Is Counting On It

DuckDB: Up and Running

DuckDB in Action

#58 Maximizing Productivity: Bookmarklets, Q Command-Line, RouteLLM, and DuckDB Extensions

Pandas Workout

Data Science Fundamentals with R, Python, and Open Data

Graph Algorithms for Data Science

How Canadian Football League’s data team runs marketing plays with dbt & RudderStack - Coalesce 2023

Powering Vector Search With Real Time And Incremental Vector Indexes

Building Linked Data Products With JSON-LD

50: FREE 5-Step SQL Course & Project

So Fresh and So Clean: Learn How to Build Real-Time Warehouses on Lakehouse