Processing large JSON files without running out of memory

2025-12-10 · PyData Boston 2025 Watch

talk

by Itamar Turner-Trauring

Python

If you need to process a large JSON file in Python, it’s very easy to run out of memory while loading the data, leading to a super-slow run time or out-of-memory crashes. In this talk you'll learn:

How to measure memory usage.
Why loading JSON takes a lot of memory.
Four different ways to reduce memory usage when loading large JSON files.

Move fast, save more with MongoDB-compatible workloads on DocumentDB

2025-11-19 · Microsoft Ignite 2025 Watch

breakout

by Patty Chow (Microsoft) , Khelan Modi (Microsoft) , Gurvinder Singh (The Kraft Heinz Company)

Azure Linux MongoDB

DocumentDB, the open-source MongoDB-compatible document database now part of the Linux Foundation, helps you innovate faster and save more. Customers like Kraft Heinz move fast with a JSON-native model, reduce ops with turnkey scaling and updates, and secure workloads with enterprise-grade protection and an E2E Azure SLA. Delivered as a fully managed service with support for hybrid and multicloud, Azure DocumentDB keeps you moving faster while crushing costs at enterprise scale.

SQL Server 2025: The AI-ready enterprise database

2025-11-18 · Microsoft Ignite 2025 Watch

breakout

by Bob Ward (Azure Data) , Sirjad Parakkat (Ivanti) , Abhinav Tiwari (Ivanti)

AI/ML Analytics API Fabric SQL

SQL Server 2025 redefines what's possible for the enterprise data platform. With developer-first features and seamless integration with analytics and AI models, SQL Server 2025 accelerates AI innovation using the data you already own. Build modern apps with native JSON and REST APIs and harness AI with built-in vector search. Increase application availability with optimized locking and use Fabric mirroring for near real-time analytics. Join us to see why this is the most advanced SQL Server.

Wrangling Internet-scale Image Datasets

2025-11-07 · PyData Seattle 2025 Watch

talk

by Nicholas Merchant , Carlos Garcia Jurado Suarez

Data Quality Parquet

Building and curating datasets at internet scale is both powerful and messy. At Irreverent Labs, we recently released Re-LAION-Caption19M, a 19-million–image dataset with improved captions, alongside a companion arXiv paper. Behind the scenes, the project involved wrangling terabytes of raw data and designing pipelines that could produce a research-quality dataset while remaining resilient, efficient, and reproducible. In this talk, we’ll share some of the practical lessons we learned while engineering data at this scale. Topics include: strategies for ensuring data quality through a mix of automated metrics and human inspection; why building file manifests pays off when dealing with millions of files; effective use of Parquet, WDS and JSONL for metadata and intermediate results; pipeline patterns that favor parallel processing and fault tolerance; and how logging and dashboards can turn long-running jobs from opaque into observable. Whether you’re working with images, text, or any other massive dataset, these patterns and pitfalls may help you design pipelines that are more robust, maintainable, and researcher-friendly.

Model Context Protocol: Principles and Practice

2025-09-26 · PyData Amsterdam 2025 Watch

talk

by Fabio Lipreri , Gabriele Orlandi

GenAI Git GitHub LLM Cyber Security SQL Data Streaming postgresql

Large‑language‑model agents are only as useful as the context and tools they can reach.

Anthropic’s Model Context Protocol (MCP) proposes a universal, bidirectional interface that turns every external system—SQL databases, Slack, Git, web browsers, even your local file‑system—into first‑class “context providers.”

In just 30 minutes we’ll step from high‑level buzzwords to hands‑on engineering details:

How MCP’s JSON‑RPC message format, streaming channels, and version‑negotiation work under the hood.
Why per‑tool sandboxing via isolated client processes hardens security (and what happens when an LLM tries rm ‑rf /).
Techniques for hierarchical context retrieval that stretch a model’s effective window beyond token limits.
Real‑world patterns for accessing multiple tools—Postgres, Slack, GitHub—and plugging MCP into GenAI applications.

Expect code snippets and lessons from early adoption.

You’ll leave ready to wire your own services into any MCP‑aware model and level‑up your GenAI applications—without the N×M integration nightmare.

Spark 4.0 and Delta 4.0 For Streaming Data

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Bryce Bartmann (Shell)

AI/ML Delta Python Spark Data Streaming

Real-time data is one of the most important datasets for any Data and AI Platform across any industry. Spark 4.0 and Delta 4.0 include new features that make ingestion and querying of real-time data better than ever before. Features such as: Python custom data sources for simple ingestion of streaming and batch time series data sources using Spark Variant types for managing variable data types and json payloads that are common in the real time domain Delta liquid clustering for simple data clustering without the overhead or complexity of partitioning In this presentation you will learn how data teams can leverage these latest features to build industry-leading, real-time data products using Spark and Delta and includes real world examples and metrics of the improvements they make in performance and processing of data in the real time space.

Advanced JSON Schema handing and Event Demuxing

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Dattatraya Walake (Databricks) , Murali Talluri (Databricks)

JSON Schema

This session explores advanced JSON Schema handing(inference and evolving), and event DemuxingTopics include: How from_json is currently used today and its challenges. How to use Variant for rapidly changing schema. How from_json in Lakeflow Declarative Pipelines with primed schema helps simplify schema handling. Demultiplexing patterns for scalable stream processing. Simply event Demuxing with Lakeflow Declarative Pipelines.

AWS re:Invent 2024 - Deep dive into Amazon DocumentDB and its innovations (DAT324)

2024-12-05 · AWS re:Invent 2024 Watch

video

by Cody Allen (AWS) , Vin Yu (AWS)

Agile/Scrum API AWS Cloud Computing MongoDB

Amazon DocumentDB (with MongoDB compatibility) is a fully managed native JSON document database that makes it easy and cost-effective to operate critical document workloads at virtually any scale without managing infrastructure. In this session, take a deep dive into the most exciting new features Amazon DocumentDB offers including global cluster failover, global cluster switchover, compression, and the latest query APIs. Learn how the implementation of these features in your organization can improve resilience, performance, and the effectiveness of your applications.

Learn more: AWS re:Invent: https://go.aws/reinvent. More AWS events: https://go.aws/3kss9CP

Subscribe: More AWS videos: http://bit.ly/2O3zS75 More AWS events videos: http://bit.ly/316g9t4

About AWS: Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts. AWS is the world's most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.

AWSreInvent #AWSreInvent2024

Roberto Freato: Green Must Be Convenient

2023-12-04 · DATA MINER Big Data Europe Conference 2020 Watch

video

by Roberto Freato

Azure Big Data Data Lake SQL

Join Roberto Freato in his session 'Green Must Be Convenient' as he unveils the evolution of database storage practices, demonstrating how Azure SQL Database efficiently manages vast amounts of JSON objects. Discover how this bridges the gap between raw data in a data lake and the relational view used by analytical applications. 🌱💾 #DatabaseStorage #azuresql

✨ H I G H L I G H T S ✨

🙌 A huge shoutout to all the incredible participants who made Big Data Conference Europe 2023 in Vilnius, Lithuania, from November 21-24, an absolute triumph! 🎉 Your attendance and active participation were instrumental in making this event so special. 🌍

Don't forget to check out the session recordings from the conference to relive the valuable insights and knowledge shared! 📽️

Once again, THANK YOU for playing a pivotal role in the success of Big Data Conference Europe 2023. 🚀 See you next year for another unforgettable conference! 📅 #BigDataConference #SeeYouNextYear

Scaling dbt models for CDC on large databases - Coalesce 2023

2023-10-25 · dbt Coalesce 2023 Watch

video

by Santona Tuli (Upsolver)

dbt

Unlike transforming staged data to marts, ingesting data into staging requires robustness to data volume and type changes, schema evolution, and data drift. Especially when performing change data capture (CDC) on large databases (~100 tables to a database), we’ll ideally reinforce our dbt models with automatic:

Mapping of dynamic columns and data types between the source and the target stag
evolution of stage table schemas at pace with incoming data, including for nested data structures
parsing and flattening of any arrays and JSON structs in the data.

Manually performing these tasks for data at scale is a tall order due to the many permutations with which CDC data can deviate. Waiting to implement them in mart transformation models is potentially detrimental to the business, as well as doesn’t reduce the complexity. Santona Tuli shares learnings from integrating dbt Core into high-scale data ingestion workloads, including trade-offs between ease-of-use and scale.

Speaker: Santona Tuli, Head/Director of Data, Upsolver

Register for Coalesce at https://coalesce.getdbt.com

Using JSON schema to set the (dbt) stage for product analytics - Coalesce 2023

2023-10-24 · dbt Coalesce 2023 Watch

video

by Greg Clunies (Surfline)

Analytics Data Contracts Data Quality dbt JSON Schema

Surfline uses Segment to collect product analytics events to understand how surfers use their forecasts and live surf cameras across 9000+ surf spots worldwide. An open source tool was developed to define and manage product analytics event schemas using JSON schema which are used to build dbt staging models for all events.

With this solution, the data team has more time to build intermediate and mart models in dbt, knowing that our staging layer fully reflects Surfline’s product analytics events. This presentation is a real-life example on how schemas (or data contracts) can be used as a medium to build consensus, enforce standards, improve data quality, and speed up the dbt workflow for product analytics.

Speaker: Greg Clunies, Senior Analytics Engineer, Surfline

Register for Coalesce at https://coalesce.getdbt.com/

Streaming Schema Drift Discovery and Controlled Mitigation

2023-07-26 · Databricks DATA + AI Summit 2023 Watch

video

by Alexander Vanadio

Databricks Delta Data Streaming

When creating streaming workloads with Databricks, it can sometimes be difficult to capture and understand the current structure of your source data. For example, what happens if you are ingesting JSON events from a vendor, and the keys are very sparsely populated, or contain dynamic content? Ideally, data engineers want to "lock in" a target schema in order to minimize complexity and maximize performance for known access patterns. What do you do when your data sources just don't cooperate with that vision? The first step is to quantify how far your current source data is drifting from your established Delta table. But how?

This session will demonstrate a way to capture and visual drift across all your streaming tables. The next question is, "Now that I see all of the data I'm missing, how do I selectively promote some of these keys into DataFrame columns?" The second half of this session will demonstrate precisely how to do a schema migration with minimal job downtime.

Talk by: Alexander Vanadio

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Data Extraction and Sharing Via The Delta Sharing Protocol

2023-07-26 · Databricks DATA + AI Summit 2023 Watch

video

by Roger Dunn

Cloud Computing Data Lakehouse Databricks Delta Parquet SQL

The Delta Sharing open protocol for secure sharing and distribution of Lakehouse data is designed to reduce friction in getting data to users. Delivering custom data solutions from this protocol further leverages the technical investment committed to your Delta Lake infrastructure. There are key design and computational concepts unique to Delta Sharing to know when undertaking development. And there are pitfalls and hazards to avoid when delivering modern cloud data to traditional data platforms and users.

In this session, we introduce Delta Sharing Protocol development and examine our journey and the lessons learned while creating the Delta Sharing Excel Add-in. We will demonstrate scenarios of overfetching, underfetching, and interpretation of types. We will suggest methods to overcome these development challenges. The session will combine live demonstrations that exercise the Delta Sharing REST protocol with detailed analysis of the responses. The demonstrations will elaborate on optional capabilities of the protocol’s query mechanism, and how they are used and interpreted in real-life scenarios. As a reference baseline for data professionals, the Delta Sharing exercises will be framed relative to SQL counterparts. Specific attention will be paid to how they differ, and how Delta Sharing’s Change Data Feed (CDF) can power next-generation data architectures. The session will conclude with a survey of available integration solutions for getting the most out of your Delta Sharing environment, including frameworks, connectors, and managed services.

Attendees are encouraged to be familiar with REST, JSON, and modern programming concepts. A working knowledge of Delta Lake, the Parquet file format, and the Delta Sharing Protocol are advised.

Talk by: Roger Dunn

Here’s more to explore: A New Approach to Data Sharing: https://dbricks.co/44eUnT1

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

A Deep Dive into the dbt Manifest | Squarespace

2023-05-11 · Data Council 2023 Watch

video

by Aaron Richter (Squarespace)

AI/ML Analytics Cloud Computing Data Engineering Data Science dbt DWH

ABOUT THE TALK: Ever noticed the manifest.json file that dbt puts into your target folder? This little file contains rich information about your dbt project that enables numerous fun use cases! These include complex deployment configurations, quality enforcement, and streamlined development workflows. This talk will go over what the manifest is and how it is produced, along with case studies of how the manifest is used across the community and in Squarespace’s data pipelines.

ABOUT THE SPEAKER: Aaron Richter is a software developer with a passion for all things data. His work involves making sure data is clean and accessible, and that the tools to access it are at peak performance. Aaron is currently a data engineer at Squarespace, where he supports the company’s analytics platform. Previously, he built the data warehouse at Modernizing Medicine, and worked as a data science advocate at Saturn Cloud.

ABOUT DATA COUNCIL: Data Council (https://www.datacouncil.ai/) is a community and conference series that provides data professionals with the learning and networking opportunities they need to grow their careers.

Make sure to subscribe to our channel for the most up-to-date talks from technical professionals on data related topics including data infrastructure, data engineering, ML systems, analytics and AI from top startups and tech companies.

FOLLOW DATA COUNCIL: Twitter: https://twitter.com/DataCouncilAI LinkedIn: https://www.linkedin.com/company/datacouncil-ai/

Powering Up the Business with a Lakehouse

2022-07-19 · Databricks DATA + AI Summit 2023 Watch

video

CI/CD Data Lakehouse Data Quality Databricks Delta GDPR/CCPA

Within Wehkamp we required a uniform way to provide reliable and on time data to the business, while making this access compliant with GDPR. Unlocking all the data sources that we have scattered across the company and democratize the data access was of the utmost importance, allowing us to empower the business with more, better and faster data.

Focusing on open source technologies, we've built a data platform almost from the ground up that focuses on 3 levels of data curation - bronze, silver and gold - which follows the LakeHouse Architecture. The ingestion into bronze is where the PII fields are pseudonymized, making the use of the data within the delta lake compliant and, since there is no visible user data, it means everyone can use the entire delta lake for exploration and new use cases. Naturally, specific teams are allowed to see some user data that is necessary for their use cases. Besides the standard architecture, we've developed a library that allows us to ingest new data sources by adding a JSON config file with the characteristics. This combined with the ACID transactions that delta provides and the efficient Structured Stream provided through Auto Loader has allowed a small team to maintain 100+ streams with insignificant downtime.

Some other components of this platform are the following: - Alerting to Slack - Data quality checks - CI/CD - Stream processing with the delta engine

The feedback so far has been encouraging, as more and more teams across the company are starting to use the new platform and taking advantage of all its perks. It is still a long time until we get to turn off some of the components of the old data platform, but it has come a long way.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Opening the Floodgates: Enabling Fast, Unmediated End User Access to Trillion-Row Datasets with SQL

2022-07-19 · Databricks DATA + AI Summit 2023 Watch

video

Analytics API ClickHouse Databricks Druid Marketing SQL

Spreadsheets revolutionized IT by giving end users the ability to create their own analytics. Providing direct end user access to trillion-row datasets generated in financial markets or digital marketing is much harder. New SQL data warehouses like ClickHouse and Druid can provide fixed latency with constant cost on very large datasets, which opens up new possibilities.

Our talk walks through recent experience on analytic apps developed by ClickHouse users that enable end users like market traders to develop their own analytics directly off raw data. We’ll cover the following topics.

Characteristics of new open source column databases and how they enable low-latency analytics at constant cost.
Idiomatic ways to validate new apps by building MVPs that support a wide range of queries on source data including storing source JSON, schema design, applying compression on columns, and building indexes for needle-in-a-haystack queries.
Incrementally identifying hotspots and applying easy optimizations to bring query performance into line with long term latency and cost requirements.
Methods of building accessible interfaces, including traditional dashboards, imitating existing APIs that are already known, and creating app-specific visualizations.

We’ll finish by summarizing a few of the benefits we’ve observed and also touch on ways that analytic infrastructure could be improved to make end user access even more productive. The lessons are as general as possible so that they can be applied across a wide range of analytic systems, not just ClickHouse.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

UIMeta: A 10X Faster Cloud-Native Apache Spark History Server

2022-07-19 · Databricks DATA + AI Summit 2023 Watch

video

Cloud Computing Databricks Spark

Spark history server is an essential tool for monitoring, analyzing and optimizing spark jobs.

The original history server is based on Spark event log mechanism. A running Spark job will produce many kinds of events that describe the job's status continuously. All the events are serialized into JSON and appended to a file —— event log. The history server has to replay the event log and rebuild the memory store needed for UI. In a cluster, the history server also needs to periodically scan the event log directory and cache all the files' metadata in memory.

Actually, an event log contains too much redundant info for a history server. A long-running application can bring a huge event log which may cost a lot to maintain and require a long time to replay. In large-scale production, the number of jobs can be large and leads to a heavy burden on history servers. It needs additional development to build a scalable history server service.

In this talk, we want to introduce a new history server based on UIMeta. UIMeta is a wrapper of the KVStore objects needed by a Spark UI. A job will bring a UIMeta log by stagedly serializing UIMeta. An UIMeta log is approximately 10x smaller in size and 10x faster in replaying compared to the original event log file. Benefitting from the good performance, we develop a new stateless history server without a directory scan. Currently, UIMeta Service has taken the place of the original history server and provided service for millions of jobs per day in Bytedance.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Analytics on your analytics, Drizly

2020-12-11 · dbt Coalesce 2020 Watch

video

by Emily Hawkins

Analytics Dagster dbt Looker

Using dbt's metadata on dbt runs (run_results.json) Drizly analytics is able to track, monitor, and alert on its dbt models using Looker to visualize the data. In this video, Emily Hawkins covers how Drizly did this before, using dbt macros and inserts, and how the process was improved using run_results.json in conjunction with Dagster (and teamwork with Fishtown Analytics!)

talk-data.com

JSON

Activity Trend

Top Events

Top Speakers

Processing large JSON files without running out of memory

Move fast, save more with MongoDB-compatible workloads on DocumentDB

SQL Server 2025: The AI-ready enterprise database

Wrangling Internet-scale Image Datasets

Model Context Protocol: Principles and Practice

Spark 4.0 and Delta 4.0 For Streaming Data

Advanced JSON Schema handing and Event Demuxing

AWS re:Invent 2024 - Deep dive into Amazon DocumentDB and its innovations (DAT324)

AWSreInvent #AWSreInvent2024

Roberto Freato: Green Must Be Convenient

Scaling dbt models for CDC on large databases - Coalesce 2023

Using JSON schema to set the (dbt) stage for product analytics - Coalesce 2023

Streaming Schema Drift Discovery and Controlled Mitigation

Data Extraction and Sharing Via The Delta Sharing Protocol

A Deep Dive into the dbt Manifest | Squarespace

Powering Up the Business with a Lakehouse

Opening the Floodgates: Enabling Fast, Unmediated End User Access to Trillion-Row Datasets with SQL

UIMeta: A 10X Faster Cloud-Native Apache Spark History Server

Analytics on your analytics, Drizly