talk-data.com

Topic: Data Streaming

Tags: realtime, event_processing, data_flow

739 tagged

Activity Trend: peak of 70 activities per quarter (2020-Q1 to 2026-Q1)

Activities

739 activities · Newest first

In this episode, I had the pleasure of speaking with Ken Pickering, VP of Engineering at Going, about the intricacies of streaming data into a Trino and Iceberg lakehouse. Ken shared his journey from product engineering to becoming deeply involved in data-centric roles, highlighting his experiences in ecommerce and InsurTech. At Going, Ken leads the data platform team, focusing on finding travel deals for consumers, a task that involves handling massive volumes of flight data and event stream information.

Ken explained the dual approach of passive and active search strategies used by Going to manage the vast data landscape. Passive search involves aggregating data from global distribution systems, while active search is more transactional, querying specific flight prices. This approach helps Going sift through approximately 50 petabytes of data annually to identify the best travel deals.

We delved into the technical architecture supporting these operations, including the use of Confluent for data streaming, Starburst Galaxy for transformation, and Databricks for modeling. Ken emphasized the importance of an open lakehouse architecture, which allows for flexibility and scalability as the business grows.
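As a rough illustration of that stack, here is a minimal sketch in Python: publish a flight-price event to a Kafka topic and then query the Iceberg table it eventually lands in through Trino. The topic name, table name, and connection details are illustrative assumptions, not Going's actual configuration.

```python
# Hypothetical sketch of the stack described above: publish a flight-price
# event to Kafka (Confluent), then query the Iceberg table it lands in
# through Trino. Topic, table, and connection details are illustrative only.
import json

from confluent_kafka import Producer
import trino

# Produce a pricing event onto the streaming layer.
producer = Producer({"bootstrap.servers": "localhost:9092"})
event = {"origin": "BOS", "destination": "LIS", "price_usd": 412.0,
         "observed_at": "2024-10-01T12:00:00Z"}
producer.produce("flight-prices", value=json.dumps(event).encode("utf-8"))
producer.flush()

# Query the lakehouse table that a downstream connector writes into.
conn = trino.dbapi.connect(host="localhost", port=8080, user="analyst",
                           catalog="iceberg", schema="travel")
cur = conn.cursor()
cur.execute("""
    SELECT origin, destination, min(price_usd) AS best_price
    FROM flight_prices
    GROUP BY origin, destination
""")
for row in cur.fetchall():
    print(row)
```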

Ken also discussed the composition of Going's engineering and data teams, highlighting the collaborative nature of their work and the reliance on vendor tooling to streamline operations. He shared insights into the challenges and strategies of managing data life cycles, ensuring data quality, and maintaining uptime for consumer-facing applications.

Throughout our conversation, Ken provided a glimpse into the future of Going's data architecture, including potential expansions into other travel modes and the integration of large language models for enhanced customer interaction. This episode offers a comprehensive look at the complexities and innovations in building a data-driven travel advisory service.

Summary: In this episode of the Data Engineering Podcast, the creators of Feldera talk about their incremental compute engine designed for continuous computation of data, machine learning, and AI workloads. The discussion covers the concept of incremental computation, the origins of Feldera, and its unique ability to handle both streaming and batch data seamlessly. The guests explore Feldera's architecture, applications in real-time machine learning and AI, and challenges in educating users about incremental computation. They also discuss the balance between open-source and enterprise offerings, and the broader implications of incremental computation for the future of data management, predicting a shift towards unified systems that handle both batch and streaming data efficiently.
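To make the idea concrete, here is a minimal plain-Python sketch of incremental computation. It is not Feldera's actual API; it only illustrates the behavior that DBSP formalizes: an aggregate is maintained by applying deltas rather than recomputing over the full dataset, so the same code path serves batch loads and streaming updates.

```python
# Conceptual illustration of incremental computation (not Feldera's API):
# maintain a running aggregate by applying deltas rather than
# recomputing over the full history on every new batch of events.
from collections import defaultdict

totals = defaultdict(float)   # incremental state: revenue per customer

def apply_delta(delta):
    """Apply a batch of (customer, amount) changes; a negative amount is a
    retraction, which is how incremental engines represent updates."""
    for customer, amount in delta:
        totals[customer] += amount

# "Batch" is just one big delta and "streaming" is many small ones --
# the unification the episode discusses.
apply_delta([("acme", 100.0), ("globex", 40.0)])   # initial load
apply_delta([("acme", 25.0)])                      # streaming update
apply_delta([("globex", -40.0)])                   # retraction
print(dict(totals))   # {'acme': 125.0, 'globex': 0.0}
```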

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Imagine catching data issues before they snowball into bigger problems. That's what Datafold's new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it's maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? Learn more at dataengineeringpodcast.com/datafold today!

As a listener of the Data Engineering Podcast you clearly care about data and how it affects your organization and the world. For even more perspective on the ways that data impacts everything around us you should listen to Data Citizens® Dialogues, the forward-thinking podcast from the folks at Collibra. You'll get further insights from industry leaders, innovators, and executives in the world's largest companies on the topics that are top of mind for everyone. They address questions around AI governance, data sharing, and working at global scale. In particular I appreciate the ability to hear about the challenges that enterprise scale businesses are tackling in this fast-moving field. While data is shaping our world, Data Citizens Dialogues is shaping the conversation. Subscribe to Data Citizens Dialogues on Apple, Spotify, Youtube, or wherever you get your podcasts.

Your host is Tobias Macey and today I'm interviewing Leonid Ryzhyk, Lalith Suresh, and Mihai Budiu about Feldera, an incremental compute engine for continuous computation of data, ML, and AI workloads.

Interview

• Introduction

• Can you describe what Feldera is and the story behind it?

• DBSP (the theory behind Feldera) has won multiple awards from the database research community. Can you explain what it is and how it solves the incremental computation problem?

• Depending on which angle you look at it, Feldera has attributes of data warehouses, federated query engines, and stream processors. What are the unique use cases that Feldera is designed to address?

• In what situations would you replace another technology with Feldera?

• When is it an additive technology?

• Can you describe the architecture of Feldera?

• How have the design and scope evolved since you first started working on it?

• What are the state storage interfaces available in Feldera?

• What are the opportunities for integrating with or building on top of open table formats like Iceberg, Lance, Hudi, etc.?

• Can you describe a typical workflow for an engineer building with Feldera?

• You advertise Feldera's utility in ML and AI use cases in addition to data management. What are the features that make it conducive to those applications?

• What is your philosophy toward community growth and engagement with the open source aspects of Feldera, and how are you balancing that with sustainability of the project and business?

• What are the most interesting, innovative, or unexpected ways that you have seen Feldera used?

• What are the most interesting, unexpected, or challenging lessons that

Delta Lake: The Definitive Guide

Ready to simplify the process of building data lakehouses and data pipelines at scale? In this practical guide, learn how Delta Lake is helping data engineers, data scientists, and data analysts overcome key data reliability challenges with modern data engineering and management techniques. Authors Denny Lee, Tristen Wentling, Scott Haines, and Prashanth Babu (with contributions from Delta Lake maintainer R. Tyler Croy) share expert insights on all things Delta Lake, including how to run batch and streaming jobs concurrently and accelerate the usability of your data. You'll also uncover how ACID transactions bring reliability to data lakehouses at scale. This book helps you:

• Understand key data reliability challenges and how Delta Lake solves them

• Explain the critical role of Delta transaction logs as a single source of truth

• Learn the Delta Lake ecosystem with technologies like Apache Flink, Kafka, and Trino

• Architect data lakehouses with the medallion architecture

• Optimize Delta Lake performance with features like deletion vectors and liquid clustering
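For a small taste of the transaction-log behavior the book covers, here is a minimal sketch using the deltalake (delta-rs) Python bindings rather than Spark; the table path and columns are illustrative assumptions, not examples from the book.

```python
# A minimal sketch of Delta Lake's ACID transaction log, using the
# deltalake (delta-rs) Python bindings; path and columns are illustrative.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

path = "/tmp/events_delta"

# Each write commits an atomic transaction to the _delta_log directory.
write_deltalake(path, pd.DataFrame({"id": [1, 2], "status": ["new", "new"]}),
                mode="overwrite")
write_deltalake(path, pd.DataFrame({"id": [3], "status": ["new"]}),
                mode="append")

dt = DeltaTable(path)
print(dt.version())    # 1 -- two committed transactions (versions 0 and 1)
print(dt.to_pandas())  # a consistent snapshot of the table
print(dt.history())    # the transaction log acting as a single source of truth
```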

Coalesce 2024: Mixed model arts: The convergence of data modeling across apps, analytics, and AI

For decades, siloed data modeling has been the norm across applications, analytics, and machine learning/AI. However, AI, streaming data, and "shifting left" are changing data modeling, making siloed approaches insufficient for the diverse world of data use cases. Today's practitioners must possess an end-to-end understanding of the myriad techniques for modeling data throughout the data lifecycle. This presentation covers "mixed model arts," which advocates converging existing data modeling methods and innovating new ones.

Speaker: Joe Reis, Author, Nerd Herd

Read the blog to learn about the latest dbt Cloud features announced at Coalesce, designed to help organizations embrace analytics best practices at scale: https://www.getdbt.com/blog/coalesce-2024-product-announcements

Data Engineering for Machine Learning Pipelines: From Python Libraries to ML Pipelines and Cloud Platforms

This book covers modern data engineering functions and important Python libraries, to help you develop state-of-the-art ML pipelines and integration code. The book begins by explaining data analytics and transformation, delving into the Pandas library, its capabilities, and nuances. It then explores emerging libraries such as Polars and CuDF, providing insights into GPU-based computing and cutting-edge data manipulation techniques. The text discusses the importance of data validation in engineering processes, introducing tools such as Great Expectations and Pandera to ensure data quality and reliability. The book delves into API design and development, with a specific focus on leveraging the power of FastAPI. It covers authentication, authorization, and real-world applications, enabling you to construct efficient and secure APIs using FastAPI. Also explored is concurrency in data engineering, examining Dask's capabilities from basic setup to crafting advanced machine learning pipelines. The book includes development and delivery of data engineering pipelines using leading cloud platforms such as AWS, Google Cloud, and Microsoft Azure. The concluding chapters concentrate on real-time and streaming data engineering pipelines, emphasizing Apache Kafka and workflow orchestration in data engineering. Workflow tools such as Airflow and Prefect are introduced to seamlessly manage and automate complex data workflows.

What sets this book apart is its blend of theoretical knowledge and practical application, a structured path from basic to advanced concepts, and insights into using state-of-the-art tools. With this book, you gain access to cutting-edge techniques and insights that are reshaping the industry. This book is not just an educational tool; it is a career catalyst and an investment in your future as a data engineering expert, poised to meet the challenges of today's data-driven world.

What You Will Learn

• Elevate your data wrangling jobs by utilizing the power of both CPU and GPU computing, and learn to process data using Pandas 2.0, Polars, and CuDF at unprecedented speeds

• Design data validation pipelines, construct efficient data service APIs, develop real-time streaming pipelines, and master the art of workflow orchestration to streamline your engineering projects

• Leverage concurrent programming to develop machine learning pipelines and get hands-on experience in development and deployment of machine learning pipelines across AWS, GCP, and Azure

Who This Book Is For

Data analysts, data engineers, data scientists, machine learning engineers, and MLOps specialists
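For a flavor of the data-validation material mentioned above, here is a minimal sketch using Pandera; the schema, column names, and checks are illustrative assumptions rather than examples taken from the book.

```python
# A short sketch of a data-validation step with Pandera; the schema and
# column names are illustrative assumptions.
import pandas as pd
import pandera as pa

orders_schema = pa.DataFrameSchema({
    "order_id": pa.Column(int, pa.Check.gt(0), unique=True),
    "amount": pa.Column(float, pa.Check.ge(0.0)),
    "country": pa.Column(str, pa.Check.isin(["US", "GB", "DE"])),
})

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": [19.99, 5.00, 42.50],
    "country": ["US", "GB", "DE"],
})

# Raises a SchemaError if any check fails, so bad data stops the pipeline early.
validated = orders_schema.validate(orders)
print(validated)
```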

Artificial Intelligence has transitioned from a niche concept to a widespread force shaping the business world's landscape. Streaming and AI integration have emerged as crucial drivers in this digital transformation era, focusing on the dynamic and real-time facets of data flow to generate contextually relevant predictions.

Businesses across diverse sectors increasingly adopt AI technology to optimise operations, stay competitive, and augment user experiences. However, AI's true potential only unfolds when applied to the right data sets, at the right moment, and within the appropriate context. In this session, Italo will discuss how AI and Streaming can work together to provide the latest and freshest data, be it about your customers, your business, or your market.

In the era of AI-driven applications, personalization is paramount. This talk explores the concept of Full RAG (Retrieval-Augmented Generation) and its potential to revolutionize user experiences across industries. We examine four levels of context personalization, from basic recommendations to highly tailored, real-time interactions.

The presentation demonstrates how increasing levels of context - from batch data to streaming and real-time inputs - can dramatically improve AI model outputs. We discuss the challenges of implementing sophisticated context personalization, including data engineering complexities and the need for efficient, scalable solutions.

Introducing the concept of a Context Platform, we showcase how tools like Tecton can simplify the process of building, deploying, and managing personalized context at scale. Through practical examples in travel recommendations, we illustrate how developers can easily create and integrate batch, streaming, and real-time context using simple Python code, enabling more engaging and valuable AI-powered experiences.
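As a hypothetical illustration of those layers, the sketch below assembles batch, streaming, and real-time context into a single prompt in plain Python. It is not Tecton's actual API; the function names, feature names, and values are invented for the example.

```python
# Hypothetical sketch (not Tecton's actual API): layering batch, streaming,
# and real-time context into a prompt for a travel-recommendation LLM.
def batch_context(user_id: str) -> dict:
    # e.g. precomputed offline: loyalty tier, home airport, past trips
    return {"home_airport": "SFO", "loyalty_tier": "gold"}

def streaming_context(user_id: str) -> dict:
    # e.g. aggregated from an event stream over the last hour: recent searches
    return {"recent_searches": ["SFO->NRT", "SFO->HND"]}

def realtime_context(request: dict) -> dict:
    # e.g. computed at request time: device, current page, session intent
    return {"current_page": request["page"]}

def build_prompt(user_id: str, request: dict) -> str:
    # Merge the three context levels, most specific last.
    context = {**batch_context(user_id),
               **streaming_context(user_id),
               **realtime_context(request)}
    return f"Recommend a trip for this user given context: {context}"

print(build_prompt("user-123", {"page": "deals/japan"}))
```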

Join us for an insightful fireside chat featuring Kroo Bank as we dive into the world of data stack modernization within the fintech landscape. Kroo Bank is a rising star in the UK banking sector. They'll sit down with Aiven to share the challenges and strategies they encountered when rethinking their technology stack to scale fast for their customers.

Key topics will include:

• Transitioning to a strongly asynchronous architecture to enhance performance and reliability (a minimal sketch follows this listing).

• Optimizing data infrastructure for better performance and data observability.

• Exploring multi-cloud strategies and advanced streaming technologies.

• Evaluating Aiven's open source services for improved scalability, cost-efficiency, and disaster recovery.

Gain insights into how Kroo Bank plans to navigate the complexities of the digital age, ensuring sustainable growth and innovation in a competitive market!
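For readers unfamiliar with the asynchronous, event-driven pattern referenced in the first bullet, here is a minimal sketch using Python's asyncio with aiokafka; the topic, broker address, and payload are illustrative assumptions, not Kroo's or Aiven's actual setup.

```python
# Minimal sketch of an asynchronous, event-driven service using asyncio and
# aiokafka; topic, broker, and payload are illustrative only.
import asyncio
import json

from aiokafka import AIOKafkaProducer

async def publish_payment_event(payment: dict) -> None:
    producer = AIOKafkaProducer(bootstrap_servers="localhost:9092")
    await producer.start()
    try:
        # Fire the event and return; downstream consumers react asynchronously.
        await producer.send_and_wait("payments", json.dumps(payment).encode("utf-8"))
    finally:
        await producer.stop()

asyncio.run(publish_payment_event({"id": "p-1", "amount_gbp": 25.0}))
```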

For decades, data modeling has been fragmented by use cases: applications, analytics, and machine learning/AI. This leads to data siloing and “throwing data over the wall.”

As AI, streaming data, and "shifting left" change data modeling, these siloed approaches are insufficient for the diverse world of data use cases. Today's practitioners must possess an end-to-end understanding of the myriad techniques for modeling data throughout the data lifecycle. This presentation covers "mixed model arts," which advocates converging existing data modeling methods and innovating new ones.