talk-data.com talk-data.com

Topic

Data Streaming

realtime event_processing data_flow

739

tagged

Activity Trend

70 peak/qtr
2020-Q1 2026-Q1

Activities

739 activities · Newest first

“Last week was a great year in GenAI,” jokes Mark Ramsey—and it’s a great philosophy to have as LLM tools especially continue to evolve at such a rapid rate. This week, you’ll get to hear my fun and insightful chat with Mark from Ramsey International about the world of large language models (LLMs) and how we make useful UXs out of them in the enterprise. 

Mark shared some fascinating insights about using a company’s website information (data) as a place to pilot a LLM project, avoiding privacy landmines, and how re-ranking of models leads to better LLM response accuracy. We also talked about the importance of real human testing to ensure LLM chatbots and AI tools truly delight users. From amusing anecdotes about the spinning beach ball on macOS to envisioning a future where AI-driven chat interfaces outshine traditional BI tools, this episode is packed with forward-looking ideas and a touch of humor.

Highlights/ Skip to:

(0:50) Why is the world of GenAI evolving so fast? (4:20) How Mark thinks about UX in an LLM application (8:11) How Mark defines “Specialized GenAI?” (12:42) Mark’s consulting work with GenAI / LLMs these days (17:29) How GenAI can help the healthcare industry (30:23) Uncovering users’ true feelings about LLM applications (35:02) Are UIs moving backwards as models progress forward? (40:53) How will GenAI impact data and analytics teams? (44:51) Will LLMs be able to consistently leverage RAG and produce proper SQL? (51:04) Where can find more from Mark and Ramsey International

Quotes from Today’s Episode “With [GenAI], we have a solution that we’ve built to try to help organizations, and build workflows. We have a workflow that we can run and ask the same question [to a variety of GenAI models] and see how similar the answers are. Depending on the complexity of the question, you can see a lot of variability between the models… [and] we can also run the same question against the different versions of the model and see how it’s improved. Folks want a human-like experience interacting with these models.. [and] if the model can start responding in just a few seconds, that gives you much more of a conversational type of experience.” - Mark Ramsey (2:38) “[People] don’t understand when you interact [with GenAI tools] and it brings tokens back in that streaming fashion, you’re actually seeing inside the brain of the model. Every token it produces is then displayed on the screen, and it gives you that typewriter experience back in the day. If someone has to wait, and all you’re seeing is a logo spinning, from a UX experience standpoint… people feel like the model is much faster if it just starts to produce those results in that streaming fashion. I think in a design, it’s extremely important to take advantage of that [...] as opposed to waiting to the end and delivering the results some models support that, and other models don’t.”- Mark Ramsey (4:35) "All of the data that’s on the website is public information. We’ve done work with several organizations on quickly taking the data that’s on their website, packaging it up into a vector database, and making that be the source for questions that their customers can ask. [Organizations] publish a lot of information on their websites, but people really struggle to get to it. We’ve seen a lot of interest in vectorizing website data, making it available, and having a chat interface for the customer. The customer can ask questions, and it will take them directly to the answer, and then they can use the website as the source information.” - Mark Ramsey (14:04) “I’m not skeptical at all. I’ve changed much of my [AI chatbot searches] to Perplexity, and I think it’s doing a pretty fantastic job overall in terms of quality. It’s returning an answer with citations, so you have a sense of where it’s sourcing the information from. I think it’s important from a user experience perspective. This is a replacement for broken search, as I really don’t want to read all the web pages and PDFs you have that might be about my chiropractic care query to answer my actual [healthcare] question.” - Brian O’Neill (19:22)

“We’ve all had great experience with customer service, and we’ve all had situations where the customer service was quite poor, and we’re going to have that same thing as we begin to [release more] chatbots. We need to make sure we try to alleviate having those bad experiences, and have an exit. If someone is running into a situation where they’d rather talk to a live person, have that ability to route them to someone else. That’s why the robustness of the model is extremely important in the implementation… and right now, organizations like OpenAI and Anthropic are significantly better at that [human-like] experience.” - Mark Ramsey (23:46) "There’s two aspects of these models: the training aspect and then using the model to answer questions. I recommend to organizations to always augment their content and don’t just use the training data. You’ll still get that human-like experience that’s built into the model, but you’ll eliminate the hallucinations. If you have a model that has been set up correctly, you shouldn’t have to ask questions in a funky way to get answers.” - Mark Ramsey (39:11) “People need to understand GenAI is not a predictive algorithm. It is not able to run predictions, it struggles with some math, so that is not the focus for these models. What’s interesting is that you can use the model as a step to get you [the answers]. A lot of the models now support functions… when you ask a question about something that is in a database, it actually uses its knowledge about the schema of the database. It can build the query, run the query to get the data back, and then once it has the data, it can reformat the data into something that is a good response back." - Mark Ramsey (42:02)

Links Mark on LinkedIn Ramsey International Email: mark [at] ramsey.international Ramsey International's YouTube Channel

Streaming Databases

Real-time applications are becoming the norm today. But building a model that works properly requires real-time data from the source, in-flight stream processing, and low latency serving of its analytics. With this practical book, data engineers, data architects, and data analysts will learn how to use streaming databases to build real-time solutions. Authors Hubert Dulay and Ralph M. Debusmann take you through streaming database fundamentals, including how these databases reduce infrastructure for real-time solutions. You'll learn the difference between streaming databases, stream processing, and real-time online analytical processing (OLAP) databases. And you'll discover when to use push queries versus pull queries, and how to serve synchronous and asynchronous data emanating from streaming databases. This guide helps you: Explore stream processing and streaming databases Learn how to build a real-time solution with a streaming database Understand how to construct materialized views from any number of streams Learn how to serve synchronous and asynchronous data Get started building low-complexity streaming solutions with minimal setup

Airflow is often used for running data pipelines, which themselves connect with other services through the provider system. However, it is also increasingly used as an engine under-the-hood for other projects building on top of the DAG primitive. For example, Cosmos is a framework for automatically transforming dbt DAGs into Airflow DAGs, so that users can supplement the developer experience of dbt with the power of Airflow. This session dives into how a select group of these frameworks (Cosmos, Meltano, Chronon) use Airflow as an engine for orchestrating complex workflows their systems depend on. In particular, we will discuss ways that we’ve increased Airflow performance to meet application-specific demands (high-task-count Cosmos DAGs, streaming jobs in Chronon), new Airflow features that will evolve how these frameworks use Airflow under the hood (DAG versioning, dataset integrations), and paths we see these projects taking over the next few years as Airflow grows. Airflow is not just a DAG platform, it’s an application platform!

This usecase shows how we deal with data of different varieties from different sources. Each source sends data in different layout, timings, structures, location patterns sizes. The goal is to process the files within SLA and send them out. This a complex multi step processing pipeline that involves multiple spark jobs, api based integrations with microservices, resolving unique ids, deduplication and filtering. Note that this is an event driven system, but not a streaming data system. The files are of gigabyte scale, and each day the data being processed is of terabyte scale. We will be talking about how to make DAG creation and business logic building a “low-code no-code process” so that non technical analysts can write business logic and light developers can deploy DAGs without much manual effort. Every aspect is either source specific or source-agnostic configuration driven. Airflow was chosen to enable easy DAG building, scaling, monitoring, troubleshooting and rerunning.

Databricks LakeFlow: A Unified, Intelligent Solution for Data Engineering. Presented by Bilal Aslam

Speaker: Bilal Aslam, Sr. Director of Product Management, Databricks

Bilal explains that everything starts with good data and outlines the three steps to good data including, ingesting, transforming and orchestrating your data. Then Bilal announces Databricks LakeFlow - a unified solution for data engineering. With LakeFlow you can ingest data from databases, enterprise apps and cloud sources, transform it in batch and real-time streaming, and confidently deploy and operate in production. Includes a live demo of Databricks LakeFlow.

To learn more about Databricks LakeFlow, see the announcement blog post: https://www.databricks.com/blog/introducing-databricks-lakeflow

Databricks Certified Associate Developer for Apache Spark Using Python

This book serves as the ultimate preparation for aspiring Databricks Certified Associate Developers specializing in Apache Spark. Deep dive into Spark's components, its applications, and exam techniques to achieve certification and expand your practical skills in big data processing and real-time analytics using Python. What this Book will help me do Deeply understand Apache Spark's core architecture for building big data applications. Write optimized SQL queries and leverage Spark DataFrame API for efficient data manipulation. Apply advanced Spark functions, including UDFs, to solve complex data engineering tasks. Use Spark Streaming capabilities to implement real-time and near-real-time processing solutions. Get hands-on preparation for the certification exam with mock tests and practice questions. Author(s) Saba Shah is a seasoned data engineer with extensive experience working at Databricks and leading data science teams. With her in-depth knowledge of big data applications and Spark, she delivers clear, actionable insights in this book. Her approach emphasizes practical learning and real-world applications. Who is it for? This book is ideal for data professionals such as engineers and analysts aiming to achieve Databricks certification. It is particularly helpful for individuals with moderate Python proficiency who are keen to understand Spark from scratch. If you're transitioning into big data roles, this guide prepares you comprehensively.

Summary

Streaming data processing enables new categories of data products and analytics. Unfortunately, reasoning about stream processing engines is complex and lacks sufficient tooling. To address this shortcoming Datorios created an observability platform for Flink that brings visibility to the internals of this popular stream processing system. In this episode Ronen Korman and Stav Elkayam discuss how the increased understanding provided by purpose built observability improves the usefulness of Flink.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management This episode is supported by Code Comments, an original podcast from Red Hat. As someone who listens to the Data Engineering Podcast, you know that the road from tool selection to production readiness is anything but smooth or straight. In Code Comments, host Jamie Parker, Red Hatter and experienced engineer, shares the journey of technologists from across the industry and their hard-won lessons in implementing new technologies. I listened to the recent episode "Transforming Your Database" and appreciated the valuable advice on how to approach the selection and integration of new databases in applications and the impact on team dynamics. There are 3 seasons of great episodes and new ones landing everywhere you listen to podcasts. Search for "Code Commentst" in your podcast player or go to dataengineeringpodcast.com/codecomments today to subscribe. My thanks to the team at Code Comments for their support. Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Ronen Korman and Stav Elkayam about pulling back the curtain on your real-time data streams by bringing intuitive observability to Flink streams

Interview

Introduction How did you get involved in the area of data management? Can you describe what Datorios is and the story behind it? Data observability has been gaining adoption for a number of years now, with a large focus on data warehouses. What are some of the unique challenges posed by Flink?

How much of the complexity is due to the nature of streaming data vs. the architectural realities of Flink?

How has the lack of visibility into the flow of data in Flink impacted the ways that teams think about where/when/how to apply it? How have the requirements of generative AI shifted the demand for streaming data systems?

What role does Flink play in the architecture of generative AI systems?

Can you describe how Datorios is implemented?

How has the design and goals of Datorios changed since you first started working on it?

How much of the Datorios architecture and functionality is specific to Flink and how are you thinking about its potential application to other streaming platforms? Can you describe how Datorios is used in a day-to-day workflow for someone building streaming applications on Flink? What are the most interesting, innovative, or unexpected ways that you have seen Datorios used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Datorios? When is Datorios the wrong choice? What do you have planned for the future of Datorios?

Contact Info

Ronen

LinkedIn

Stav

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to

Data Engineering with Databricks Cookbook

In "Data Engineering with Databricks Cookbook," you'll learn how to efficiently build and manage data pipelines using Apache Spark, Delta Lake, and Databricks. This recipe-based guide offers techniques to transform, optimize, and orchestrate your data workflows. What this Book will help me do Master Apache Spark for data ingestion, transformation, and analysis. Learn to optimize data processing and improve query performance with Delta Lake. Manage streaming data processing with Spark Structured Streaming capabilities. Implement DataOps and DevOps workflows tailored for Databricks. Enforce data governance policies using Unity Catalog for scalable solutions. Author(s) Pulkit Chadha, the author of this book, is a Senior Solutions Architect at Databricks. With extensive experience in data engineering and big data applications, he brings practical insights into implementing modern data solutions. His educational writings focus on empowering data professionals with actionable knowledge. Who is it for? This book is ideal for data engineers, data scientists, and analysts who want to deepen their knowledge in managing and transforming large datasets. Readers should have an intermediate understanding of SQL, Python programming, and basic data architecture concepts. It is especially well-suited for professionals working with Databricks or similar cloud-based data platforms.

Kafka Streams in Action, Second Edition

Everything you need to implement stream processing on Apache KafkaⓇ using Kafka Streams and the kqsIDB event streaming database. Kafka Streams in Action, Second Edition guides you through setting up and maintaining your streaming processing with Kafka. Inside, you’ll find comprehensive coverage of not only Kafka Streams, but the entire toolbox you’ll need for effective streaming—from the components of the Kafka ecosystem, to Producer and Consumer clients, Connect, and Schema Registry. In Kafka Streams in Action, Second Edition you’ll learn how to: Design streaming applications in Kafka Streams with the KStream and the Processor API Integrate external systems with Kafka Connect Enforce data compatibility with Schema Registry Build applications that respond immediately to events in either Kafka Streams or ksqlDB Craft materialized views over streams with ksqlDB This totally revised new edition of Kafka Streams in Action has been expanded to cover more of the Kafka platform used for building event-based applications. You’ll also find full coverage of ksqlDB, an event streaming database that makes it a snap to create applications that respond immediately to events, such as real-time push and pull updates. About the Technology Enterprise applications need to handle thousands—even millions—of data events every day. With an intuitive API and flawless reliability, the lightweight Kafka Streams library has earned a spot at the center of these systems. Kafka Streams provides exactly the power and simplicity you need to manage real-time event processing or microservices messaging. About the Book Kafka Streams in Action, Second Edition teaches you how to create event streaming applications on the amazing Apache Kafka platform. This thoroughly revised new edition now covers a wider range of streaming architectures and includes data integration with Kafka Connect. As you go, you’ll explore real-world examples that introduce components and brokers, schema management, and the other essentials. Along the way, you’ll pick up practical techniques for blending Kafka with Spring, low-level control of processors and state stores, storing event data with ksqlDB, and testing streaming applications. What's Inside Design efficient streaming applications Integrate external systems with Kafka Connect Enforce data compatibility with Schema Registry About the Reader For Java developers. No knowledge of Kafka or streaming applications required. About the Author Bill Bejeck is a Confluent engineer and a Kafka Streams contributor with over 15 years of software development experience. Bill is also a committer on the Apache KafkaⓇ project. Quotes Comprehensive streaming data applications are only a few years away from becoming the reality, and this book is the guide the industry has been waiting for to move beyond the hype. - Adi Polak, Director, Developer Experience Engineering, Confluent Covers all the key aspects of building applications with Kafka Streams. Whether you are getting started with stream processing or have already built Kafka Streams applications, it is an essential resource. - Mickael Maison, Principal Software Engineer, Red Hat Serves as both a learning and a resource guide, offering a perfect blend of ‘how-to’ and ‘why-to.’ Even if you have been using Kafka Streams for many years, I highly recommend this book. - Neil Buesing, CTO & Co-founder, Kinetic Edge

Any time a Netflix member sits down, reclines in their chair and turns on their TV to Netflix, there's a moment of truth. It's an opportunity to deliver a spectacular service with amazing quality of experience. Misses, errors, or high latency that prevent individuals from streaming, as a result of ISP configuration changes, code deployment, or catastrophic fallback, result in an impact on how our service is perceived. This talk will go over how we measure the quality of experience for our members and how we work to develop new metrics when we have additional offerings like live streaming and cloud gaming.

Apache Iceberg: The Definitive Guide

Traditional data architecture patterns are severely limited. To use these patterns, you have to ETL data into each tool—a cost-prohibitive process for making warehouse features available to all of your data. The lack of flexibility with these patterns requires you to lock into a set of priority tools and formats, which creates data silos and data drift. This practical book shows you a better way. Apache Iceberg provides the capabilities, performance, scalability, and savings that fulfill the promise of an open data lakehouse. By following the lessons in this book, you'll be able to achieve interactive, batch, machine learning, and streaming analytics with this high-performance open source format. Authors Tomer Shiran, Jason Hughes, and Alex Merced from Dremio show you how to get started with Iceberg. With this book, you'll learn: The architecture of Apache Iceberg tables What happens under the hood when you perform operations on Iceberg tables How to further optimize Iceberg tables for maximum performance How to use Iceberg with popular data engines such as Apache Spark, Apache Flink, and Dremio Discover why Apache Iceberg is a foundational technology for implementing an open data lakehouse.

Take the next step in your AI/ML journey with streaming data. Learn to deploy and manage complete ML pipelines to run inference and predictions, classify images, run remote inference calls, build a custom model handler, and much more with the latest innovations in Dataflow ML. Learn how Spotify leveraged Dataflow for large-scale generation of ML podcast previews and how they plan to keep pushing the boundaries of what’s possible with data engineering and data science to build better experiences for their customers and creators.

Click the blue “Learn more” button above to tap into special offers designed to help you implement what you are learning at Google Cloud Next 25.

Accessing mission-critical data in a nonintrusive fashion will be critical for enabling operational analytics, and with the evolution of generative AI, enterprises are building RAG-based gen AI applications that require access to operational data. Datastream is a simple, serverless data-streaming platform that organizes the ingesting, processing, and analyzing operational data to support AI/ML and RAG apps. Experts from RocketMoney and Intuit Mailchimp will share how they’re using Datastream to solve for Operational Analytics and beyond.

Click the blue “Learn more” button above to tap into special offers designed to help you implement what you are learning at Google Cloud Next 25.

We're in the a Cambrian Explosion of data architectures. In the last two years, dozens of vendors have each championed their own version of ‘the modern data architecture solution’, all claiming to be the future of IT in a data-driven enterprise. The sheer volume of architectures is daunting: Streaming data platforms, data lakes, structured/semi-structured/unstructured data, cloud data warehouses supporting external tables and federated query processing, lakehouses, data fabric, and layers of federated query platforms that offer virtual views of data. All claim to support the building of data products.

No surprise that customers are confused as to which option to choose. 

However, key changes have emerged including much broader support for open table formats such as Apache Iceberg, Apache Hudi and Delta Lake in many other vendor data platforms. In addition, we have seen significant new milestones in extending the ISO SQL Standard to support new kinds of analytics in general purpose SQL. Also, AI has also advanced to work across any type of data. 

What does this all mean for data management? What is the impact of this on analytical data platforms and what does it mean for customers? What opportunities does this evolution open up for tools vendors whose data foundation is reliant on other vendor database management systems and data platforms? This session looks at this evolution, helping vendors and IT professionals alike realise the potential of what’s now possible and how they can exploit it for competitive advantage.

Cloud Run is Google Cloud's serverless runtime. It’s the simplest way to deploy a website or web API or perform streaming and batch data processing. In this session, we’ll cover what's new for two major areas of Cloud Run: Enterprise architectures and streamlined application management.

Click the blue “Learn more” button above to tap into special offers designed to help you implement what you are learning at Google Cloud Next 25.

Businesses everywhere have the opportunity to drive transformational impact by leveraging streaming data to make decisions and build experiences that delight users. In this session you will learn how MercadoLibre processes tens of billions of messages across thousands of applications to drive business impact. You will also learn about exciting new product announcements ranging from native ingest capabilities in Cloud Pub/Sub to new efficiency features in Dataflow to support for OSS technologies.

Click the blue “Learn more” button above to tap into special offers designed to help you implement what you are learning at Google Cloud Next 25.