Data Streaming

What's new with BigQuery

2024-04-10 · Google Cloud Next '24

session

by Steffen Grimmel (Deutsche Telekom) , Oliver Ratzesberger (Google Cloud) , Irina Farooq (Google Cloud) , Brian Welcker (Google Cloud)

AI/ML BigQuery Cloud Computing GCP GenAI Python Spark SQL

Join this session to learn the latest innovations for BigQuery to support all data, be it structured or unstructured, across multiple and open data formats, and cross-clouds; all workloads, be they Cloud SQL, Spark, or Python; and built-in AI, to supercharge the work of data teams and unlock generative AI across new use cases. Learn how you can take advantage of BigQuery, a single, unified data platform that combines capabilities including data processing, streaming, and governance.

An Introduction to Streaming SQL with Materialize by Marta Paes

2024-04-10 · DATA MINER Big Data Europe Conference 2020 Watch

video

by Marta Paes (Materialize)

AI/ML Big Data Data Science SQL

Big Data Europe: onsite and online on 22-25 November 2022. Learn more about the conference: https://bit.ly/3BlUk9q

Join our next Big Data Europe conference on 22-25 November 2022, where you will be able to learn from global experts giving technical talks and hands-on workshops in the fields of Big Data, High Load, Data Science, Machine Learning and AI. This time, the conference will be held in a hybrid setting, allowing you to attend workshops and listen to expert talks on-site or online.

Real Time Streaming Data from AWS MSK Kafka to Cloudera by Lidor Gerstel

2024-04-10 · DATA MINER Big Data Europe Conference 2020 Watch

video

by Lidor Gerstel (Centerity)

AI/ML AWS Big Data Data Science Kafka

Real-Time Streaming in Any and All Clouds, Hybrid and Beyond by Timothy J Spann

2024-04-10 · DATA MINER Big Data Europe Conference 2020 Watch

video

by Timothy J Spann (Cloudera)

AI/ML Big Data Data Science

Streamlining Entry Into Streaming Analytics with JupyterHub & Apache Flink

2024-03-28 · Data Council Austin 2024 - Day 1 Watch

talk

by Elkhan Dadashov

Analytics Flink

Building a Large-Scale, Streaming-Based Logging and Monitoring Solution with Clojure

2024-03-14 · March 14th, Elixir+Clojure communities invite to BOBKONF pre-event-drinks #99

talk

clojure logging monitoring

Big Data Computing

2024-02-27 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Bishwajeet Kumar Pandey , Tanvir Habib Sardar

Big Data Hadoop Hive NoSQL apache-hive data data-engineering

This book primarily aims to provide an in-depth understanding of recent advances in big data computing technologies, methodologies, and applications, along with introductory details of big data computing models such as Apache Hadoop, MapReduce, Hive, Pig, Mahout, in-memory storage systems, NoSQL databases, and big data streaming services.

Mastering Microsoft Fabric: SAASification of Analytics

2024-02-21 · O'Reilly Data Science Books O'Reilly Amazon

book

by Debananda Ghosh

AI/ML Analytics AWS Azure ADF BI Cloud Computing Data Engineering Data Lakehouse Data Management Data Science DWH +9 more

Learn and explore the capabilities of Microsoft Fabric, the latest evolution in cloud analytics suites. This book will help you understand how users can leverage a Microsoft Office-equivalent experience for performing data management and advanced analytics activity. The book starts with an overview of the analytics evolution from on premises to cloud infrastructure as a service (IaaS), platform as a service (PaaS), and now software as a service (SaaS), and provides an introduction to Microsoft Fabric. You will learn how to provision Microsoft Fabric in your tenant along with the key capabilities of SaaS analytics products and the advantage of using Fabric in the enterprise analytics platform. OneLake and Lakehouse for data engineering are discussed, as well as OneLake for data science. Author Ghosh teaches you about data warehouse offerings inside Microsoft Fabric and the new data integration experience, which brings Azure Data Factory and Power Query Editor of Power BI together in a single platform. Also demonstrated is Real-Time Analytics in Fabric, including capabilities such as Kusto query and database. You will understand how the new event stream feature integrates with OneLake and other computations. You will also learn how to configure the real-time alert capability in a zero-code manner and go through the Power BI experience in the Fabric workspace. Fabric pricing and licensing are also covered. After reading this book, you will understand the capabilities of Microsoft Fabric and its integration with current and upcoming Azure OpenAI capabilities.

What You Will Learn
Build OneLake for all data like OneDrive for Microsoft Office
Leverage shortcuts for cross-cloud data virtualization in Azure and AWS
Understand upcoming OpenAI integration
Discover new event streaming and Kusto query inside Fabric real-time analytics
Utilize seamless tooling for machine learning and data science

Who This Book Is For
Citizen users and experts in the data engineering and data science fields, along with chief AI officers

Tackling Real Time Streaming Data With SQL Using RisingWave

2024-02-04 · Data Engineering Podcast Listen

podcast_episode

by Yingjun Wu (RisingWave Labs) , Tobias Macey

AI/ML Analytics Cloud Computing Dagster Data Engineering Data Lake Data Lakehouse Data Management Delta DWH GitHub Hudi +5 more

Summary

Stream processing systems have long been built with a code-first design, adding SQL as a layer on top of the existing framework. RisingWave is a database engine that was created specifically for stream processing, with S3 as the storage layer. In this episode Yingjun Wu explains how it is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!

Your host is Tobias Macey and today I'm interviewing Yingjun Wu about the RisingWave database and the intricacies of building a stream processing engine on S3.

Interview

Introduction

How did you get involved in the area of data management?

Can you describe what RisingWave is and the story behind it?

There are numerous stream processing engines, near-real-time database engines, streaming SQL systems, etc. What is the specific niche that RisingWave addresses?

What are some of the platforms/architectures that teams are replacing with RisingWave?

What are some of the unique capabilities/use cases that RisingWave provides over other offerings in the current ecosystem?

Can you describe how RisingWave is architected and implemented?

How have the design and goals/scope changed since you first started working on it?

What are the core design philosophies that you rely on to prioritize the ongoing development of the project?

What are the most complex engineering challenges that you have had to address in the creation of RisingWave?

Can you describe a typical workflow for teams that are building on top of RisingWave?

What are the user/developer experience elements that you have prioritized most highly?

What are the situations where RisingWave can/should be a system of record vs. a point-in-time view of data in transit, with a data warehouse/lakehouse as the longitudinal storage and query engine?

What are the most interesting, innovative, or unexpected ways that you have seen RisingWave used?

What are the most interesting, unexpected, or challenging lessons that you have learned while working on RisingWave?

When is RisingWave the wrong choice?

What do you have planned for the future of RisingWave?

Contact Info

yingjunwu on GitHub Personal Website LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows.

Machine Learning and Streaming Data Pipelines, Part I: Definitions and Architecture - Audio Blog

2024-01-10 · Secrets of Data Analytics Leaders Listen

podcast_episode

AI/ML

Many machine learning (ML) use cases center on real-time calculations. This article defines streaming ML and its architectural components. Published at: https://www.eckerson.com/articles/machine-learning-and-streaming-data-pipelines-part-i-definitions-and-architecture

Designing Data Platforms For Fintech Companies

2024-01-01 · Data Engineering Podcast Listen

podcast_episode

by Andrey Korchak (Monite) , Tobias Macey

AI/ML Analytics Cloud Computing Data Engineering Data Governance Data Lake Data Lakehouse Data Management Dataflow Delta Hudi Iceberg +6 more

Summary

Working with financial data requires a high degree of rigor due to the numerous regulations and the risks involved in security breaches. In this episode Andrey Korchak, CTO of fintech startup Monite, discusses the complexities of designing and implementing a data platform in that sector.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack

You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results, all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!

Your host is Tobias Macey and today I'm interviewing Andrey Korchak about how to manage data in a fintech environment.

Interview

Introduction

How did you get involved in the area of data management?

Can you start by summarizing the data challenges that are particular to the fintech ecosystem?

What are the primary sources and types of data that fintech organizations are working with?

What are the business-level capabilities that are dependent on this data?

How do the regulatory and business requirements influence the technology landscape in fintech organizations?

What does a typical build vs. buy decision process look like?

Fraud prediction in e.g. banks is one of the most well-established applications of machine learning in industry. What are some of the other ways that ML plays a part in fintech?

How does that influence the architectural design/capabilities for data platforms in those organizations?

Data governance is a notoriously challenging problem. What are some of the strategies that fintech companies are able to apply to this problem given their regulatory burdens?

What are the most interesting, innovative, or unexpected approaches to data management that you have seen in the fintech sector?

What are the most interesting, unexpected, or challenging lessons that you have learned while working on data in fintech?

What do you have planned for the future of your data capabilities at Monite?

Contact Info

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.

Links

Monite, ISO 27001, Tesseract, GitOps, SWIFT Protocol

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

Starburst:

This episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics.

Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance, allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. dataengineeringpodcast.com/starburst

Rudderstack:

Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack

Materialize:

You shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date.

That is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable, all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing.

Go to materialize.com today and get 2 weeks free!

Architecting a Modern Data Warehouse for Large Enterprises: Build Multi-cloud Modern Distributed Data Warehouses with Azure and AWS

2023-12-27 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Abhishek Mishra , Anjani Kumar , Sanjeev Kumar (Tesa SE)

AWS Azure BI Big Data Cloud Computing Data Governance Data Lake Data Lakehouse Delta DWH Pandas Cyber Security +4 more

Design and architect new generation cloud-based data warehouses using Azure and AWS. This book provides an in-depth understanding of how to build modern cloud-native data warehouses, as well as their history and evolution. The book starts by covering foundational data warehouse concepts, and introduces modern features such as distributed processing, big data storage, data streaming, and processing data on the cloud. You will gain an understanding of the synergy, relevance, and usage of data warehousing standard practices in the modern world of distributed data processing. The authors walk you through the essential concepts of Data Mesh, Data Lake, Lakehouse, and Delta Lake. They also demonstrate the services and offerings available on Azure and AWS that deal with data orchestration, data democratization, data governance, data security, and business intelligence. After completing this book, you will be ready to design and architect enterprise-grade, cloud-based modern data warehouses using industry best practices and guidelines.

What You Will Learn
Understand the core concepts underlying modern data warehouses
Design and build cloud-native data warehouses
Gain a practical approach to architecting and building data warehouses on Azure and AWS
Implement modern data warehousing components such as Data Mesh, Data Lake, Delta Lake, and Lakehouse
Process data through pandas and evaluate your model’s performance using metrics such as F1-score, precision, and recall
Apply deep learning to supervised, semi-supervised, and unsupervised anomaly detection tasks for tabular datasets and time series applications

Who This Book Is For
Experienced developers, cloud architects, and technology enthusiasts looking to build cloud-based modern data warehouses using Azure and AWS

Troubleshooting Kafka In Production

2023-12-24 · Data Engineering Podcast Listen

podcast_episode

by Elad Eldor , Tobias Macey

AI/ML Analytics Cloud Computing Data Engineering Data Lake Data Lakehouse Data Management Delta Hudi Iceberg Kafka SaaS +2 more

Summary

Kafka has become a ubiquitous technology, offering a simple method for coordinating events and data across different systems. Operating it at scale, however, is notoriously challenging. Elad Eldor has experienced these challenges first-hand, leading to his work writing the book "Kafka Troubleshooting in Production". In this episode he highlights the sources of complexity that contribute to Kafka's operational difficulties, and some of the main ways to identify and mitigate potential sources of trouble.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Your host is Tobias Macey and today I'm interviewing Elad Eldor about operating Kafka in production and how to keep your clusters stable and performant.

Interview

Introduction

How did you get involved in the area of data management?

Can you describe your experiences with Kafka?

What are the operational challenges that you have had to overcome while working with Kafka?

What motivated you to write a book about how to manage Kafka in production?

There are many options now for persistent data queues. What are the factors to consider when determining whether Kafka is the right choice?

In the case where Kafka is the appropriate tool, there are many ways to run it now. What are the considerations that teams need to work through when determining whether/where/how to operate a cluster?

When provisioning a Kafka cluster, what are the requirements that need to be considered when determining the sizing?

What are the axes along which size/scale need to be determined?

The core promise of Kafka is that it is a durable store for continuous data. What are the mechanisms that are available for preventing data loss?

Under what circumstances can data be lost?

What are the different failure conditions that cluster operators need to be aware of?

What are the monitoring strategies that ar

Adding An Easy Mode For The Modern Data Stack With 5X

2023-12-18 · Data Engineering Podcast Listen

podcast_episode

by Tarush Aggarwal (5xData) , Tobias Macey

AI/ML Analytics Cloud Computing Data Engineering Data Lake Data Lakehouse Data Management Delta Hudi Iceberg Modern Data Stack SaaS +2 more

Summary

The "modern data stack" promised a scalable, composable data platform that gave everyone the flexibility to use the best tools for every job. The reality was that it left data teams in the position of spending all of their engineering effort on integrating systems that weren't designed with compatible user experiences. The team at 5X understand the pain involved and the barriers to productivity and set out to solve it by pre-integrating the best tools from each layer of the stack. In this episode founder Tarush Aggarwal explains how the realities of the modern data stack are impacting data teams and the work that they are doing to accelerate time to value.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Your host is Tobias Macey and today I'm welcoming back Tarush Aggarwal to talk about what he and his team at 5x data are building to improve the user experience of the modern data stack.

Interview

Introduction

How did you get involved in the area of data management?

Can you describe what 5x is and the story behind it?

We last spoke in March of 2022. What are the notable changes in the 5x business and product?

What are the notable shifts in the data ecosystem that have influenced your adoption and product direction?

What trends are you most focused on tracking as you plan the continued evolution of your offerings?

What are the points of friction that teams run into when trying to build their data platform?

Can you describe the design of the system that you have built?

What are the strategies that you rely on to support adaptability and speed of onboarding for new integrations?

What are some of the types of edge cases that you have to deal with while integrating and operating the platform implementations that you design for your customers?

What is your process for selection of vendors to support?

How would you characte

Run Your Own Anomaly Detection For Your Critical Business Metrics With Anomstack

2023-12-11 · Data Engineering Podcast Listen

podcast_episode

by Andrew Maguire , Tobias Macey

AI/ML Analytics Cloud Computing Dashboard Data Engineering Data Lake Data Lakehouse Data Management Delta Hudi Iceberg SaaS +2 more

Summary

If your business metrics looked weird tomorrow, would you know about it first? Anomaly detection is focused on identifying those outliers for you, so that you are the first to know when a business critical dashboard isn't right. Unfortunately, it can often be complex or expensive to incorporate anomaly detection into your data platform. Andrew Maguire got tired of solving that problem for each of the different roles he has ended up in, so he created the open source Anomstack project. In this episode he shares what it is, how it works, and how you can start using it today to get notified when the critical metrics in your business aren't quite right.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains, even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at dataengineeringpodcast.com/miro. That’s three free boards at dataengineeringpodcast.com/miro.

Your host is Tobias Macey and today I'm interviewing Andrew Maguire about his work on the Anomstack project and how you can use it to run your own anomaly detection for your metrics.

Interview

Introduction

How did you get involved in the area of data management?

Can you describe what Anomstack is and the story behind it?

What are your goals for this project?

What other tools/products might teams be evaluating while they consider Anom

Yingjun Wu: Unlocking Real-time Insights: Enhancing Your Databases With Stream Processing

2023-12-07 · DATA MINER Big Data Europe Conference 2020 Watch

video

by Yingjun Wu (RisingWave Labs)

Analytics Big Data Data Analytics SQL

Join Yingjun Wu as we unlock the power of real-time insights in 'Unlocking Real-time Insights: Enhancing Your Databases With Stream Processing.' 🚀 Explore how to leverage Change Data Capture (CDC) and modern SQL streaming databases to revolutionize your data analytics, and discover the magic of materialized views for instant, actionable insights. 📈💡 #RealTimeInsights #streamprocessing

✨ H I G H L I G H T S ✨

🙌 A huge shoutout to all the incredible participants who made Big Data Conference Europe 2023 in Vilnius, Lithuania, from November 21-24, an absolute triumph! 🎉 Your attendance and active participation were instrumental in making this event so special. 🌍

Don't forget to check out the session recordings from the conference to relive the valuable insights and knowledge shared! 📽️

Once again, THANK YOU for playing a pivotal role in the success of Big Data Conference Europe 2023. 🚀 See you next year for another unforgettable conference! 📅 #BigDataConference #SeeYouNextYear

Yingjun Wu: Real time OLAP and Stream Processing Friends or Foes

2023-12-05 · DATA MINER Big Data Europe Conference 2020 Watch

video

by Yingjun Wu (RisingWave Labs)

Analytics Big Data SQL

Join Yingjun Wu in unlocking the power of real-time insights through 'Unlocking Real-time Insights: Enhancing Your Databases With Stream Processing.' 📊🔓 Discover how to harness Change Data Capture (CDC) and modern SQL streaming databases to drive real-time analytics, enabling businesses to stay ahead in the data-driven world. 🚀💡 #RealTimeInsights #streamprocessing

Timothy Spann: Building Real-time Travel Alerts

2023-12-04 · DATA MINER Big Data Europe Conference 2020 Watch

video

by Timothy Spann

Flink Big Data Kafka

Join Timothy Spann as he takes you on a journey of 'Building Real-time Travel Alerts' 🌍🚀. Learn how to construct a dynamic streaming application using Apache NiFi, Apache Kafka, and Apache Flink, ensuring optimal performance, productivity, and development simplicity in delivering timely travel advisories. 🌐🛫 #RealTimeAlerts #Streaming #ApacheStack

Designing Data Transfer Systems That Scale

2023-12-04 · Data Engineering Podcast Listen

podcast_episode

by Andrei Tserakhau (DoubleCloud) , Tobias Macey

AI/ML Analytics Cloud Computing Data Engineering Data Lake Data Lakehouse Data Management Data Quality Datafold Delta Hudi Iceberg +3 more

Summary

The first step of a data pipeline is to move the data to a place where you can process and prepare it for its eventual purpose. Data transfer systems are a critical component of data enablement, and building them to support large volumes of information is a complex endeavor. Andrei Tserakhau has dedicated his career to this problem, and in this episode he shares the lessons that he has learned and the work he is doing on his most recent data transfer system at DoubleCloud.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack.

You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results, all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!

This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues for every part of your data workflow, from migration to deployment. Datafold has recently launched a 3-in-1 product experience to support accelerated data migrations. With Datafold, you can seamlessly plan, translate, and validate data across systems, massively accelerating your migration project. Datafold leverages cross-database diffing to compare tables across environments in seconds, column-level lineage for smarter migration planning, and a SQL translator to make moving your SQL scripts easier. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold today!

Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

Your host is Tobias Macey and today I'm interviewing Andrei Tserakhau about operationalizing high bandwidth and low-latency change-data capture.

Interview

Introduction

How did you get involved in the area of data management?

Your most recent project involves operationalizing a generalized data transfer service. What was the original problem that you were trying to solve?

What were the shortcomings of other options in the ecosystem that led you to building a new system?

What was the design of your initial solution to the problem?

What are the sharp edges that you had to deal with to operate and use that i

On the Journey of Redefining Stream Processing: What We Learned from Building RisingWave?

2023-11-30 · Berlin Open Source Data Infrastructure Meetup - November 2023

talk

by Yingjun Wu (RisingWave Labs)

Cloud Computing Rust Snowflake postgresql

Abstract: RisingWave is an open-source streaming database designed from scratch for the cloud. It implements a Snowflake-style architecture that separates storage from compute to reduce performance cost, and it gives users a PostgreSQL-like experience for stream processing. Over the last three years, RisingWave has evolved from a one-person project into a rapidly growing product deployed by nearly 100 enterprises and startups. But the journey of building RisingWave has been full of challenges. In this talk, I'd like to share the lessons we have learned along four dimensions: 1) the decoupled compute-storage architecture, 2) the balance between stream processing and OLAP, 3) the Rust ecosystem, and 4) the product positioning. I will dive deep into technical details and then share my views on the future of stream processing.
