talk-data.com

Topic

Data Management

data_governance data_quality metadata_management


Activity Trend: 88 peak/qtr (2020-Q1 to 2026-Q1)

Activities

1097 activities · Newest first

Summary

Stripe is a company that relies on data to power its products and business. To support that functionality it has invested in Trino and Iceberg for its analytical workloads. In this episode Kevin Liu shares some of the interesting features that Stripe has built by combining those technologies, as well as the challenges it faces in supporting the myriad workloads that are thrown at this layer of its data platform.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

Your host is Tobias Macey and today I'm interviewing Kevin Liu about his use of Trino and Iceberg for Stripe's data lakehouse.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what role Trino and Iceberg play in Stripe's data architecture?

What are the ways in which your job responsibilities intersect with Stripe's lakehouse infrastructure?

What were the requirements and selection criteria that led to the selection of that combination of technologies?

What are the other systems that feed into and rely on the Trino/Iceberg service?

What kinds of questions are you answering with table metadata? (See the metadata-query sketch after this question list.)

What use cases and teams does that support?

What is the comparative utility of the Iceberg REST catalog?
What are the shortcomings of Trino and Iceberg?
What are the most interesting, innovative, or unexpected ways that you have seen Iceberg/Trino used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Stripe's data infrastructure?
When is a lakehouse on Trino/Iceberg the wrong choice?
What do you have planned for the future of Trino and Iceberg at Stripe?
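As a hedged illustration of the table-metadata questions above (connection details, catalog, schema, and table names are invented, not Stripe's setup), Trino's Iceberg connector exposes per-table metadata tables through a "$" suffix that can answer questions like "how many snapshots does this table have?" or "how many small files are accumulating?":

```python
# Hypothetical sketch: inspecting Iceberg table metadata through Trino.
# Hostname, catalog, schema, and table names are placeholders.
import trino

conn = trino.dbapi.connect(
    host="trino.example.internal",  # placeholder coordinator
    port=8080,
    user="analyst",
    catalog="iceberg",
    schema="analytics",
)
cur = conn.cursor()

# Trino's Iceberg connector exposes metadata tables via a "$" suffix,
# e.g. snapshots, files, and history for a table named "payments".
cur.execute(
    'SELECT snapshot_id, committed_at, operation '
    'FROM "payments$snapshots" ORDER BY committed_at DESC'
)
for snapshot_id, committed_at, operation in cur.fetchmany(5):
    print(snapshot_id, committed_at, operation)

# File-level stats help answer "how many small files do we have?"
cur.execute(
    'SELECT count(*) AS n_files, sum(file_size_in_bytes) AS total_bytes '
    'FROM "payments$files"'
)
print(cur.fetchone())
```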

Contact Info

Substack LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links

Trino, Iceberg, Stripe, Spark, Redshift, Hive Metastore, Python Iceberg (PyIceberg), Iceberg REST Catalog, Trino Metadata Table, Flink (Podcast Episode), Tabular (Podcast Episode), Delta Table (Podcast Episode), Databricks Unity Catalog, Starburst, AWS Athena, Kevin's Trino Fest Presentation, Alluxio (Podcast Episode), Parquet, Hudi, Trino Project Tardigrade, Trino On Ice

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Data + AI Summit Keynote Day 1 - Full video
by Patrick Wendell (Databricks), Fei-Fei Li (Stanford University), Brian Ames (General Motors), Ken Wong (Databricks), Ali Ghodsi (Databricks), Jackie Brosamer (Block), Reynold Xin (Databricks), Jensen Huang (NVIDIA)

Databricks Data + AI Summit 2024 Keynote Day 1

Experts, researchers, and open source contributors from Databricks and across the data and AI community gathered in San Francisco June 10-13, 2024 to discuss the latest technologies in data management, data warehousing, data governance, generative AI for the enterprise, and data in the era of AI.

Hear from Databricks Co-founder and CEO Ali Ghodsi on building generative AI applications, putting your data to work, and how data + AI leads to data intelligence.

Plus a fireside chat between Ali Ghodsi and Nvidia Co-founder and CEO, Jensen Huang, on the expanded partnership between Nvidia and Databricks to accelerate enterprise data for the era of generative AI.

Product announcements in the video include:
- Databricks Data Intelligence Platform
- Native support for NVIDIA GPU acceleration on the Databricks Data Intelligence Platform
- Databricks open source model DBRX available as an NVIDIA NIM microservice
- Shutterstock Image AI powered by Databricks
- Databricks AI/BI
- Databricks LakeFlow
- Databricks Mosaic AI
- Mosaic AI Agent Framework
- Mosaic AI Agent Evaluation
- Mosaic AI Tools Catalog
- Mosaic AI Model Training
- Mosaic AI Gateway

In this keynote hear from:
- Ali Ghodsi, Co-founder and CEO, Databricks (1:45)
- Brian Ames, General Motors (29:55)
- Patrick Wendell, Co-founder and VP of Engineering, Databricks (38:00)
- Jackie Brosamer, Head of AI, Data and Analytics, Block (1:14:42)
- Fei-Fei Li, Professor, Stanford University and Denning Co-Director, Stanford Institute for Human-Centered AI (1:23:15)
- Jensen Huang, Co-founder and CEO of NVIDIA with Ali Ghodsi, Co-founder and CEO of Databricks (1:42:27)
- Reynold Xin, Co-founder and Chief Architect, Databricks (2:07:43)
- Ken Wong, Senior Director, Product Management, Databricks (2:31:15)
- Ali Ghodsi, Co-founder and CEO, Databricks (2:48:16)

Python and SQL Bible

The 'Python and SQL Bible' is a comprehensive guide to mastering both Python programming and SQL querying. Starting from the very basics, the book takes readers through advanced techniques, including data manipulation, database management, and integration of Python with SQL, all while offering hands-on examples and real-world exercises.

What this Book will help me do
Gain a strong foundation in Python programming, including control flow, functions, and object-oriented programming. Learn how to write advanced SQL queries for data extraction, manipulation, and reporting. Understand how to integrate Python with SQL to form a seamless data manipulation workflow. Develop data analysis skills using Python and tools such as SQLAlchemy for advanced insights. Master database administration techniques to efficiently manage and query datasets.

Author(s)
Cuantum Technologies LLC is a renowned tech education provider with a focus on equipping learners with in-demand programming and data management skills. Their training methods blend theory with practice, ensuring students gain hands-on experience applicable in professional environments. Their team of experts crafts content to cater to both beginners and professionals seeking to advance their skill set.

Who is it for?
This book is ideal for beginners who are new to programming and experienced professionals who wish to master Python and SQL for data manipulation and analysis. It is perfect for aspiring data scientists, software developers, and IT professionals looking to unlock new career opportunities. By detailing concepts and providing practical exercises, it accommodates various skill levels and prepares readers for industry demands.
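As a hedged taste of the Python-with-SQL workflow the book describes (the table and data here are invented), the standard library's sqlite3 module is enough to let SQL do the aggregation while Python handles control flow and presentation:

```python
# A minimal sketch of a Python + SQL workflow using only the standard
# library's sqlite3 module; table and column names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 120.0), ("south", 80.5), ("north", 99.9)],
)

# SQL performs the aggregation; Python handles the presentation logic.
for region, total in conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
):
    print(f"{region}: {total:.2f}")
conn.close()
```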

Wait, I'm talking to a head of data management at a tech company? Why!? Well, today I'm joined by Malcolm Hawker to get his perspective around data products and what he's seeing out in the wild as Head of Data Management at Profisee. Why Malcolm? Malcolm was a head of product in prior roles, and for several years I've enjoyed his musings on LinkedIn about the value of a product-oriented approach to ML and analytics. We had a chance to meet at CDOIQ in 2023 as well, and he went on my "need to do an episode" list!

According to Malcolm, empathy is the secret to addressing key UX questions that ensure adoption and business value. He also emphasizes the need for data experts to develop business skills so that they're seen as equals by their customers. During our chat, Malcolm stresses the benefits of a product- and customer-centric approach to data products and what data professionals can learn by approaching problem-solving with a product orientation.

Highlights/ Skip to:

Malcolm's definition of a data product (2:10)
Understanding your customers' needs is the first step toward quantifying the benefits of your data product (6:34)
How product makers can gain access to users to build more successful products (11:36)
Answering the UX question to get past the adoption stage and provide business value (16:03)
Data experts must develop business expertise if they want to be seen as equals by potential customers (20:07)
What people really mean by "data culture" (23:02)
Malcolm's data product journey and his changing perspective (32:05)
Using empathy to provide a better UX in design and data (39:24)
Avoiding the death of data science by becoming more product-driven (46:23)
Where the majority of data professionals currently land on their view of product management for data products (48:15)

Quotes from Today's Episode

"My definition of a data product is something that is built by a data and analytics team that solves a specific customer problem that the customer would otherwise be willing to pay for. That's it." - Malcolm Hawker (3:42)

"You need to observe how your customer uses data to make better decisions, optimize a business process, or to mitigate business risk. You need to know how your customers operate at a very, very intimate level, arguably, as well as they know how their business processes operate." - Malcolm Hawker (7:36)

“So, be a problem solver. Be collaborative. Be somebody who is eager to help make your customers’ lives easier. You hear "no" when people think that you’re a burden. You start to hear more “yeses” when people think that you are actually invested in helping make their lives easier.” - Malcolm Hawker (12:42)

“We [data professionals] put data on a pedestal. We develop this mindset that the data matters more—as much or maybe even more than the business processes, and that is not true. We would not exist if it were not for the business. Hard stop.” - Malcolm Hawker (17:07)

“I hate to say it, I think a lot of this data stuff should kind of feel invisible in that way, too. It’s like this invisible ally that you’re not thinking about the dashboard; you just access the information as part of your natural workflow when you need insights on making a decision, or a status check that you’re on track with whatever your goal was. You’re not really going out of mode.” - Brian O’Neill (24:59)

“But you know, data people are basically librarians. We want to put things into classifications that are logical and work forwards and backwards, right? And in the product world, sometimes they just don’t, where you can have something be a product and be a material to a subsequent product.” - Malcolm Hawker (37:57)

“So, the broader point here is just more of a mindset shift. And you know, maybe these things aren’t necessarily a bad thing, but how do we become a little more product- and customer-driven so that we avoid situations where everybody thinks what we’re doing is a time waster?” - Malcolm Hawker (48:00)

Links
Profisee: https://profisee.com/
LinkedIn: https://www.linkedin.com/in/malhawker/
CDO Matters: https://profisee.com/cdo-matters-live-with-malcolm-hawker/

Summary

Streaming data processing enables new categories of data products and analytics. Unfortunately, reasoning about stream processing engines is complex and lacks sufficient tooling. To address this shortcoming, Datorios created an observability platform for Flink that brings visibility to the internals of this popular stream processing system. In this episode Ronen Korman and Stav Elkayam discuss how the increased understanding provided by purpose-built observability improves the usefulness of Flink.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

This episode is supported by Code Comments, an original podcast from Red Hat. As someone who listens to the Data Engineering Podcast, you know that the road from tool selection to production readiness is anything but smooth or straight. In Code Comments, host Jamie Parker, Red Hatter and experienced engineer, shares the journey of technologists from across the industry and their hard-won lessons in implementing new technologies. I listened to the recent episode "Transforming Your Database" and appreciated the valuable advice on how to approach the selection and integration of new databases in applications and the impact on team dynamics. There are 3 seasons of great episodes and new ones landing everywhere you listen to podcasts. Search for "Code Comments" in your podcast player or go to dataengineeringpodcast.com/codecomments today to subscribe. My thanks to the team at Code Comments for their support.

Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

Your host is Tobias Macey and today I'm interviewing Ronen Korman and Stav Elkayam about pulling back the curtain on your real-time data streams by bringing intuitive observability to Flink streams.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Datorios is and the story behind it?
Data observability has been gaining adoption for a number of years now, with a large focus on data warehouses. What are some of the unique challenges posed by Flink?

How much of the complexity is due to the nature of streaming data vs. the architectural realities of Flink?

How has the lack of visibility into the flow of data in Flink impacted the ways that teams think about where/when/how to apply it?
How have the requirements of generative AI shifted the demand for streaming data systems?

What role does Flink play in the architecture of generative AI systems?

Can you describe how Datorios is implemented?

How has the design and goals of Datorios changed since you first started working on it?

How much of the Datorios architecture and functionality is specific to Flink, and how are you thinking about its potential application to other streaming platforms?
Can you describe how Datorios is used in a day-to-day workflow for someone building streaming applications on Flink?
What are the most interesting, innovative, or unexpected ways that you have seen Datorios used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Datorios?
When is Datorios the wrong choice?
What do you have planned for the future of Datorios?
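The visibility gap behind these questions is easier to see with a concrete pipeline. Below is a generic PyFlink sketch (not Datorios' product, and the job itself is invented): between the keyed operators, records flow through Flink's internals where, without purpose-built observability, you mostly see coarse aggregate metrics rather than individual events.

```python
# A generic PyFlink sketch: a tiny streaming pipeline whose
# intermediate records are invisible from the outside, which is the
# gap observability tooling aims to close.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Toy in-memory source; a real job would read from Kafka or similar.
events = env.from_collection([("user_a", 3), ("user_b", 5), ("user_a", 7)])

# Between key_by and reduce, records pass through Flink's internal
# network/state machinery, which standard dashboards summarize only
# as aggregate throughput and checkpoint metrics.
totals = (
    events
    .key_by(lambda e: e[0])
    .reduce(lambda a, b: (a[0], a[1] + b[1]))
)

totals.print()
env.execute("toy-aggregation")
```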

Contact Info

Ronen

LinkedIn

Stav

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

In the fast-paced work environments we are used to, the ability to quickly find and understand data is essential. Data professionals can often spend more time searching for data than analyzing it, which can hinder business progress. Innovations like data catalogs and automated lineage systems are transforming data management, making it easier to ensure data quality, trust, and compliance. By creating a strong metadata foundation and integrating these tools into existing workflows, organizations can enhance decision-making and operational efficiency. But how did this all come to be, and who is driving better access and collaboration through data?

Prukalpa Sankar is the Co-founder of Atlan. Atlan is a modern data collaboration workspace (like GitHub for engineering or Figma for design). By acting as a virtual hub for data assets ranging from tables and dashboards to models and code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Slack, BI tools, data science tools and more. A pioneer in the space, Atlan was recognized by Gartner as a Cool Vendor in DataOps, as one of the top 3 companies globally. Prukalpa previously co-founded SocialCops, a world-leading data-for-good company (New York Times Global Visionary, World Economic Forum Tech Pioneer). SocialCops is behind landmark data projects including India's National Data Platform and SDGs global monitoring in collaboration with the United Nations. She was awarded Economic Times Emerging Entrepreneur of the Year, Forbes 30u30, Fortune 40u40, Top 10 CNBC Young Business Women 2016, and is a TED Speaker.

In the episode, Richie and Prukalpa explore challenges within data discoverability, the inception of Atlan, the importance of a data catalog, personalization in data catalogs, data lineage, building data lineage, implementing data governance, human collaboration in data governance, skills for effective data governance, product design for diverse audiences, regulatory compliance, the future of data management and much more.

Links Mentioned in the Show: Atlan; Connect with Prukalpa; [Course] Artificial Intelligence (AI) Strategy; Related Episode: Adding AI to the Data Warehouse with Sridhar Ramaswamy, CEO at Snowflake; Sign up to RADAR: AI Edition

New to DataCamp? Learn on the go using the DataCamp mobile app. Empower your business with world-class data and AI skills with DataCamp for business.

Join Jason Foster in this thought-provoking discussion with Tamara Kneese, the lead of the Algorithmic Methods Lab at Data & Society Research Institute, to delve into the intricate world of algorithmic systems and their far-reaching impacts. In this episode, they tackle the pressing questions surrounding the assessment of algorithms—environmental impact, societal fairness, and the need for responsible AI. Join the conversation as they discuss the challenges and opportunities these systems present and the vital steps organisations must take to navigate their influence.


Cynozure is a leading data, analytics and AI company that helps organisations to reach their data potential. They work with clients on data and AI strategy, data management, data architecture and engineering, analytics and AI, data culture and literacy, and change management and leadership. The company was named one of The Sunday Times' fastest-growing private companies in 2022 and 2023 and named the Best Place to Work in Data by DataIQ in 2023. For more information, visit www.cynozure.com. Check out our free AI Scorecard and we'll send you a personalised report.

Generative AI in action: Datalex's innovative approach (L300) | AWS Events

This session goes through the new capabilities of Amazon CodeWhisperer, Amazon Q, and Amazon Bedrock. Learn how Datalex has adopted these technologies for its product development processes and data management platform.

Learn more: https://go.aws/3x2mha0 Learn more about AWS events: https://go.aws/3kss9CP

Subscribe: More AWS videos: http://bit.ly/2O3zS75 More AWS events videos: http://bit.ly/316g9t4

ABOUT AWS Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts. AWS is the world’s most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.


Summary

Modern businesses aspire to be data driven, and technologists enjoy working through the challenge of building data systems to support that goal. Data governance is the binding force between these two parts of the organization. Nicola Askham found her way into data governance by accident, and stayed because of the benefit that she was able to provide by serving as a bridge between the technology and business. In this episode she shares the practical steps to implementing a data governance practice in your organization, and the pitfalls to avoid.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

This episode is supported by Code Comments, an original podcast from Red Hat. As someone who listens to the Data Engineering Podcast, you know that the road from tool selection to production readiness is anything but smooth or straight. In Code Comments, host Jamie Parker, Red Hatter and experienced engineer, shares the journey of technologists from across the industry and their hard-won lessons in implementing new technologies. I listened to the recent episode "Transforming Your Database" and appreciated the valuable advice on how to approach the selection and integration of new databases in applications and the impact on team dynamics. There are 3 seasons of great episodes and new ones landing everywhere you listen to podcasts. Search for "Code Comments" in your podcast player or go to dataengineeringpodcast.com/codecomments today to subscribe. My thanks to the team at Code Comments for their support.

Your host is Tobias Macey and today I'm interviewing Nicola Askham about the practical steps of building out a data governance practice in your organization.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of the scope and boundaries of data governance in an organization?

At what point does a lack of an explicit governance policy become a liability?

What are some of the misconceptions that you encounter about data governance?
What impact has the evolution of data technologies had on the implementation of governance practices? (e.g. number/scale of systems, types of data, AI)
Data governance can often become an exercise in boiling the ocean. What are the concrete first steps that will increase the success rate of a governance practice?

Once a data governance project is underway, what are some of the common roadblocks that might derail progress?

What are the net benefits to the data team and the organization when a data governance practice is established, active, and healthy?
What are the most interesting, innovative, or unexpected ways that you have seen data governance applied?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data governance/training/coaching?
What are some of the pitfalls in data governance?
What are some of the future trends in data governance that you are excited by?

Are there any trends that concern you?

Contact Info

Website LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Summary

Any software system that survives long enough will require some form of migration or evolution. When that system is responsible for the data layer the process becomes more challenging. Sriram Panyam has been involved in several projects that required migration of large volumes of data in high traffic environments. In this episode he shares some of the valuable lessons that he learned about how to make those projects successful.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

This episode is supported by Code Comments, an original podcast from Red Hat. As someone who listens to the Data Engineering Podcast, you know that the road from tool selection to production readiness is anything but smooth or straight. In Code Comments, host Jamie Parker, Red Hatter and experienced engineer, shares the journey of technologists from across the industry and their hard-won lessons in implementing new technologies. I listened to the recent episode "Transforming Your Database" and appreciated the valuable advice on how to approach the selection and integration of new databases in applications and the impact on team dynamics. There are 3 seasons of great episodes and new ones landing everywhere you listen to podcasts. Search for "Code Comments" in your podcast player or go to dataengineeringpodcast.com/codecomments today to subscribe. My thanks to the team at Code Comments for their support.

Your host is Tobias Macey and today I'm interviewing Sriram Panyam about his experiences conducting large scale data migrations and the useful strategies that he learned in the process.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by sharing some of your experiences with data migration projects?

As you have gone through successive migration projects, how has that influenced the ways that you think about architecting data systems?

How would you categorize the different types and motivations of migrations?

How does the motivation for a migration influence the ways that you plan for and execute that work?

Can you talk us through one or two specific projects that you have taken part in?

Part 1: The Triggers

Section 1: Technical Limitations triggering Data Migration

Scaling bottlenecks: Performance issues with databases, storage, or network infrastructure
Legacy compatibility: Difficulties integrating with modern tools and cloud platforms
System upgrades: The need to migrate data during major software changes (e.g., SQL Server version upgrade)

Section 2: Types of Migrations for Infrastructure Focus

Storage migration: Moving data between systems (HDD to SSD, SAN to NAS, etc.)
Data center migration: Physical relocation or consolidation of data centers
Virtualization migration: Moving from physical servers to virtual machines (or vice versa)

Section 3: Technical Decisions Driving Data Migrations

End-of-life support: Forced migration when older software or hardware is sunsetted
Security and compliance: Adopting new platforms with better security postures
Cost optimization: Potential savings of cloud vs. on-premise data centers

Part 2: Challenges (and Anxieties)

Section 1: Technical Challenges

Data transformation challenges: Schema changes, complex data mappings
Network bandwidth and latency: Transferring large datasets efficiently
Performance testing
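One recurring tactic behind these challenges is to move data in bounded, resumable chunks and verify as you go, rather than in one long transaction. The sketch below is a hedged, generic illustration (not one of Sriram's actual projects), with sqlite3 standing in for the real source and target systems:

```python
# Generic chunked-copy migration sketch with resumability and a cheap
# post-migration verification step; sqlite3 stands in for real systems.
import sqlite3

source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for db in (source, target):
    db.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
source.executemany(
    "INSERT INTO events (id, payload) VALUES (?, ?)",
    [(i, f"evt-{i}") for i in range(1, 1001)],
)

CHUNK = 250
last_id = 0
while True:
    rows = source.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, CHUNK),
    ).fetchall()
    if not rows:
        break
    target.executemany("INSERT INTO events (id, payload) VALUES (?, ?)", rows)
    target.commit()
    last_id = rows[-1][0]  # the resume point survives an interruption

# Cheap verification; real projects typically add checksums as well.
src_n = source.execute("SELECT COUNT(*) FROM events").fetchone()[0]
tgt_n = target.execute("SELECT COUNT(*) FROM events").fetchone()[0]
assert src_n == tgt_n, f"row count mismatch: {src_n} != {tgt_n}"
print(f"migrated {tgt_n} rows")
```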

Join Jason Foster for an insightful discussion with Richard H Harris MBE, a technology M&A expert and Senior Vice President M&A at Corum Group. In this podcast, Richard shares his experiences and insights on the impact of AI on mergers and acquisitions (M&A). With a background in theoretical physics, experience with data, and a passion for advancing technology, Richard brings a unique perspective to the table. Tune in now and discover how Richard navigated the challenges of integrating AI into traditional M&A processes and the strategies he used to drive successful outcomes.


Cynozure is a leading data, analytics and AI company that helps organisations to reach their data potential. They work with clients on data and AI strategy, data management, data architecture and engineering, analytics and AI, data culture and literacy, and change management and leadership. The company was named one of The Sunday Times' fastest-growing private companies in 2022 and 2023 and named the Best Place to Work in Data by DataIQ in 2023. For more information, visit www.cynozure.com. Check out our free AI Scorecard and we'll send you a personalised report that outlines what's needed to drive innovation to your business and be competitive.

Summary

The purpose of business intelligence systems is to allow anyone in the business to access and decode data to help them make informed decisions. Unfortunately, this often turns into an exercise in frustration for everyone involved due to complex workflows and hard-to-understand dashboards. The team at Zenlytic has leaned on the promise of large language models to build an AI agent that lets you converse with your data. In this episode they share their journey through the fast-moving landscape of generative AI and unpack the difference between an AI chatbot and an AI agent.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

This episode is supported by Code Comments, an original podcast from Red Hat. As someone who listens to the Data Engineering Podcast, you know that the road from tool selection to production readiness is anything but smooth or straight. In Code Comments, host Jamie Parker, Red Hatter and experienced engineer, shares the journey of technologists from across the industry and their hard-won lessons in implementing new technologies. I listened to the recent episode "Transforming Your Database" and appreciated the valuable advice on how to approach the selection and integration of new databases in applications and the impact on team dynamics. There are 3 seasons of great episodes and new ones landing everywhere you listen to podcasts. Search for "Code Comments" in your podcast player or go to dataengineeringpodcast.com/codecomments today to subscribe. My thanks to the team at Code Comments for their support.

Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

Your host is Tobias Macey and today I'm interviewing Ryan Janssen and Paul Blankley about their experiences building AI-powered agents for interacting with your data.

Interview

Introduction
How did you get involved in data? In AI?
Can you describe what Zenlytic is and the role that AI is playing in your platform?
What have been the key stages in your AI journey?

What are some of the dead ends that you ran into along the path to where you are today?
What are some of the persistent challenges that you are facing?

So tell us more about data agents. Firstly, what are data agents and why do you think they're important? (A minimal agent-loop sketch follows this list.)
How are data agents different from chatbots?
Are data agents harder to build? How do you make them work in production?
What other technical architectures have you had to develop to support the use of AI in Zenlytic?
How have you approached the work of customer education as you introduce this functionality?
What are some of the most interesting or erroneous misconceptions that you have heard about what the AI can and can't do?
How have you balanced accuracy/trustworthiness with user experience and flexibility in the conversational AI, given the potential for these models to create erroneous responses?
What are the most interesting, innovative, or unexpected ways that you have seen your AI agent used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on building an AI agent for business intelligence?
When is an AI agent the wrong choice?
What do you have planned for the future of AI in the Zenlytic product?
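To make the chatbot/agent distinction concrete, here is a hedged toy sketch in which llm() and run_sql() are invented stand-ins, not Zenlytic's implementation. A chatbot answers in one shot; an agent loops, calls tools, and feeds observations back to the model before answering:

```python
# Toy agent loop: the model may request a tool call (here, a SQL
# runner) and sees the result before producing a final answer.
# llm() and run_sql() are hard-coded stand-ins so the sketch runs.
import json

def llm(messages):
    # Stand-in for a real model call: first ask for a tool, then answer.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "run_sql", "args": {"query": "SELECT COUNT(*) FROM orders"}}
    return {"answer": "There are 42 orders."}

def run_sql(query):
    return [[42]]  # stand-in for executing against the warehouse

def agent(question, max_steps=5):
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        reply = llm(messages)
        if "answer" in reply:                   # chatbot behavior: answer directly
            return reply["answer"]
        tool_result = run_sql(**reply["args"])  # agent behavior: act, then observe
        messages.append({"role": "tool", "content": json.dumps(tool_result)})
    return "Gave up after too many steps."

print(agent("How many orders do we have?"))
```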

Contact Info

Ryan

LinkedIn

Paul

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Summary

Building a data platform is a substantial engineering endeavor. Once it is running, the next challenge is figuring out how to address release management for all of the different component parts. The services and systems need to be kept up to date, but so does the code that controls their behavior. In this episode your host Tobias Macey reflects on his current challenges in this area and some of the factors that contribute to the complexity of the problem.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

This episode is supported by Code Comments, an original podcast from Red Hat. As someone who listens to the Data Engineering Podcast, you know that the road from tool selection to production readiness is anything but smooth or straight. In Code Comments, host Jamie Parker, Red Hatter and experienced engineer, shares the journey of technologists from across the industry and their hard-won lessons in implementing new technologies. I listened to the recent episode "Transforming Your Database" and appreciated the valuable advice on how to approach the selection and integration of new databases in applications and the impact on team dynamics. There are 3 seasons of great episodes and new ones landing everywhere you listen to podcasts. Search for "Code Comments" in your podcast player or go to dataengineeringpodcast.com/codecomments today to subscribe. My thanks to the team at Code Comments for their support.

Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

Your host is Tobias Macey and today I want to talk about my experiences managing the QA and release management process of my data platform.

Interview

Introduction

As a team, our overall goal is to ensure that the production environment for our data platform is highly stable and reliable. This is the foundational element of establishing and maintaining trust with the consumers of our data. In order to support this effort, we need to ensure that only changes that have been tested and verified are promoted to production. Our current challenge is one that plagues all data teams. We want to have an environment that mirrors our production environment that is available for testing, but it's not feasible to maintain a complete duplicate of all of the production data. Compounding that challenge is the fact that each of the components of our data platform interact with data in slightly different ways and need different processes for ensuring that changes are being promoted safely.
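One concrete shape this can take is a pre-promotion gate: before a change ships, compare a staging table against production on a few cheap invariants. The sketch below is a hedged illustration, not the platform's actual process; the hostname, schemas, and table names are placeholders, and the Trino Python client is used since Trino fronts the lakehouse described here.

```python
# Hedged sketch of a promotion gate: cheap invariant checks comparing
# staging against production before a change is promoted. All
# connection details and table names are placeholders.
import trino

def row_stats(catalog, schema, table):
    conn = trino.dbapi.connect(
        host="trino.example.internal", port=8080, user="ci",
        catalog=catalog, schema=schema,
    )
    cur = conn.cursor()
    cur.execute(f"SELECT COUNT(*), MAX(updated_at) FROM {table}")
    return cur.fetchone()

prod_count, prod_max = row_stats("iceberg", "prod", "orders")
stage_count, stage_max = row_stats("iceberg", "staging", "orders")

# Staging holds a sample, so exact equality is not expected; instead
# check that staging is non-empty, no larger than prod, and as fresh.
if stage_count == 0 or stage_count > prod_count or stage_max < prod_max:
    raise SystemExit("staging failed promotion checks; blocking release")
print("promotion checks passed")
```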

Contact Info

LinkedIn Website

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links

Data Platforms and Leaky Abstractions Episode, Building A Data Platform From Scratch, Airbyte (Podcast Episode), Trino, dbt, Starburst Galaxy, Superset, Dagster, LakeFS (Podcast Episode), Nessie (Podcast Episode), Iceberg, Snowflake, LocalStack, DSL (Domain Specific Language)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Curious about how Expedia leverages data to fuel its vast network of partnerships in the travel industry? Join us as we dive into the world of global travel partnerships with Raegan Armstrong, Senior Director of Global Airline Partnerships at Expedia Group. In this discussion, Raegan sheds light on the critical role of data in driving strategic decisions and operational agility. Tune in now and discover how Expedia utilises data-driven insights to identify growth opportunities, tackle global challenges, and ensure seamless experiences for partners and travellers.


Cynozure is a leading data, analytics and AI company that helps organisations to reach their data potential. They work with clients on data and AI strategy, data management, data architecture and engineering, analytics and AI, data culture and literacy, and change management and leadership. The company was named one of The Sunday Times' fastest-growing private companies in 2022 and 2023 and named the Best Place to Work in Data by DataIQ in 2023. For more information, visit www.cynozure.com.

ArcGIS Pro 3.x Cookbook - Second Edition

ArcGIS Pro 3.x Cookbook teaches you to master the powerful tools available in Esri's ArcGIS Pro application for geospatial data management and analysis. You'll discover practical recipes that guide you through creating, editing, visualizing, and analyzing GIS data in 2D and 3D. Whether you are transitioning from ArcMap or starting fresh, this book will empower you to build impressive geospatial projects.

What this Book will help me do
Navigate and make effective use of the ArcGIS Pro user interface and tools. Create, edit, and publish detailed 2D and 3D geospatial maps. Manage data efficiently using geodatabases, relationships, and topology tools. Perform comprehensive spatial analyses including proximity, clustering, and 3D analysis. Apply geospatial data validation techniques to ensure data consistency and integrity.

Author(s)
Tripp Corbin, GISP, is an experienced Geographic Information Systems Professional with extensive expertise in Esri's GIS ecosystem. Tripp has taught numerous colleagues about ArcGIS Pro and its capabilities, bringing clarity and focus to complex GIS concepts. His engaging teaching style and comprehensive technical knowledge make this book both helpful and approachable for readers.

Who is it for?
This book is designed for GIS professionals, geospatial analysts, and technicians looking to expand their skills with ArcGIS Pro. It's well-suited for architects and specialists who want to visualize, analyze, and create GIS projects effectively. Beginner GIS users will find clear guidance without needing prior experience, and experienced ArcMap users will learn how to transition smoothly to ArcGIS Pro.
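As a hedged taste of the kind of recipe the book covers, here is a minimal geoprocessing sketch using Esri's arcpy package. It requires an ArcGIS Pro installation, and the geodatabase path and layer names are placeholders, not examples from the book.

```python
# Hypothetical arcpy sketch in the spirit of the cookbook's recipes;
# requires ArcGIS Pro, and the paths/layer names are placeholders.
import arcpy

arcpy.env.workspace = r"C:\data\demo.gdb"  # placeholder geodatabase
arcpy.env.overwriteOutput = True

# Buffer roads by 100 feet, then count how many parcels intersect.
arcpy.analysis.Buffer("roads", "roads_buf_100ft", "100 Feet")
arcpy.management.MakeFeatureLayer("parcels", "parcels_lyr")
arcpy.management.SelectLayerByLocation("parcels_lyr", "INTERSECT", "roads_buf_100ft")
print(arcpy.management.GetCount("parcels_lyr")[0], "parcels near roads")
```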

Databases are ubiquitous, and you don't need to be a data practitioner to know that all data everywhere is stored in a database—or is it? While the majority of data around the world lives in a database, the data that helps run the heart of our operating systems—the core functions of our computers—is not stored in the same place as everywhere else. This is due to database storage sitting 'above' the operating system, requiring the OS to run before the databases can be used. But what if the OS was built 'on top' of a database? What difference could this fundamental change make to how we use computers?

Mike Stonebraker is a distinguished computer scientist known for his foundational work in database systems; he is also currently CTO and Co-Founder at DBOS. His extensive career includes significant contributions through academic prototypes and commercial startups, leading to the creation of several pivotal relational database companies such as Ingres Corporation, Illustra, Paradigm4, StreamBase Systems, Tamr, Vertica, and VoltDB. Stonebraker's role as chief technical officer at Informix and his influential research earned him the prestigious 2014 Turing Award. Stonebraker's professional journey spans two major phases: initially at the University of California, Berkeley, focusing on relational database management systems like Ingres and Postgres, and later, from 2001, at the Massachusetts Institute of Technology (MIT), where he pioneered advanced data management techniques including C-Store, H-Store, SciDB, and DBOS. He remains a professor emeritus at UC Berkeley and continues to influence as an adjunct professor at MIT's Computer Science and Artificial Intelligence Laboratory. Stonebraker is also recognized for his editorial work on the book "Readings in Database Systems."

In the episode, Richie and Mike explore the success of PostgreSQL, the evolution of SQL databases, the shift towards cloud computing and what that means in practice when migrating to the cloud, the impact of disaggregated storage, software and serverless trends, the role of databases in facilitating new data and AI trends, DBOS and its advantages for security, and much more.

Links Mentioned in the Show: DBOS; Paper: What Goes Around Comes Around; [Course] Understanding Cloud Computing; Related Episode: Scaling Enterprise Analytics with Libby Duane Adams, Chief Advocacy Officer and Co-Founder of Alteryx; Rewatch sessions from RADAR: The Analytics Edition

New to DataCamp? Learn on the go using the DataCamp mobile app. Empower your business with world-class data and AI skills with DataCamp for business.
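To make the "OS on top of a database" idea tangible, here is a toy sketch (emphatically not DBOS's implementation): OS-style scheduler state kept as SQL rows, so state transitions become transactions and introspection becomes a query.

```python
# Toy illustration of OS state in a database: a "process table" as
# SQL rows. sqlite3 stands in for a real transactional store.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE procs (pid INTEGER PRIMARY KEY, name TEXT, state TEXT)")
db.executemany(
    "INSERT INTO procs (pid, name, state) VALUES (?, ?, ?)",
    [(1, "init", "running"), (2, "scheduler", "running"), (3, "worker", "blocked")],
)

# A state transition becomes an ordinary transaction...
with db:
    db.execute("UPDATE procs SET state = 'running' WHERE pid = 3")

# ...and introspection becomes an ordinary query.
for pid, name, state in db.execute("SELECT pid, name, state FROM procs"):
    print(pid, name, state)
```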

Summary

Generative AI has rapidly transformed everything in the technology sector. When Andrew Lee started work on Shortwave he was focused on making email more productive. When AI started gaining adoption he realized that he had even more potential for a transformative experience. In this episode he shares the technical challenges that he and his team have overcome in integrating AI into their product, as well as the benefits and features that it provides to their customers.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!

Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

Your host is Tobias Macey and today I'm interviewing Andrew Lee about his work on Shortwave, an AI-powered email client.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Shortwave is and the story behind it?

What is the core problem that you are addressing with Shortwave?

Email has been a central part of communication and business productivity for decades now. What are the overall themes that continue to be problematic?
What are the strengths that email maintains as a protocol and ecosystem?
From a product perspective, what are the data challenges that are posed by email?
Can you describe how you have architected the Shortwave platform?

How have the design and goals of the product changed since you started it?
What are the ways that the advent and evolution of language models have influenced your product roadmap?

How do you manage the personalization of the AI functionality in your system for each user/team?
For users and teams who are using Shortwave, how does it change their workflow and communication patterns?
Can you describe how I would use Shortwave for managing the workflow of evaluating, planning, and promoting my podcast episodes?
What are the most interesting, innovative, or unexpected ways that you have seen Shortwave used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Shortwave?
When is Shortwave the wrong choice?
What do you have planned for the future of Shortwave?

Contact Info

LinkedIn Blog

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Send us a text Welcome to the cozy corner of the tech world where ones and zeros mingle with casual chit-chat. Datatopics Unplugged is your go-to spot for relaxed discussions around tech, news, data, and society. Dive into conversations that should flow as smoothly as your morning coffee (but don't), where industry insights meet laid-back banter. Whether you're a data aficionado or just someone curious about the digital age, pull up a chair, relax, and let's get into the heart of data, unplugged style!

In this episode, titled "#46 Debunking Devon, Exploring RAG Frameworks, and Tech for a Better World", our special guest Martin Van Mollekot adds a rich layer of insight to our tech stew, covering everything from 3D-printed humanoids to the harmonious blend of AI and music, all while exploring how tech is cultivating a better world.

3D Printing: Martin discusses building a humanoid using resources from Thingiverse.
AI Generated Music: Exploring Udio, an AI that not only composes music but adds vocals to match your taste.
Devin Debunked: Unpacking the claims of the "First AI Software Engineer" and why it's not quite time to worry about AI taking coding jobs.
GPT-4 Over Humans? A critical look at whether AI could replace junior analysts in the current tech landscape.
The Data Science Dilemma: Is Data Science Dead? Discussing the evolution and future relevance of data science, with Zapier highlighted for its accessible toolset.
RAG Frameworks Galore: Discover the evolving buffet of RAG frameworks, making data handling smoother, and whether they're up to the hype: Ragflow, Pine Cone, Verba, and R2R.
Tech for a Better World: Martin shares his personal story of how computer vision technology can aid farmers in managing their livestock.
Hip-Hop and Generative AI: How generative AI is stirring up the music industry & tips from Bart on reproducing hit tracks.
The Low-Code Revolution: Martin shares his insights on the rise of low-code/no-code platforms in data management.

Summary

Databases come in a variety of formats for different use cases. The default association with the term "database" is relational engines, but non-relational engines are also used quite widely. In this episode Oren Eini, CEO and creator of RavenDB, explores the nuances of relational vs. non-relational engines, and the strategies for designing a non-relational database.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

This episode is brought to you by Datafold, a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication. Leverage Datafold's fast cross-database data diffing and monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale, and receive alerts about any discrepancies. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold.

Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!

Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

Your host is Tobias Macey and today I'm interviewing Oren Eini about the work of designing and building a NoSQL database engine.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what constitutes a NoSQL database?

How have the requirements and applications of NoSQL engines changed since they first became popular ~15 years ago?

What are the factors that convince teams to use a NoSQL vs. SQL database?

NoSQL is a generalized term that encompasses a number of different data models. How does the underlying representation (e.g. document, K/V, graph) change that calculus?

How has the evolution in data formats (e.g. N-dimensional vectors, point clouds, etc.) changed the landscape for NoSQL engines?
When designing and building a database, what are the initial set of questions that need to be answered?

How many "core capabilities" can you reasonably design around before they conflict with each other?

How have you approached the evolution of RavenDB as you add new capabilities and mature the project?

What are some of the early decisions that had to be unwound to enable new capabilities?

If you were to start from scratch today, what database would you build?
What are the most interesting, innovative, or unexpected ways that you have seen RavenDB/NoSQL databases used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on RavenDB?
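As a hedged illustration of the document model these questions circle around (a toy, not RavenDB's engine): a document store keeps the whole aggregate together, and pays for non-key query paths with secondary indexes it must maintain on every write.

```python
# Toy document store: whole aggregates stored under one key, plus a
# secondary index maintained for a single non-key query path.
from collections import defaultdict

class TinyDocStore:
    def __init__(self):
        self.docs = {}                   # doc_id -> document
        self.by_city = defaultdict(set)  # secondary index: city -> doc_ids

    def put(self, doc_id, doc):
        old = self.docs.get(doc_id)
        if old:  # keep the index consistent on overwrite
            self.by_city[old["address"]["city"]].discard(doc_id)
        self.docs[doc_id] = doc
        self.by_city[doc["address"]["city"]].add(doc_id)

    def get(self, doc_id):
        return self.docs[doc_id]

    def find_by_city(self, city):
        return [self.docs[i] for i in self.by_city[city]]

store = TinyDocStore()
# The whole aggregate lives in one document; no join is needed to read it.
store.put("users/1", {"name": "Ada", "address": {"city": "London"}})
store.put("users/2", {"name": "Linus", "address": {"city": "Helsinki"}})
print(store.find_by_city("London"))
```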

BigQuery Studio and BigFrames are a powerful combination for scalable data science and analytics. Unify data management, analysis, and collaboration with BigQuery Studio’s intuitive interface. Scale data science and machine learning with BigFrames’ powerful Python API. Get deeper insights, faster.
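As a hedged sketch of the BigFrames Python API mentioned above (the project ID is a placeholder, and the public dataset is assumed to be reachable from your GCP project), pandas-style operations are compiled to BigQuery SQL and executed server-side:

```python
# Hedged BigFrames sketch; the project ID is a placeholder and the
# public table is assumed accessible from your environment.
import bigframes.pandas as bpd

bpd.options.bigquery.project = "your-gcp-project"  # placeholder

# read_gbq returns a pandas-like DataFrame whose operations run in
# BigQuery rather than in local memory.
df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")
summary = df.groupby("species")["body_mass_g"].mean()
print(summary.to_pandas())  # materialize only the small result locally
```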
