Why is meaningful innovation so hard in insurance, and what can be done about it? In this episode of Hub & Spoken, host Jason Foster is joined by John Turner, a global underwriting leader and expert in life and health insurance, to explore the innovation imperative in one of the world's most traditional (and risk-averse) industries. They unpack the complex cultural, structural, and regulatory challenges that make change difficult in insurance, from siloed teams and outdated processes to over-engineered tech solutions that miss the mark. But they also spotlight the opportunities: from automation that enhances the customer journey to cross-functional collaboration that drives real transformation.

💡 Key talking points include:
- Why conservative cultures make innovation harder, but not impossible
- The hidden flaws in traditional underwriting and risk selection
- How data, automation and behaviour change can unlock new growth
- The clash between insurers and insurtechs, and how to bridge it
- What truly customer-led innovation looks like

Whether you're in insurance, financial services, or just trying to drive change in a legacy-heavy environment, this episode is packed with ideas, reflections, and real-world experience you can learn from.

Cynozure is a leading data, analytics and AI company that helps organisations to reach their data potential. It works with clients on data and AI strategy, data management, data architecture and engineering, analytics and AI, data culture and literacy, and data leadership. The company was named one of The Sunday Times' fastest-growing private companies in both 2022 and 2023 and recognised as The Best Place to Work in Data by DataIQ in 2023 and 2024. Cynozure is a certified B Corporation.
talk-data.com
Topic
Data Management
Summary In this episode of the Data Engineering Podcast Andy Warfield talks about the innovative functionalities of S3 Tables and Vectors and their integration into modern data stacks. Andy shares his journey through the tech industry and his role at Amazon, where he collaborates to enhance storage capabilities, discussing the evolution of S3 from a simple storage solution to a sophisticated system supporting advanced data types like tables and vectors crucial for analytics and AI-driven applications. He explains the motivations behind introducing S3 Tables and Vectors, highlighting their role in simplifying data management and enhancing performance for complex workloads, and shares insights into the technical challenges and design considerations involved in developing these features. The conversation explores potential applications of S3 Tables and Vectors in fields like AI, genomics, and media, and discusses future directions for S3's development to further support data-driven innovation.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Tired of data migrations that drag on for months or even years? What if I told you there's a way to cut that timeline by up to 6x while guaranteeing accuracy? Datafold's Migration Agent is the only AI-powered solution that doesn't just translate your code; it validates every single data point to ensure perfect parity between your old and new systems. Whether you're moving from Oracle to Snowflake, migrating stored procedures to dbt, or handling complex multi-system migrations, they deliver production-ready code with a guaranteed timeline and fixed price. Stop burning budget on endless consulting hours. Visit dataengineeringpodcast.com/datafold to book a demo and see how they're turning months-long migration nightmares into week-long success stories.
- Your host is Tobias Macey and today I'm interviewing Andy Warfield about S3 Tables and Vectors

Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what your goals are with the Tables and Vector features of S3?
- How did the experience of building S3 Tables inform your work on S3 Vectors?
- There are numerous implementations of vector storage and search. How do you view the role of S3 in the context of that ecosystem?
- The most directly analogous implementation that I'm aware of is the Lance table format. How would you compare the implementation and capabilities of Lance with what you are building with S3 Vectors?
- What opportunity do you see for being able to offer a protocol compatible implementation similar to the Iceberg compatibility that you provide with S3 Tables?
- Can you describe the technical implementation of the Vectors functionality in S3?
- What are the sources of inspiration that you looked to in designing the service?
- Can you describe some of the ways that S3 Vectors might be integrated into a typical AI application?
- What are the most interesting, innovative, or unexpected ways that you have seen S3 Tables/Vectors used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on S3 Tables/Vectors?
- When is S3 the wrong choice for Iceberg or Vector implementations?
- What do you have planned for the future of S3 Tables and Vectors?

Contact Info
- LinkedIn

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links
S3 Tables, S3 Vectors, S3 Express, Parquet, Iceberg, Vector Index, Vector Database, pgvector, Embedding Model, Retrieval Augmented Generation, TwelveLabs, Amazon Bedrock, Iceberg REST Catalog, Log-Structured Merge Tree, S3 Metadata, Sentence Transformer, Spark, Trino, Daft

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Whether it's to adhere to regulations, access markets by meeting specific standards, or devise data analytics and AI strategies, companies today are busy implementing metadata repositories: metadata tools about the IT, data, information, and knowledge in your company. Until now, most of these repositories have been implemented in isolation from one another, but that practice lies at the core of problems with data management in many companies today. Author Ole Olesen-Bagneux, chief evangelist at Actian, shows you how to masterfully manage your metadata repositories by properly coordinating them. That requires a data discovery team to increase insights for all key players in enterprise data management, from the CIO and CDO to enterprise and data architects. Coordinating these repositories will help you and your organization democratize data and excel at data management. This book shows you how.

- Learn what metadata repositories are and what they do
- Explore which data to represent in these repositories
- Set up a data discovery team to make data searchable
- Learn how to manage and coordinate repositories in a meta grid
- Increase innovation by setting up a functional data marketplace
- Make information security and data protection more robust
- Gain a deeper understanding of your company IT landscape
- Activate real enterprise architecture based on evidence
Summary In this episode of the Data Engineering Podcast Akshay Agrawal from Marimo discusses the innovative new Python notebook environment, which offers a reactive execution model, full Python integration, and built-in UI elements to enhance the interactive computing experience. He discusses the challenges of traditional Jupyter notebooks, such as hidden states and lack of interactivity, and how Marimo addresses these issues with features like reactive execution and Python-native file formats. Akshay also explores the broader landscape of programmatic notebooks, comparing Marimo to other tools like Jupyter, Streamlit, and Hex, highlighting its unique approach to creating data apps directly from notebooks and eliminating the need for separate app development. The conversation delves into the technical architecture of Marimo, its community-driven development, and future plans, including a commercial offering and enhanced AI integration, emphasizing Marimo's role in bridging the gap between data exploration and production-ready applications.
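The reactive execution model mentioned in the summary can be illustrated with a toy sketch. This is not Marimo's implementation or API; the `ReactiveNotebook` class, its `cell` and `set` methods, and the example cells are all hypothetical names invented here. The sketch only shows the general idea: each cell declares which variables it reads and defines, and changing a variable reruns just the dependent cells in dependency order, which avoids the stale hidden state that manual, out-of-order execution can leave behind.

```python
from graphlib import TopologicalSorter


class ReactiveNotebook:
    """Toy model of reactive execution (hypothetical, not Marimo's API)."""

    def __init__(self):
        self.cells = {}   # cell name -> (function, names read, name defined)
        self.values = {}  # variable name -> current value

    def cell(self, name, reads, defines, fn):
        """Register a cell that reads some variables and defines one."""
        self.cells[name] = (fn, tuple(reads), defines)

    def _dependents(self, changed_var):
        """Return every cell that transitively depends on changed_var."""
        dirty, frontier = set(), {changed_var}
        while frontier:
            var = frontier.pop()
            for name, (_, reads, defines) in self.cells.items():
                if var in reads and name not in dirty:
                    dirty.add(name)
                    frontier.add(defines)
        return dirty

    def set(self, var, value):
        """Update a variable, then rerun only the affected cells in order."""
        self.values[var] = value
        dirty = self._dependents(var)
        producer = {defines: name for name, (_, _, defines) in self.cells.items()}
        # Each dirty cell must run after the dirty cells that produce its inputs.
        graph = {
            name: {producer[r] for r in self.cells[name][1] if producer.get(r) in dirty}
            for name in dirty
        }
        order = list(TopologicalSorter(graph).static_order())
        for name in order:
            fn, reads, defines = self.cells[name]
            self.values[defines] = fn(*(self.values[r] for r in reads))
        return order


nb = ReactiveNotebook()
nb.cell("double", reads=["x"], defines="y", fn=lambda x: x * 2)
nb.cell("total", reads=["x", "y"], defines="z", fn=lambda x, y: x + y)
nb.cell("unrelated", reads=["w"], defines="v", fn=lambda w: w + 1)

order = nb.set("x", 3)  # reruns "double" then "total"; "unrelated" is untouched
print(order, nb.values)
```

Because the run order is derived from declared dependencies rather than from the order in which a user happens to execute cells, the values in `nb.values` are always consistent with the current inputs.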
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Tired of data migrations that drag on for months or even years? What if I told you there's a way to cut that timeline by up to 6x while guaranteeing accuracy? Datafold's Migration Agent is the only AI-powered solution that doesn't just translate your code; it validates every single data point to ensure perfect parity between your old and new systems. Whether you're moving from Oracle to Snowflake, migrating stored procedures to dbt, or handling complex multi-system migrations, they deliver production-ready code with a guaranteed timeline and fixed price. Stop burning budget on endless consulting hours. Visit dataengineeringpodcast.com/datafold to book a demo and see how they're turning months-long migration nightmares into week-long success stories.
- Your host is Tobias Macey and today I'm interviewing Akshay Agrawal about Marimo, a reusable and reproducible Python notebook environment

Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Marimo is and the story behind it?
- What are the core problems and use cases that you are focused on addressing with Marimo?
- What are you explicitly not trying to solve for with Marimo?
- Programmatic notebooks have been around for decades now. Jupyter was largely responsible for making them popular outside of academia. How have the applications of notebooks changed in recent years?
- What are the limitations that have been most challenging to address in production contexts?
- Jupyter has long had support for multi-language notebooks/notebook kernels. What is your opinion on the utility of that feature as a core concern of the notebook system?
- Beyond notebooks, Streamlit and Hex have become quite popular for publishing the results of notebook-style analysis. How would you characterize the feature set of Marimo for those use cases?
- For a typical data team that is working across data pipelines, business analytics, ML/AI engineering, etc., how do you see Marimo applied within and across those contexts?
- One of the common difficulties with notebooks is that they are largely a single-player experience. They may connect into a shared compute cluster for scaling up execution (e.g. Ray, Dask, etc.). How does Marimo address the situation where a data platform team wants to offer notebooks as a service to reduce the friction to getting started with analyzing data in a warehouse/lakehouse context?
- How are you seeing teams integrate Marimo with orchestrators (e.g. Dagster, Airflow, Prefect)?
- What are some of the most interesting or complex engineering challenges that you have had to address while building and evolving Marimo?
- What are the most interesting, innovative, or unexpected ways that you have seen Marimo used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Marimo?
- When is Marimo the wrong choice?
- What do you have planned for the future of Marimo?

Contact Info
- LinkedIn

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links
Marimo, Jupyter, IPython, Streamlit, Podcast.init Episode, Vector Embeddings, Dimensionality Reduction, Kaggle, Pytest, PEP 723 script dependency metadata, MATLAB, VisiCalc, Mathematica, RMarkdown, RShiny, Elixir Livebook, Databricks Notebooks, Papermill, Pluto - Julia Notebook, Hex, Directed Acyclic Graph (DAG), Sumble (Kaggle founder Anthony Goldbloom's startup), Ray, Dask, Jupytext, nbdev, DuckDB, Podcast Episode, Iceberg, Superset, jupyter-marimo-proxy, JupyterHub, Binder, Nix, AnyWidget, Jupyter Widgets, Matplotlib, Altair, Plotly, DataFusion, Polars, MotherDuck

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
In this episode, we look at the real story behind transformation in data and AI, and why the classic big bang approach often fails to deliver lasting impact. Jason explores when large-scale transformation programmes do make sense, like when you're starting from a fundamentally broken place, or when disruption is the goal. But he also digs into the messy reality of what usually happens: slow delivery, rigid plans, lost trust, and a disconnect between activity and real outcomes. He then makes the case for iterative change. A more human, responsive, and sustainable way to build meaningful transformation over time. With real-world examples and sharp reflections, Jason shares how small, focused steps can create big shifts, and how to blend bold vision with practical delivery. This episode is full of insight for business and data leaders navigating change, delivering transformation, or just trying to make something actually stick.
Summary In this episode of the Data Engineering Podcast Dan Sotolongo from Snowflake talks about the complexities of incremental data processing in warehouse environments. Dan discusses the challenges of handling continuously evolving datasets and the importance of incremental data processing for optimized resource use and reduced latency. He explains how delayed view semantics can address these challenges by maintaining up-to-date results with minimal work, leveraging Snowflake's dynamic tables feature. The conversation also explores the broader landscape of data processing, comparing batch and streaming systems, and highlights the trade-offs between them. Dan emphasizes the need for a unified theoretical framework to discuss semantic guarantees in data pipelines and introduces the concept of delayed view semantics, touching on the limitations of current systems and the potential of dynamic tables to simplify complex data workflows.
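To make the contrast between full recomputation and incremental processing concrete, here is a minimal sketch. It is not Snowflake's dynamic tables implementation; the change-log layout, the `IncrementalTotals` class, and the helper names are assumptions made up for illustration. It shows one informal reading of the idea in the summary: the maintained result always equals what the query would return over the source data as of some earlier version, and each refresh does work proportional to the changes rather than to the whole table.

```python
from collections import defaultdict


def table_as_of(change_log, version):
    """Rebuild the base table from the first `version` changes (hypothetical log format)."""
    rows = []
    for sign, customer, amount in change_log[:version]:
        if sign > 0:
            rows.append((customer, amount))
        else:
            rows.remove((customer, amount))
    return rows


def full_recompute(rows):
    """Batch approach: scan every row to rebuild the total per customer."""
    totals = defaultdict(int)
    for customer, amount in rows:
        totals[customer] += amount
    return dict(totals)


class IncrementalTotals:
    """Maintains the same totals by applying only the changes since the last refresh."""

    def __init__(self):
        self.totals = defaultdict(int)
        self.counts = defaultdict(int)  # live rows per customer, to drop empty groups
        self.refreshed_through = 0      # number of log entries already applied

    def refresh(self, change_log, up_to):
        """Apply log entries [refreshed_through, up_to) and return how many were applied."""
        delta = change_log[self.refreshed_through:up_to]
        for sign, customer, amount in delta:
            self.totals[customer] += sign * amount
            self.counts[customer] += sign
        self.refreshed_through = up_to
        return len(delta)

    def result(self):
        return {c: self.totals[c] for c, n in self.counts.items() if n > 0}


# +1 marks an insert and -1 a delete of the same (customer, amount) row.
change_log = [(+1, "a", 10), (+1, "b", 5), (+1, "a", 7), (-1, "b", 5), (+1, "c", 2)]

view = IncrementalTotals()
first = view.refresh(change_log, up_to=3)
snapshot_v3 = view.result()
second = view.refresh(change_log, up_to=5)
snapshot_v5 = view.result()
print(first, snapshot_v3, second, snapshot_v5)
```

The second refresh touches only two log entries, yet its result matches a full recomputation over the table as of version 5.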
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- Your host is Tobias Macey and today I'm interviewing Dan Sotolongo about the challenges of incremental data processing in warehouse environments and how delayed view semantics help to address the problem

Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by defining the scope of the term "incremental data processing"?
- What are some of the common solutions that data engineers build when creating workflows to implement that pattern?
- What are some common difficulties that they encounter in the pursuit of incremental data?
- Can you describe what delayed view semantics are and the story behind it?
- What are the problems that DVS explicitly doesn't address?
- How does the approach that you have taken in Delayed View Semantics compare to systems like Materialize, Feldera, etc.?
- Can you describe the technical architecture of the implementation of Dynamic Tables?
- What are the elements of the problem that are as-yet unsolved?
- How has the implementation changed/evolved as you learned more about the solution space?
- What would be involved in implementing the delayed view semantics pattern in other DBMS engines?
- For someone who wants to use DVS/Dynamic Tables for managing their incremental data loads, what does the workflow look like?
- What are the options for being able to apply tests/validation logic to a dynamic table while it is operating?
- What are the most interesting, innovative, or unexpected ways that you have seen Dynamic Tables used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Dynamic Tables/Delayed View Semantics?
- When are Dynamic Tables/DVS the wrong choice?
- What do you have planned for the future of Dynamic Tables?

Contact Info
- LinkedIn

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links
Delayed View Semantics: Presentation Slides, Snowflake, NumPy, IPython, Jupyter, Flink, Spark Streaming, Kafka, Snowflake Dynamic Tables, Airflow, Dagster, Streaming Watermarks, Materialize, Feldera, ACID, CAP Theorem, Linearizability, Serializable Consistency, SIGMOD, Materialized Views, dbt, Data Vault, Apache Iceberg, Databricks Delta, Hudi, Dead Letter Queue, pg_ivm, Property Based Testing, Iceberg V3 Row Lineage, Prometheus

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Summary In this episode of the Data Engineering Podcast Kacper Łukawski from Qdrant talks about integrating MCP servers with vector databases to process unstructured data. Kacper shares his experience in data engineering, from building big data pipelines in the automotive industry to leveraging large language models (LLMs) for transforming unstructured datasets into valuable assets. He discusses the challenges of building data pipelines for unstructured data and how vector databases facilitate semantic search and retrieval-augmented generation (RAG) applications. Kacper delves into the intricacies of vector storage and search, including metadata and contextual elements, and explores the evolution of vector engines beyond RAG to applications like semantic search and anomaly detection. The conversation covers the role of Model Context Protocol (MCP) servers in simplifying data integration and retrieval processes, highlighting the need for experimentation and evaluation when adopting LLMs, and offering practical advice on optimizing vector search costs and fine-tuning embedding models for improved search quality.
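As a rough illustration of how vector search combines similarity with metadata, the sketch below performs a brute-force cosine-similarity search with an optional payload filter. It does not use Qdrant or any MCP server; the `search` function, the record layout, and the three-dimensional vectors are hypothetical stand-ins (real embeddings come from a model and have hundreds of dimensions, and real engines use approximate indexes instead of a full scan).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index, query, top_k=2, where=None):
    """Brute-force nearest-neighbour search with an optional metadata filter."""
    candidates = [
        (cosine(item["vector"], query), item)
        for item in index
        if where is None or all(item["payload"].get(k) == v for k, v in where.items())
    ]
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [(round(score, 3), item["id"]) for score, item in candidates[:top_k]]


# Hand-made toy vectors standing in for model-generated embeddings.
index = [
    {"id": "doc-1", "vector": [0.9, 0.1, 0.0], "payload": {"lang": "en"}},
    {"id": "doc-2", "vector": [0.8, 0.2, 0.1], "payload": {"lang": "de"}},
    {"id": "doc-3", "vector": [0.0, 0.1, 0.9], "payload": {"lang": "en"}},
]
query = [1.0, 0.0, 0.0]

unfiltered = search(index, query)
english_only = search(index, query, where={"lang": "en"})
print(unfiltered, english_only)
```

Filtering on the payload before ranking is what lets the same index serve both plain semantic search and context retrieval constrained by attributes such as language or source.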
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- Your host is Tobias Macey and today I'm interviewing Kacper Łukawski about how MCP servers can be paired with vector databases to streamline processing of unstructured data

Interview
- Introduction
- How did you get involved in the area of data management?
- LLMs are enabling the derivation of useful data assets from unstructured sources. What are the challenges that teams face in building the pipelines to support that work?
- How has the role of vector engines grown or evolved in the past ~2 years as LLMs have gained broader adoption?
- Beyond its role as a store of context for agents, RAG, etc., what other applications are common for vector databases?
- In the ecosystem of vector engines, what are the distinctive elements of Qdrant?
- How has the MCP specification simplified the work of processing unstructured data?
- Can you describe the toolchain and workflow involved in building a data pipeline that leverages an MCP for generating embeddings?
- helping data engineers gain confidence in non-deterministic workflows
- bringing application/ML/data teams into collaboration for determining the impact of e.g. chunking strategies, embedding model selection, etc.
- What are the most interesting, innovative, or unexpected ways that you have seen MCP and Qdrant used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on vector use cases?
- When is MCP and/or Qdrant the wrong choice?
- What do you have planned for the future of MCP with Qdrant?

Contact Info
- LinkedIn
- Twitter/X
- Personal website

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links
Qdrant, Kafka, Apache Oozie, Named Entity Recognition, GraphRAG, pgvector, Elasticsearch, Apache Lucene, OpenSearch, BM25, Semantic Search, MCP == Model Context Protocol, Anthropic Contextualized Chunking, Cohere

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
The rapid growth of scientific data repositories demands innovative solutions for efficient metadata creation. In this talk, we present our open-source project that leverages large language models to automate the generation of standard-compliant metadata files from raw scientific datasets. Our approach harnesses the capabilities of pre-trained open-source models, fine-tuned with domain-specific data, and integrated with LangGraph to orchestrate a modular, end-to-end pipeline capable of ingesting heterogeneous raw data files and outputting metadata conforming to specific standards.
The methodology involves a multi-stage process where raw data is first parsed and analyzed by the LLM to extract relevant scientific and contextual information. This information is then structured into metadata templates that adhere strictly to recognized standards, thereby reducing human error and accelerating the data release cycle. We demonstrate the effectiveness of our approach using the USGS ScienceBase repository, where we have successfully generated metadata for a variety of scientific datasets, including images, time series, and text data.
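The staged flow described above (parse, extract, fill a template, check against a standard) can be sketched as follows. This is not the project's code, which the abstract says is not yet published. The function names, the `REQUIRED_FIELDS` tuple, and the sample CSV are hypothetical, and the extraction step is a deterministic placeholder standing in for the fine-tuned LLM.

```python
import csv
import io

# Hypothetical stand-in for the required elements of a real metadata standard.
REQUIRED_FIELDS = ("title", "columns", "row_count")


def parse_raw(text):
    """Stage 1: parse a raw CSV file into column names and rows."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return {"columns": list(rows[0].keys()) if rows else [], "rows": rows}


def extract_info(parsed, description):
    """Stage 2: placeholder for the LLM step, using simple rules instead of a model."""
    return {
        "title": description.strip().title(),
        "columns": parsed["columns"],
        "row_count": len(parsed["rows"]),
    }


def build_metadata(info):
    """Stage 3: fill the template and reject records with missing required fields."""
    missing = [f for f in REQUIRED_FIELDS if info.get(f) in (None, "", [])]
    if missing:
        raise ValueError(f"missing required metadata fields: {missing}")
    return {field: info[field] for field in REQUIRED_FIELDS}


raw = "site,date,temp_c\nA,2024-01-01,3.5\nB,2024-01-01,4.1\n"
metadata = build_metadata(extract_info(parse_raw(raw), "stream temperature readings"))
print(metadata)
```

Keeping validation as a separate final stage means the output is checked against the template regardless of how the extraction step produced it.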
Beyond its immediate application to the USGS ScienceBase repository, our open-source framework is designed to be extensible, allowing adaptation to other data release processes across various scientific domains. We will discuss the technical challenges encountered, such as managing diverse data formats and ensuring metadata quality, and outline strategies for community-driven enhancements. This work not only streamlines the metadata creation workflow but also sets the stage for broader adoption of generative AI in scientific data management.
Additional Material:
- Project supported by USGS and ORNL
- Codebase will be available on GitHub after paper publication
- Fine-tuned LLMs will be available on Hugging Face after paper publication
After two decades of planning, Rubin Observatory is finally observing the sky. Built to image the entire southern hemisphere every few nights with a 3.2-gigapixel camera, Rubin will produce a time-lapse of the Universe, revealing moving asteroids, pulsing stars, supernovae, and rare transients that you only catch if you're always watching.
In this talk, I'll share the “first look” images from Rubin Observatory as well as what it took to get here: from scalable algorithms to infrastructure that moves data from a mountaintop in Chile to scientists around the world in seconds. I'll reflect on what we learned building the data management system in Python over the years, including stories of choices that impacted scalability, interfaces, and maintainability. Rubin Observatory is here. And it's for you.
In this episode of Hub & Spoken, Jason Foster talks to Cali Wood, Head of Data and AI Strategy & Culture at AXA UK and Ireland. Cali shares how AXA is shaping its data and AI transformation through a clear strategic framework built on creation of value, connection of data and tooling, and culture to accelerate value. From embedding human-centred design into automation use cases to launching a data and AI academy with more than 50% workforce engagement, AXA is making data and AI a true business-wide initiative.

This conversation explores:
- The three pillars of AXA's data and AI strategy
- How culture and leadership unlock real business value
- Scaling responsible AI across a highly regulated industry
- Evolving from traditional to agentic AI in a people-first way

Whether you're leading data transformation or navigating GenAI, this episode offers practical ideas and inspiration to help bring your people and strategy together. Listen now to learn how to build AI-driven change - the right way.
Deliver flexible, scalable, and high-performance data storage that's perfect for AI and other modern applications with MongoDB 8.0 and MongoDB Atlas multi-cloud data platform. In MongoDB 8.0 in Action, Third Edition you'll find comprehensive coverage of the latest version of MongoDB 8.0 and the MongoDB Atlas multi-cloud data platform. Learn to utilize MongoDB's flexible schema design for data modeling, scale applications effectively using advanced sharding features, integrate full-text and vector-based semantic search, and more. This totally revised new edition delivers engaging hands-on tutorials and examples that put MongoDB into action!

In MongoDB 8.0 in Action, Third Edition you'll:
- Master new features in MongoDB 8.0
- Create your first, free Atlas cluster using the Atlas CLI
- Design scalable NoSQL databases with effective data modeling techniques
- Master Vector Search for building GenAI-driven applications
- Utilize advanced search capabilities in MongoDB Atlas, including full-text search
- Build Event-Driven Applications with Atlas Stream Processing
- Deploy and manage MongoDB Atlas clusters both locally and in the cloud using the Atlas CLI
- Leverage the Atlas SQL interface for familiar SQL querying
- Use MongoDB Atlas Online Archive for efficient data management
- Establish robust security practices including encryption
- Master backup and restore strategies
- Optimize database performance and identify slow queries

MongoDB 8.0 in Action, Third Edition offers a clear, easy-to-understand introduction to everything in MongoDB 8.0 and MongoDB Atlas, including new advanced features such as embedded config servers in sharded clusters, or moving an unsharded collection to a different shard. The book also covers Atlas stream processing, full text search, and vector search capabilities for generative AI applications. Each chapter is packed with tips, tricks, and practical examples you can quickly apply to your projects, whether you're brand new to MongoDB or looking to get up to speed with the latest version.

About the Technology
MongoDB is the database of choice for storing structured, semi-structured, and unstructured data like business documents and other text and image files. MongoDB 8.0 introduces a range of exciting new features, from sharding improvements that simplify the management of distributed data to performance enhancements that stay resilient under heavy workloads. Plus, MongoDB Atlas brings vector search and full-text search features that support AI-powered applications.

About the Book
In MongoDB 8.0 in Action, Third Edition you'll learn how to take advantage of all the new features of MongoDB 8.0, including the powerful MongoDB Atlas multi-cloud data platform. You'll start with the basics of setting up and managing a document database. Then, you'll learn how to use MongoDB for AI-driven applications, implement advanced stream processing, and optimize performance with improved indexing and query handling. Hands-on projects like creating a RAG-based chatbot and building an aggregation pipeline mean you'll really put MongoDB into action!

What's Inside
- The new features in MongoDB 8.0
- Get familiar with MongoDB's Atlas cloud platform
- Utilizing sharding enhancements
- Using vector-based search technologies
- Full-text search capabilities for efficient text indexing and querying

About the Reader
For developers and DBAs of all levels. No prior experience with MongoDB required.

About the Author
Arek Borucki is a MongoDB Champion, certified MongoDB and MongoDB Atlas administrator with expertise in distributed systems, NoSQL databases, and Kubernetes.

Quotes
- An excellent resource with real-world examples and best practices to design, optimize, and scale modern applications. - Advait Patel, Broadcom
- Essential MongoDB resource. Covers new features such as full-text search, vector search, AI, and RAG applications. - Juan Roy, Credit Suisse
- Reflects author's practical experience and clear teaching style. It's packed with real-world examples and up-to-date insights. - Rajesh Nair, MongoDB Champion & community leader
- This book will definitely make you a MongoDB star! - Vinicios Wentz, JPMorgan Chase & Co.
Cloud-optimized (CO) data formats are designed to store data so that it can be accessed directly from cloud storage without downloading the entire dataset. These formats enable faster retrieval, better scalability, and lower cost by letting users fetch only the subsets of data they need. They also allow efficient parallel processing through on-the-fly partitioning, which can considerably accelerate data management operations. This makes cloud-optimized data a good fit for data-parallel jobs on serverless platforms. Function-as-a-Service (FaaS) offers a data-driven, scalable, and cost-efficient experience with practically no management burden. Each serverless function reads and processes a small portion of the cloud-optimized dataset in parallel, directly from object storage, which significantly speeds up the overall job.
In this talk, you will learn how to process cloud-optimized data formats in Python using the Lithops toolkit. Lithops is a serverless data processing toolkit designed to process data from cloud object storage using serverless functions. We will also demonstrate the Dataplug library, which enables cloud-optimized data management in scientific settings such as genomics, metabolomics, and geospatial data. We will show several data processing pipelines in the cloud that demonstrate the benefits of cloud-optimized data management.
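The partition-and-process pattern behind this abstract can be sketched with the standard library alone. The example below does not use Lithops or Dataplug: an in-memory `bytes` object stands in for an object in cloud storage, slicing stands in for a ranged read, and a thread pool stands in for a fleet of serverless functions. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def make_partitions(total_size, chunk_size):
    """Split [0, total_size) into byte ranges, one per worker invocation."""
    return [
        (start, min(start + chunk_size, total_size))
        for start in range(0, total_size, chunk_size)
    ]


def read_range(blob, byte_range):
    """Stand-in for a ranged read: fetch only the requested bytes."""
    start, end = byte_range
    return blob[start:end]


def count_records(blob, byte_range):
    """Work done by one worker: count newline-terminated records in its range."""
    return read_range(blob, byte_range).count(b"\n")


# In-memory stand-in for a large object in cloud storage.
blob = b"".join(f"record-{i}\n".encode() for i in range(1000))
partitions = make_partitions(len(blob), chunk_size=1024)

# The thread pool plays the role of parallel serverless functions.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_counts = list(pool.map(lambda r: count_records(blob, r), partitions))

total = sum(partial_counts)
print(len(partitions), total)
```

Counting record terminators is safe to split at arbitrary byte offsets because each newline falls in exactly one range. Formats with richer structure need partition boundaries aligned to chunks or an index, which is the kind of work a cloud-optimized format or preprocessing step takes on.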
Summary In this episode of the Data Engineering Podcast Effie Baram, a leader in foundational data engineering at Two Sigma, talks about the complexities and innovations in data engineering within the finance sector. She discusses the critical role of data at Two Sigma, balancing data quality with delivery speed, and the socio-technical challenges of building a foundational data platform that supports research and operational needs while maintaining regulatory compliance and data quality. Effie also shares insights into treating data as code, leveraging modern data warehouses, and the evolving role of data engineers in a rapidly changing technological landscape.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
This episode is brought to you by Coresignal, your go-to source for high-quality public web data to power best-in-class AI products. Instead of spending time collecting, cleaning, and enriching data in-house, use ready-made multi-source B2B data that can be smoothly integrated into your systems via APIs or as datasets. With over 3 billion data records from 15+ online sources, Coresignal delivers high-quality data on companies, employees, and jobs. It is powering decision-making for more than 700 companies across AI, investment, HR tech, sales tech, and market intelligence industries. A founding member of the Ethical Web Data Collection Initiative, Coresignal stands out not only for its data quality but also for its commitment to responsible data collection practices. Recognized as the top data provider by Datarade for two consecutive years, Coresignal is the go-to partner for those who need fresh, accurate, and ethically sourced B2B data at scale. Discover how Coresignal's data can enhance your AI platforms. Visit dataengineeringpodcast.com/coresignal to start your free 14-day trial.
Your host is Tobias Macey and today I'm interviewing Effie Baram about data engineering in the finance sector.

Interview
Introduction
How did you get involved in the area of data management?
Can you start by outlining the role of data in the context of Two Sigma?
What are some of the key characteristics of the types of data sources that you work with?
Your role is leading "foundational data engineering" at Two Sigma. Can you unpack that title and how it shapes the ways that you think about what you build?
How does the concept of "foundational data" influence the ways that the business thinks about the organizational patterns around data?
Given the regulatory environment around finance, how does that impact the ways that you think about the "what" and "how" of the data that you deliver to data consumers?
Being the foundational team for data use at Two Sigma, how have you approached the design and architecture of your technical systems?
How do you think about the boundaries between your responsibilities and the rest of the organization?
What are the design patterns that you have found most helpful in empowering data consumers to build on top of your work?
What are some of the elements of sociotechnical friction that have been most challenging to address?
What are the most interesting, innovative, or unexpected ways that you have seen the ideas around "foundational data" applied in your organization?
What are the most interesting, unexpected, or challenging lessons that you have learned while working with financial data?
When is a foundational data team the wrong approach?
What do you have planned for the future of your platform design?

Contact Info
LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links
2Sigma
Reliability Engineering
SLA == Service-Level Agreement
Airflow
Parquet File Format
BigQuery
Snowflake
dbt
Gemini Assist
MCP == Model Context Protocol
dtrace

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Jason Bryll is a healthcare analytics expert and hiring manager with nearly two decades of experience. In this episode, Jason explains what healthcare analytics entails, why it's essential, and the role of AI in the field. More importantly, you'll learn how to stand out to hiring managers, even in today's market!

Wanna dive further into healthcare analytics? Here's your next podcast: 👉 https://datacareerpodcast.com/episode/160-she-became-a-data-analyst-after-a-20-year-career-in-physical-therapy-melody-santos

💌 Join 10k+ aspiring data analysts & get my tips in your inbox weekly 👉 https://www.datacareerjumpstart.com/newsletter
🆘 Feeling stuck in your data journey? Come to my next free "How to Land Your First Data Job" training 👉 https://www.datacareerjumpstart.com/training
👩💻 Want to land a data job in less than 90 days? 👉 https://www.datacareerjumpstart.com/daa
👔 Ace The Interview with Confidence 👉 https://www.datacareerjumpstart.com/interviewsimulator

⌚ TIMESTAMPS
00:00 Introduction to Healthcare Analytics
00:27 Jason's Career Journey
02:23 What Is Healthcare Analytics?
06:37 Parable Associates
10:56 Understanding Revenue Cycle and Accounts Receivable
15:00 Complexities in Healthcare Data Management
19:47 The Importance of Domain Knowledge
27:12 The Importance of Building a Strong Portfolio
31:43 Recommended Data Tools and Platforms
34:10 Advice To Become A Healthcare Analyst

🔗 CONNECT WITH JASON BRYLL
🎥 YouTube Channel: https://www.youtube.com/@UCGh1LOrX0mWuoWZk5J10zkw
🤝 LinkedIn: https://www.linkedin.com/in/jason-bryll/
📸 Instagram: https://www.instagram.com/parable_associates/
💻 Website: https://parableassociates.com/
Check out Jason's Healthcare Analyst courses here: 👉 https://www.parableacademy.com/link/d7GlNy?url=https%3A%2F%2Fwww.parableacademy.com%2Fcourse%3Fcourseid%3Drcm-analyst

🔗 CONNECT WITH AVERY
🎥 YouTube Channel: https://www.youtube.com/@averysmith
🤝 LinkedIn: https://www.linkedin.com/in/averyjsmith/
📸 Instagram: https://instagram.com/datacareerjumpstart
🎵 TikTok: https://www.tiktok.com/@verydata
💻 Website: https://www.datacareerjumpstart.com/

Mentioned in this episode:
Join the last cohort of 2025! The LAST cohort of The Data Analytics Accelerator for 2025 kicks off on Monday, December 8th and enrollment is officially open!
To celebrate the end of the year, we’re running a special End-of-Year Sale, where you’ll get: ✅ A discount on your enrollment 🎁 6 bonus gifts, including job listings, interview prep, AI tools + more
If your goal is to land a data job in 2026, this is your chance to get ahead of the competition and start strong.
👉 Join the December Cohort & Claim Your Bonuses: https://www.datacareerjumpstart.com/daa
Metadata management is a cornerstone of effective data governance, yet it presents unique challenges distinct from traditional data engineering. At scale, efficiently extracting metadata from relational and NoSQL databases demands specialized solutions. To address this, our team has developed custom Airflow operators that scan and extract metadata across various database technologies, orchestrating 100+ production jobs to ensure continuous and reliable metadata collection. Now, we’re expanding beyond databases to tackle non-traditional data sources such as file repositories and message queues. This shift introduces new complexities, including processing structured and unstructured files, managing schema evolution in streaming data, and maintaining metadata consistency across heterogeneous sources. In this session, we’ll share our approach to building scalable metadata scanners, optimizing performance, and ensuring adaptability across diverse data environments. Attendees will gain insights into designing efficient metadata pipelines, overcoming common pitfalls, and leveraging Airflow to drive metadata governance at scale.
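A metadata scanner of the kind described here can be sketched with the standard library. This is a minimal illustration, not the speakers' implementation: SQLite stands in for a production database, and `scan_metadata` and `build_demo_db` are hypothetical names. In an Airflow deployment, logic like this would typically live inside a custom operator's `execute` method.

```python
import sqlite3


def scan_metadata(conn):
    """Return {table_name: [(column_name, declared_type), ...]} for a SQLite connection."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name")]
    metadata = {}
    for table in tables:
        # Quote the identifier so unusual table names cannot break the statement.
        quoted = '"' + table.replace('"', '""') + '"'
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f"PRAGMA table_info({quoted})").fetchall()
        metadata[table] = [(col[1], col[2]) for col in columns]
    return metadata


def build_demo_db():
    """Create a small in-memory database to scan (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    conn.execute("CREATE TABLE orders (id INTEGER, user_id INTEGER, total REAL)")
    return conn


if __name__ == "__main__":
    for table, columns in scan_metadata(build_demo_db()).items():
        print(table, columns)
```

Returning a plain dictionary keeps the scanner independent of where the metadata ends up; a catalog writer can consume the same structure regardless of which database technology produced it.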
At OLX, we connect millions of people daily through our online marketplace while relying on robust data pipelines. In this talk, we explore how the DAG Factory concept elevates data governance, lineage, and discovery by centralizing operator logic and restricting direct DAG creation. This approach enforces code quality, optimizes resources, maintains infrastructure hygiene and enables smooth version upgrades. We then leverage consistent naming conventions in Airflow to build targeted namespaces, aligning teams with global policies while preserving autonomy. Integrating external tools like AWS Lake Formation and Open Metadata further unifies governance, making it straightforward to manage and secure data. This is critical when handling hundreds or even thousands of active DAGs. If the idea of storing 1,600 pipelines in one folder seems overwhelming, join us to learn how the DAG Factory concept simplifies pipeline management. We’ll also share insights from OLX, highlighting how thoughtful design fosters oversight, efficiency, and discoverability across diverse use cases.
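The combination of a central factory and naming-derived namespaces can be sketched as follows. Everything here is an assumption made for illustration: the `team__domain__job` naming pattern, the allowed task types, and the `PipelineSpec` and `build_pipeline` names are hypothetical and are not OLX's actual conventions or the API of any DAG factory library.

```python
import re
from dataclasses import dataclass

# Hypothetical convention: <team>__<domain>__<job>, lowercase only.
NAME_PATTERN = re.compile(
    r"^(?P<team>[a-z]+)__(?P<domain>[a-z]+)__(?P<job>[a-z0-9_]+)$")

# Hypothetical set of centrally maintained task types.
ALLOWED_TASK_TYPES = {"extract", "transform", "load"}


@dataclass(frozen=True)
class PipelineSpec:
    name: str
    namespace: str
    tasks: tuple


def build_pipeline(name, tasks):
    """Single entry point for pipeline creation: validate first, then build."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"pipeline name {name!r} violates the naming convention")
    unknown = [task for task in tasks if task not in ALLOWED_TASK_TYPES]
    if unknown:
        raise ValueError(f"unsupported task types: {unknown}")
    # The namespace is derived from the name, so governance tooling can group
    # pipelines by team and domain without extra configuration.
    namespace = f"{match['team']}/{match['domain']}"
    return PipelineSpec(name=name, namespace=namespace, tasks=tuple(tasks))


if __name__ == "__main__":
    spec = build_pipeline("ads__listings__daily_refresh", ["extract", "load"])
    print(spec.namespace, spec.tasks)
```

Because every pipeline passes through one function, policy changes such as a new naming rule or a deprecated task type are enforced in a single place rather than across hundreds of hand-written definitions.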
Summary In this episode of the Data Engineering Podcast Arun Joseph talks about developing and implementing agent platforms to empower businesses with agentic capabilities. From leading AI engineering at Deutsche Telekom to his current entrepreneurial venture focused on multi-agent systems, Arun shares insights on building agentic systems at an organizational scale, highlighting the importance of robust models, data connectivity, and orchestration loops. Listen in as he discusses the challenges of managing data context and cost in large-scale agent systems, the need for a unified context management platform to prevent data silos, and the potential for open-source projects like LMOS to provide a foundational substrate for agentic use cases that can transform enterprise architectures by enabling more efficient data management and decision-making processes.
Your host is Tobias Macey and today I'm interviewing Arun Joseph about building an agent platform to empower the business to adopt agentic capabilities.

Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of how Deutsche Telekom has been approaching applications of generative AI?
What are the key challenges that have slowed adoption/implementation?
Enabling non-engineering teams to define and manage AI agents in production is a challenging goal. From a data engineering perspective, what does the abstraction layer for these teams look like? How do you manage the underlying data pipelines, versioning of agents, and monitoring of these user-defined agents?
What was your process for developing the architecture and interfaces for what ultimately became the LMOS?
How do the principles of operating systems help with managing the abstractions and composability of the framework?
Can you describe the overall architecture of the LMOS?
What does a typical workflow look like for someone who wants to build a new agent use case?
How do you handle data discovery and embedding generation to avoid unnecessary duplication of processing?
With your focus on openness and local control, how do you see your work complementing projects like Oumi?
What are the most interesting, innovative, or unexpected ways that you have seen LMOS used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on LMOS?
When is LMOS the wrong choice?
What do you have planned for the future of LMOS and MASAIC?

Contact Info
LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links
LMOS
Deutsche Telekom
MASAIC
OpenAI Agents SDK
RAG == Retrieval Augmented Generation
LangChain
Marvin Minsky
Vector Database
MCP == Model Context Protocol
A2A (Agent to Agent) Protocol
Qdrant
LlamaIndex
DVC == Data Version Control
Kubernetes
Kotlin
Istio
Xerox PARC
OODA (Observe, Orient, Decide, Act) Loop

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
In this solo episode, Cynozure CEO Jason Foster explores what it really means to create value with data and AI and why it can't be treated as a nice-to-have outcome at the end of a project. Jason breaks down a practical, repeatable approach to designing value in from the start, with clear intent, strong foundations, and input-focused delivery. He shares real-world examples and analogies to show how organisations can shift from vague goals to measurable, meaningful impact. This episode is packed with actionable insights for data and business leaders who want to move from theory to practice and ensure their data work truly makes a difference.
Cynozure is a leading data, analytics and AI company that helps organisations to reach their data potential. It works with clients on data and AI strategy, data management, data architecture and engineering, analytics and AI, data culture and literacy, and data leadership. The company was named one of The Sunday Times' fastest-growing private companies in both 2022 and 2023 and recognised as The Best Place to Work in Data by DataIQ in 2023 and 2024. Cynozure is a certified B Corporation.
In this episode of Data Unchained, we sit down with Malcolm Hawker, former Gartner analyst and Chief Data Officer at Profisee, to expose the real barriers to AI adoption. We explore why Master Data Management (MDM) is the foundation enterprises overlook, how decentralized systems and unstructured data derail governance, and why CDOs must evolve their role or risk irrelevance. This conversation challenges the myth of a single source of truth, breaks down the politics of data ownership, and offers a new vision for aligning data strategy with AI innovation.
#AIReadiness #MasterDataManagement #DataGovernance #CDOInsights #EnterpriseAI #DataStrategy #UnstructuredData #DataInfrastructure #DigitalTransformation #AILeadership #DataUnchained #Profisee #MalcolmHawker #MollyPresley #TechInnovation
Cyberpunk by jiglr | https://soundcloud.com/jiglrmusic Music promoted by https://www.free-stock-music.com Creative Commons Attribution 3.0 Unported License https://creativecommons.org/licenses/by/3.0/deed.en_US Hosted on Acast. See acast.com/privacy for more information.
Metadata, data quality, and data observability tools provide significant capabilities to ensure good data for your BI and AI initiatives. Metadata tools help discover and inventory your data assets. Data quality tools help business users manage their data at the source by setting rules and policies. Data observability tools give organizations integrated visibility over the health of their data, data pipelines, and data landscape. Together, these tools help organizations lay a good data management foundation for BI and AI initiatives.