talk-data.com

Topic: Data Collection (39 tagged activities)

Activity trend: peak of 17 activities per quarter, 2020-Q1 to 2026-Q1

Activities (39 activities · Newest first)

Summary In this episode of the Data Engineering Podcast Effie Baram, a leader in foundational data engineering at Two Sigma, talks about the complexities and innovations in data engineering within the finance sector. She discusses the critical role of data at Two Sigma, balancing data quality with delivery speed, and the socio-technical challenges of building a foundational data platform that supports research and operational needs while maintaining regulatory compliance and data quality. Effie also shares insights into treating data as code, leveraging modern data warehouses, and the evolving role of data engineers in a rapidly changing technological landscape.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.

This episode is brought to you by Coresignal, your go-to source for high-quality public web data to power best-in-class AI products. Instead of spending time collecting, cleaning, and enriching data in-house, use ready-made multi-source B2B data that can be smoothly integrated into your systems via APIs or as datasets. With over 3 billion data records from 15+ online sources, Coresignal delivers high-quality data on companies, employees, and jobs. It is powering decision-making for more than 700 companies across AI, investment, HR tech, sales tech, and market intelligence industries. A founding member of the Ethical Web Data Collection Initiative, Coresignal stands out not only for its data quality but also for its commitment to responsible data collection practices. Recognized as the top data provider by Datarade for two consecutive years, Coresignal is the go-to partner for those who need fresh, accurate, and ethically sourced B2B data at scale. Discover how Coresignal's data can enhance your AI platforms. Visit dataengineeringpodcast.com/coresignal to start your free 14-day trial.

Your host is Tobias Macey and today I'm interviewing Effie Baram about data engineering in the finance sector.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by outlining the role of data in the context of Two Sigma?
What are some of the key characteristics of the types of data sources that you work with?
Your role is leading "foundational data engineering" at Two Sigma. Can you unpack that title and how it shapes the ways that you think about what you build?
How does the concept of "foundational data" influence the ways that the business thinks about the organizational patterns around data?
Given the regulatory environment around finance, how does that impact the ways that you think about the "what" and "how" of the data that you deliver to data consumers?
Being the foundational team for data use at Two Sigma, how have you approached the design and architecture of your technical systems?
How do you think about the boundaries between your responsibilities and the rest of the organization?
What are the design patterns that you have found most helpful in empowering data consumers to build on top of your work?
What are some of the elements of sociotechnical friction that have been most challenging to address?
What are the most interesting, innovative, or unexpected ways that you have seen the ideas around "foundational data" applied in your organization?
What are the most interesting, unexpected, or challenging lessons that you have learned while working with financial data?
When is a foundational data team the wrong approach?
What do you have planned for the future of your platform design?

Contact Info

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links

2Sigma, Reliability Engineering, SLA == Service-Level Agreement, Airflow, Parquet File Format, BigQuery, Snowflake, dbt, Gemini Assist, MCP == Model Context Protocol, dtrace

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Summary In this episode of the Data Engineering Podcast Arun Joseph talks about developing and implementing agent platforms to empower businesses with agentic capabilities. From leading AI engineering at Deutsche Telekom to his current entrepreneurial venture focused on multi-agent systems, Arun shares insights on building agentic systems at an organizational scale, highlighting the importance of robust models, data connectivity, and orchestration loops. Listen in as he discusses the challenges of managing data context and cost in large-scale agent systems, the need for a unified context management platform to prevent data silos, and the potential for open-source projects like LMOS to provide a foundational substrate for agentic use cases that can transform enterprise architectures by enabling more efficient data management and decision-making processes.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.

This episode is brought to you by Coresignal, your go-to source for high-quality public web data to power best-in-class AI products. Instead of spending time collecting, cleaning, and enriching data in-house, use ready-made multi-source B2B data that can be smoothly integrated into your systems via APIs or as datasets. With over 3 billion data records from 15+ online sources, Coresignal delivers high-quality data on companies, employees, and jobs. It is powering decision-making for more than 700 companies across AI, investment, HR tech, sales tech, and market intelligence industries. A founding member of the Ethical Web Data Collection Initiative, Coresignal stands out not only for its data quality but also for its commitment to responsible data collection practices. Recognized as the top data provider by Datarade for two consecutive years, Coresignal is the go-to partner for those who need fresh, accurate, and ethically sourced B2B data at scale. Discover how Coresignal's data can enhance your AI platforms. Visit dataengineeringpodcast.com/coresignal to start your free 14-day trial.

Your host is Tobias Macey and today I'm interviewing Arun Joseph about building an agent platform to empower the business to adopt agentic capabilities.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of how Deutsche Telekom has been approaching applications of generative AI?
What are the key challenges that have slowed adoption/implementation?
Enabling non-engineering teams to define and manage AI agents in production is a challenging goal. From a data engineering perspective, what does the abstraction layer for these teams look like? How do you manage the underlying data pipelines, versioning of agents, and monitoring of these user-defined agents?
What was your process for developing the architecture and interfaces for what ultimately became the LMOS?
How do the principles of operating systems help with managing the abstractions and composability of the framework?
Can you describe the overall architecture of the LMOS?
What does a typical workflow look like for someone who wants to build a new agent use case?
How do you handle data discovery and embedding generation to avoid unnecessary duplication of processing?
With your focus on openness and local control, how do you see your work complementing projects like Oumi?
What are the most interesting, innovative, or unexpected ways that you have seen LMOS used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on LMOS?
When is LMOS the wrong choice?
What do you have planned for the future of LMOS and MASAIC?

Contact Info

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links

LMOS, Deutsche Telekom, MASAIC, OpenAI Agents SDK, RAG == Retrieval Augmented Generation, LangChain, Marvin Minsky, Vector Database, MCP == Model Context Protocol, A2A (Agent to Agent) Protocol, Qdrant, LlamaIndex, DVC == Data Version Control, Kubernetes, Kotlin, Istio, Xerox PARC, OODA (Observe, Orient, Decide, Act) Loop

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Summary In this episode of the Data Engineering Podcast we welcome back Nick Schrock, CTO and founder of Dagster Labs, to discuss the evolving landscape of data engineering in the age of AI. As AI begins to impact data platforms and the role of data engineers, Nick shares his insights on how it will ultimately enhance productivity and expand software engineering's scope. He delves into the current state of AI adoption, the importance of maintaining core data engineering principles, and the need for human oversight when leveraging AI tools effectively. Nick also introduces Dagster's new components feature, designed to modularize and standardize data transformation processes, making it easier for teams to collaborate and integrate AI into their workflows. Join in to explore the future of data engineering, the potential for AI to abstract away complexity, and the importance of open standards in preventing walled gardens in the tech industry.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

This episode is brought to you by Coresignal, your go-to source for high-quality public web data to power best-in-class AI products. Instead of spending time collecting, cleaning, and enriching data in-house, use ready-made multi-source B2B data that can be smoothly integrated into your systems via APIs or as datasets. With over 3 billion data records from 15+ online sources, Coresignal delivers high-quality data on companies, employees, and jobs. It is powering decision-making for more than 700 companies across AI, investment, HR tech, sales tech, and market intelligence industries. A founding member of the Ethical Web Data Collection Initiative, Coresignal stands out not only for its data quality but also for its commitment to responsible data collection practices. Recognized as the top data provider by Datarade for two consecutive years, Coresignal is the go-to partner for those who need fresh, accurate, and ethically sourced B2B data at scale. Discover how Coresignal's data can enhance your AI platforms. Visit dataengineeringpodcast.com/coresignal to start your free 14-day trial.

Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.

This is a pharmaceutical ad for Soda Data Quality. Do you suffer from chronic dashboard distrust? Are broken pipelines and silent schema changes wreaking havoc on your analytics? You may be experiencing symptoms of Undiagnosed Data Quality Syndrome — also known as UDQS. Ask your data team about Soda. With Soda Metrics Observability, you can track the health of your KPIs and metrics across the business — automatically detecting anomalies before your CEO does. It's 70% more accurate than industry benchmarks, and the fastest in the category, analyzing 1.1 billion rows in just 64 seconds. And with Collaborative Data Contracts, engineers and business can finally agree on what "done" looks like — so you can stop fighting over column names, and start trusting your data again. Whether you're a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you. Side effects of implementing Soda may include: increased trust in your metrics, reduced late-night Slack emergencies, spontaneous high-fives across departments, fewer meetings and less back-and-forth with business stakeholders, and in rare cases, a newfound love of data. Sign up today to get a chance to win a $1000+ custom mechanical keyboard. Visit dataengineeringpodcast.com/soda to sign up and follow Soda's launch week. It starts June 9th.

Your host is Tobias Macey and today I'm interviewing Nick Schrock about lowering the barrier to entry for data platform consumers.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by giving your summary of the impact that the tidal wave of AI has had on data platforms and data teams?
For anyone who hasn't heard of Dagster, can you give a quick summary of the project?
What are the notable changes in the Dagster project in the past year?
What are the ecosystem pressures that have shaped the ways that you think about the features and trajectory of Dagster as a project/product/community?
In your recent release you introduced "components", which is a substantial change in how you enable teams to collaborate on data problems. What was the motivating factor in that work and how does it change the ways that organizations engage with their data?
Tension between being flexible and extensible vs. opinionated and constrained
Increased dependency on orchestration with LLM use cases
Reducing the barrier to contribution for data platform/pipelines
Bringing application engineers into the mix
Challenges of meeting users/teams where they are (languages, platform investments, etc.)
What are the most interesting, innovative, or unexpected ways that you have seen teams applying the Components pattern?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on the latest iterations of Dagster?
When is Dagster the wrong choice?
What do you have planned for the future of Dagster?

Contact Info

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Dagster+ Episode, Dagster Components Slide Deck, The Rise Of Medium Code, Lakehouse Architecture, Iceberg, Dagster Components, Pydantic Models, Kubernetes, Dagster Pipes, Ruby on Rails, dbt, Sling, Fivetran, Temporal, MCP == Model Context Protocol

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Summary In this episode of the Data Engineering Podcast Alex Albu, tech lead for AI initiatives at Starburst, talks about integrating AI workloads with the lakehouse architecture. From his software engineering roots to leading data engineering efforts, Alex shares insights on enhancing Starburst's platform to support AI applications, including an AI agent for data exploration and using AI for metadata enrichment and workload optimization. He discusses the challenges of integrating AI with data systems, innovations like SQL functions for AI tasks and vector databases, and the limitations of traditional architectures in handling AI workloads. Alex also shares his vision for the future of Starburst, including support for new data formats and AI-driven data exploration tools.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.

This is a pharmaceutical ad for Soda Data Quality. Do you suffer from chronic dashboard distrust? Are broken pipelines and silent schema changes wreaking havoc on your analytics? You may be experiencing symptoms of Undiagnosed Data Quality Syndrome — also known as UDQS. Ask your data team about Soda. With Soda Metrics Observability, you can track the health of your KPIs and metrics across the business — automatically detecting anomalies before your CEO does. It's 70% more accurate than industry benchmarks, and the fastest in the category, analyzing 1.1 billion rows in just 64 seconds. And with Collaborative Data Contracts, engineers and business can finally agree on what "done" looks like — so you can stop fighting over column names, and start trusting your data again. Whether you're a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you. Side effects of implementing Soda may include: increased trust in your metrics, reduced late-night Slack emergencies, spontaneous high-fives across departments, fewer meetings and less back-and-forth with business stakeholders, and in rare cases, a newfound love of data. Sign up today to get a chance to win a $1000+ custom mechanical keyboard. Visit dataengineeringpodcast.com/soda to sign up and follow Soda's launch week. It starts June 9th.

This episode is brought to you by Coresignal, your go-to source for high-quality public web data to power best-in-class AI products. Instead of spending time collecting, cleaning, and enriching data in-house, use ready-made multi-source B2B data that can be smoothly integrated into your systems via APIs or as datasets. With over 3 billion data records from 15+ online sources, Coresignal delivers high-quality data on companies, employees, and jobs. It is powering decision-making for more than 700 companies across AI, investment, HR tech, sales tech, and market intelligence industries. A founding member of the Ethical Web Data Collection Initiative, Coresignal stands out not only for its data quality but also for its commitment to responsible data collection practices. Recognized as the top data provider by Datarade for two consecutive years, Coresignal is the go-to partner for those who need fresh, accurate, and ethically sourced B2B data at scale. Discover how Coresignal's data can enhance your AI platforms. Visit dataengineeringpodcast.com/coresignal to start your free 14-day trial.

Your host is Tobias Macey and today I'm interviewing Alex Albu about how Starburst is extending the lakehouse to support AI workloads.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by outlining the interaction points of AI with the types of data workflows that you are supporting with Starburst?
What are some of the limitations of warehouse and lakehouse systems when it comes to supporting AI systems?
What are the points of friction for engineers who are trying to employ LLMs in the work of maintaining a lakehouse environment?
Methods such as tool use (exemplified by MCP) are a means of bolting on AI models to systems like Trino. What are some of the ways that is insufficient or cumbersome?
Can you describe the technical implementation of the AI-oriented features that you have incorporated into the Starburst platform?
What are the foundational architectural modifications that you had to make to enable those capabilities?
For the vector storage and indexing, what modifications did you have to make to Iceberg?
What was your reasoning for not using a format like Lance?
For teams who are using Starburst and your new AI features, what are some examples of the workflows that they can expect?
What new capabilities are enabled by virtue of embedding AI features into the interface to the lakehouse?
What are the most interesting, innovative, or unexpected ways that you have seen Starburst AI features used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on AI features for Starburst?
When is Starburst/lakehouse the wrong choice for a given AI use case?
What do you have planned for the future of AI on Starburst?

Contact Info

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links

Starburst, Podcast Episode, AWS Athena, MCP == Model Context Protocol, LLM Tool Use, Vector Embeddings, RAG == Retrieval Augmented Generation, AI Engineering Podcast Episode, Starburst Data Products, Lance, LanceDB, Parquet, ORC, pgvector, Starburst Icehouse

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Behavioural data is fast becoming a cornerstone of modern business strategy, not just for media measurement or advertising optimisation, but across product, pricing, logistics, and platform development. It tells us what people actually do, not just what they say they do. As traditional market research struggles with low engagement and recall bias, brands are turning to digital behavioural data to make sharper, faster decisions. Whether it's tracking consumer journeys in the app economy or identifying early adoption trends (like the impact of AI tools on category disruption), the value lies in real, observable behaviour at scale. But that shift raises new questions around data ownership, consent, and fairness. And the rise of AI is only accelerating both the opportunity and the complexity.

In the latest episode of Hub & Spoken, Jason Foster, CEO & Founder of Cynozure, speaks to Chris Havemann, CEO of RealityMine, and discusses:

The transition from survey-based research to behavioural data analysis
The impact of AI on interpreting digital interactions
Ethical considerations surrounding data consent and transparency
Building trust through clear data collection and usage practices

Learn from Chris's 25+ years in data and insight, and explore how behavioural signals are reshaping everything from media to market intelligence.

Cynozure is a leading data, analytics and AI company that helps organisations reach their data potential. It works with clients on data and AI strategy, data management, data architecture and engineering, analytics and AI, data culture and literacy, and data leadership. The company was named one of The Sunday Times' fastest-growing private companies in both 2022 and 2023 and was recognised as The Best Place to Work in Data by DataIQ in 2023 and 2024. Cynozure is a certified B Corporation.

The like button has transformed how we interact online, becoming a cornerstone of digital engagement with over 7 billion clicks daily. What started as a simple user interface solution has evolved into a powerful data collection tool that companies use to understand customer preferences, predict trends, and build sophisticated recommendation systems. The data behind these interactions forms what experts call the 'like graph' - a valuable network of connections that might be one of your company's most underutilized assets.

Bob Goodson is President and Founder of Quid, a Silicon Valley–based company whose AI models are used by a third of the Fortune 50. Before starting Quid, he was the first employee at Yelp, where he played a role in the genesis of the like button and observed firsthand the rise of the social media industry. After Quid received an award in 2016 from the World Economic Forum for "Contributions to the Future of the Internet," Bob served a two-year term on WEF's Global Future Council for Artificial Intelligence & Robotics. While at Oxford University doing graduate research in language theory, Bob co-founded Oxford Entrepreneurs to connect scientists with business-minded students. Bob is co-author of a new book, Like: The Button That Changed the World, focussed on the origins of the ubiquitous Like Button in social media.

In the episode, Richie and Bob explore the origins of the like button, its impact on user interaction and business, the evolution of social media features, the significance of relational data, the future of social networks in the age of AI, and much more.

Links Mentioned in the Show:
Bob's book: Like: The Button That Changed the World
Connect with Bob
Course: Analyzing Social Media Data in Python
Related Episode: How I Nearly Got Fired For Running An A/B Test with Vanessa Larco, Former Partner at New Enterprise Associates
Rewatch sessions from RADAR: Skills Edition

New to DataCamp?
Learn on the go using the DataCamp mobile app
Empower your business with world-class data and AI skills with DataCamp for business
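The 'like graph' mentioned above is, at its core, a bipartite network connecting users to the items they have liked, which recommendation systems then mine for co-occurrence patterns. As a purely illustrative sketch (the data, names, and co-like scoring below are hypothetical, not anything specific to Quid or Yelp), here is how such a graph can be built and queried in Python:

from collections import defaultdict
from itertools import combinations

# Hypothetical like events as (user, item) pairs.
likes = [
    ("alice", "post_1"), ("alice", "post_2"),
    ("bob", "post_1"), ("bob", "post_3"),
    ("carol", "post_2"), ("carol", "post_3"),
]

# One side of the bipartite "like graph": user -> set of liked items.
user_likes = defaultdict(set)
for user, item in likes:
    user_likes[user].add(item)

# Project the graph onto items: count how often two items are liked
# by the same user (co-likes).
co_likes = defaultdict(int)
for items in user_likes.values():
    for a, b in combinations(sorted(items), 2):
        co_likes[(a, b)] += 1

def recommend(user, top_n=3):
    """Suggest items that frequently co-occur with the user's existing likes."""
    liked = user_likes[user]
    scores = defaultdict(int)
    for (a, b), count in co_likes.items():
        if a in liked and b not in liked:
            scores[b] += count
        elif b in liked and a not in liked:
            scores[a] += count
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("alice"))  # -> ['post_3']

Production systems replace the raw co-like counts with learned embeddings over far larger graphs, but the underlying asset is the same network of user-item connections.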

Sarah McKenna joins me to chat about all things web scraping. We discuss its applications, the evolution of alternative data, and AI's impact on the industry. We also discuss privacy concerns, the challenges of bot blocking, and the importance of data quality. Sarah shares ideas on how to get started with web scraping and the ethical considerations surrounding copyright and data collection.

Podcast episode by Val Kroll, Michael Tiffany (Fulcra Dynamics), Julie Hoyer, Tim Wilson (Analytics Power Hour - Columbus (OH)), Moe Kiss (Canva), Michael Helbling (Search Discovery)

Every listener of this show is keenly aware that they are enabling the collection of various forms of hyper-specific data. Smartphones are movement and light biometric data collection machines. Many of us augment this data with a smartwatch, a smart ring, or both. A connected scale? Sure! Maybe even a continuous glucose monitor (CGM)! But… why? And what are the ramifications both for changing the ways we move through life for the better (Live healthier! Proactive wellness!) and for the worse (privacy risks and bad actors)? We had a wide-ranging discussion with Michael Tiffany, co-founder and CEO of Fulcra Dynamics, that took a run at these topics and more. Why, it's possible you'll get so excited by the content that one of your devices will record a temporary spike in your heart rate! For complete show notes, including links to items mentioned in this episode and a transcript of the show, visit the show page.

Episode Description

Ever feel like your phone knows you a little too well? One Google search, and suddenly, ads follow you across the internet like a digital stalker. AI-powered personalization has long relied on collecting massive amounts of personal data—but what if it didn't have to? In this episode of Data & AI with Mukundan, we explore a game-changing shift in AI—personalized experiences without intrusive tracking. Two groundbreaking techniques, Sequential Layer Expansion and FedSelect, are reshaping how AI learns from users while keeping their data private.

We'll break down:
✅ Why AI personalization has been broken until now
✅ How these new models improve AI recommendations without privacy risks
✅ Real-world applications in streaming, e-commerce, and healthcare
✅ How AI can respect human identity while scaling globally

The future of AI is personal, but it doesn't have to be invasive. Tune in to discover how AI can work for you—without spying on you.

Key Takeaways

🔹 The Problem: Why AI Personalization Has Been Broken
Streaming services, e-commerce, and healthcare AI often make irrelevant or generic recommendations.
Most AI models collect massive amounts of user data, stored on centralized servers—risking leaks, breaches, and misuse.
AI personalization has been a "one-size-fits-all" approach that doesn't truly adapt to individual needs.

🔹 The Solution: AI That Learns Without Spying on You

✨ Sequential Layer Expansion – AI that grows with you
Instead of static AI models, this method builds in layers, adapting over time.
It learns only what's relevant to you, reducing unnecessary data collection.
Think of it like training for a marathon—starting small and progressively improving.

✨ FedSelect – AI that fine-tunes only what matters
Instead of changing an entire AI model, it selectively updates the most relevant parameters.
Think of it like tuning a car—you upgrade what's needed instead of replacing the whole engine.
Everything happens locally on your device, meaning your raw data never leaves.

🔹 Real-World Impact: How This Changes AI for You
🎬 Streaming Services – Netflix finally gets your taste right—without tracking you across the web.
🛍️ E-commerce – Shopping apps suggest what you actually need, not random trending items.
🏥 Healthcare – AI-powered health plans tailored to your genes and habits—without sharing your medical data.

🔹 The Bigger Picture: Why This Matters for the Future of AI
Personalized AI at scale: AI adapts to billions of users while remaining privacy-first.
AI that respects human identity: You control your AI, not the other way around.
The end of surveillance-style tracking: No more creepy ads following you around.

🌟 AI can be personal—without being invasive. That's the future we should all demand.

FedSelect: https://arxiv.org/abs/2404.02478 | Sequential Layer Expansion: https://arxiv.org/abs/2404.17799

🔔 Subscribe, rate, and review for more AI insights!
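For a concrete picture of the FedSelect idea sketched above (fine-tuning only the most relevant parameters, locally on the device), here is a heavily simplified illustration. It shows selective local updates in general, not the actual algorithm from the linked paper; the layer names, stand-in gradients, and selection rule are invented for the example:

import numpy as np

rng = np.random.default_rng(0)

# Shared global model parameters (e.g. received from a server); shapes are illustrative.
global_params = {name: rng.normal(size=4) for name in ["layer1", "layer2", "layer3"]}

def select_personal_subset(grads, fraction=0.34):
    """Pick the parameter tensors with the largest average gradient magnitude.

    This mimics the spirit of selective personalization: only the most
    'relevant' parameters are fine-tuned locally; the rest stay shared.
    """
    ranked = sorted(grads, key=lambda name: np.abs(grads[name]).mean(), reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return set(ranked[:keep])

def local_update(params, lr=0.1):
    """One simulated on-device step: the user's data and gradients never leave the device."""
    # Stand-in gradients that would normally come from the user's private data.
    grads = {name: rng.normal(size=p.shape) for name, p in params.items()}
    personal = select_personal_subset(grads)
    updated = dict(params)
    for name in personal:
        updated[name] = params[name] - lr * grads[name]
    return updated, personal

user_params, personalized = local_update(global_params)
print("Personalized locally:", sorted(personalized))  # only a subset of layers changed

In a real federated setup, only the shared (non-personalized) parameters would be sent back for server-side aggregation, so raw user data and the personalized subset stay on the device.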

We're improving DataFramed, and we need your help! We want to hear what you have to say about the show, and how we can make it more enjoyable for you—find out more here.

Understanding where the data you use comes from, how to use it responsibly, and how to maximize its value has become essential. But as data sources multiply, so do the complexities around data privacy, customization, and ownership. How can companies capture and leverage the right data to create meaningful customer experiences while respecting privacy? And as data drives more personalized interactions, what steps can businesses take to protect sensitive information and navigate the increasingly complex regulatory picture?

Jonathan Bloch is CEO at Exchange Data International (EDI) and a seasoned businessman with 40 years' experience in information provision. He started work in the newsletter industry and ran the US subsidiary of a UK public company before joining its main board as head of its publishing division. He has been a director and/or chair of several companies and is currently a non-executive director of an FCA-registered investment bank. In 1994 he founded Exchange Data International (EDI), a London-based financial data provider. EDI now has over 450 clients across three continents and is based in the UK, USA, India and Morocco, employing 500 people.

Scott Voigt is CEO and co-founder at Fullstory. Scott has enjoyed helping early-stage software businesses grow since the mid-90s, when he helped launch and take public nFront—one of the world's first Internet banking service providers. Prior to co-founding Fullstory, Voigt led marketing at Silverpop before the company was acquired by IBM. Previously, he worked at Noro-Moseley Partners, the Southeast's largest venture firm, and also served as COO at Innuvo, which was acquired by Google. Scott teamed up with two former Innuvo colleagues, and the group developed the earliest iterations of Fullstory to understand how an existing product was performing. It was quickly apparent that this new platform provided the greatest value—and the rest is history.

In the episode, Richie, Jonathan and Scott explore first-party vs third-party data, protecting corporate data, behavioral data, personalization, data sourcing strategies, platforms for storage and sourcing, data privacy, synthetic data, regulations and compliance, the future of data collection and storage, and much more.

Links Mentioned in the Show:
Fullstory
Exchange Data International
Connect with Jonathan and Scott
Course: Understanding GDPR
Related Episode: How Data and AI are Changing Data Management with Jamie Lerner, CEO, President, and Chairman at Quantum
Sign up to RADAR: Forward Edition

New to DataCamp?
Learn on the go using the DataCamp mobile...

We talked about:

00:00 DataTalks.Club intro
01:56 Using data to create livable cities
02:52 Rachel's career journey: from geography to urban data science
04:20 What does a transport scientist do?
05:34 Short-term and long-term transportation planning
06:14 Data sources for transportation planning in Singapore
08:38 Rachel's motivation for combining geography and data science
10:19 Urban design and its connection to geography
13:12 Defining a livable city
15:30 Livability of Singapore and urban planning
18:24 Role of data science in urban and transportation planning
20:31 Predicting travel patterns for future transportation needs
22:02 Data collection and processing in transportation systems
24:02 Use of real-time data for traffic management
27:06 Incorporating generative AI into data engineering
30:09 Data analysis for transportation policies
33:19 Technologies used in text-to-SQL projects
36:12 Handling large datasets and transportation data in Singapore
42:17 Generative AI applications beyond text-to-SQL
45:26 Publishing public data and maintaining privacy
45:52 Recommended datasets and projects for data engineering beginners
49:16 Recommended resources for learning urban data science

About the speaker:

Rachel is an urban data scientist dedicated to creating liveable cities through the innovative use of data. With a background in geography and a master's in urban data science, she blends qualitative and quantitative analysis to tackle urban challenges. Her aim is to integrate data-driven techniques with urban design to foster sustainable and equitable urban environments.

Links: - https://datamall.lta.gov.sg/content/datamall/en/dynamic-data.html


Join our Slack: https://datatalks.club/slack.html

Ever feel like you're clicking "agree" online without knowing what's happening behind the scenes? We break down the Privacy Engineer's Manifesto and how it aims to build real-world data protection into the systems we use every day. From minimizing data collection to respecting user control, we explore what's needed to make privacy a reality, not just a promise.

Podcast episode by Val Kroll, Julie Hoyer, Tim Wilson (Analytics Power Hour - Columbus (OH)), Matt Gershoff (Conductrics, New York - USA), Moe Kiss (Canva), Michael Helbling (Search Discovery)

While we don't often call it out explicitly, much of what and how much data we collect is driven by a "just in case" mentality: we don't know exactly HOW that next piece of data will be put to use, but we better collect it to minimize the potential for future regret about NOT collecting it. Data collection is an optionality play—we strive to capture "all the data" so that we have as many potential options as possible for how it gets crunched somewhere down the road. On this episode, we explored the many ways this deeply ingrained and longstanding mindset is problematic, and we were joined by the inimitable Matt Gershoff from Conductrics for the discussion! For complete show notes, including links to items mentioned in this episode and a transcript of the show, visit the show page.

Summary Generative AI promises to accelerate the productivity of human collaborators. Currently the primary way of working with these tools is through a conversational prompt, which is often cumbersome and unwieldy. In order to simplify the integration of AI capabilities into developer workflows Tsavo Knott helped create Pieces, a powerful collection of tools that complements the tools that developers already use. In this episode he explains the data collection and preparation process, the collection of model types and sizes that work together to power the experience, and how to incorporate it into your workflow to act as a second brain.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!

Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

Your host is Tobias Macey and today I'm interviewing Tsavo Knott about Pieces, a personal AI toolkit to improve the efficiency of developers.

Interview

Introduction
How did you get involved in machine learning?
Can you describe what Pieces is and the story behind it?
The past few months have seen an endless series of personalized AI tools launched. What are the features and focus of Pieces that might encourage someone to use it over the alternatives?
Model selections
Architecture of the Pieces application
Local vs. hybrid vs. online models
Model update/delivery process
Data preparation/serving for models in the context of the Pieces app
Application of AI to developer workflows
Types of workflows that people are building with Pieces
What are the most interesting, innovative, or unexpected ways that you have seen Pieces used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Pieces?
When is Pieces the wrong choice?
What do you have planned for the future of Pieces?

Contact Info

LinkedIn

Parting Question

From your perspective, what is the biggest barrier to adoption of machine learning today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links

Pieces, NPU == Neural Processing Unit, Tensor Chip, LoRA == Low Rank Adaptation, Generative Adversarial Networks, Mistral, Emacs, Vim, NeoVim, Dart, Flutter

Cookies were invented to help online shoppers, simply as an identifier so that online carts weren't lost to the ether. Marketers quickly saw the power of using cookies for more than just maintaining session states, and moved to use them as part of their targeted advertising. Before we knew it, our online habits were being tracked, without our clear consent. The unregulated cookie boom lasted until 2018 with the advent of GDPR and the CCPA. Since then, marketers have been evolving their practices, looking for alternatives to cookie-tracking that will perform comparatively, and with the cookie being phased out in 2024, technologies like fingerprinting and new privacy-centric marketing strategies will play a huge role in how products meet users in the future.

Cory Munchbach has spent her career on the cutting edge of marketing technology and brings years of experience working with Fortune 500 clients from various industries to BlueConic. Prior to BlueConic, she was an analyst at Forrester Research where she covered business and consumer technology trends and the fast-moving marketing tech landscape. A sought-after speaker and industry voice, Cory's work has been featured in Financial Times, Forbes, Raconteur, AdExchanger, The Drum, Venture Beat, Wired, AdAge, and Adweek. A life-long Bostonian, Cory has a bachelor's degree in political science from Boston College and spends a considerable amount of her non-work hours on various volunteer and philanthropic initiatives in the greater Boston community.

In the episode, Richie and Cory cover successful marketing strategies and their use of data, the types of data used in marketing, how data is leveraged during different stages of the customer life cycle, the impact of privacy laws on data collection and marketing strategies, tips on how to use customer data while protecting privacy and adhering to regulations, the importance of data skills in marketing, the future of marketing analytics, and much more.

Links Mentioned in the Show:
BlueConic
Mattel Creations
Google: Prepare for third-party cookie restrictions
Data Clean Rooms
[Course] Marketing Analytics for Business
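To ground the opening point about cookies as simple session identifiers: before they became tracking tools, a cookie was just an opaque token the server handed to the browser so that the same shopping cart could be found again on the next request. A minimal sketch using Python's standard library (the cart_session cookie name is made up for illustration):

from http.cookies import SimpleCookie
import uuid

# Server side: issue an opaque identifier so the shopper's cart survives
# between requests, which is all the original cookie was designed to do.
session_id = uuid.uuid4().hex
cookie = SimpleCookie()
cookie["cart_session"] = session_id
cookie["cart_session"]["path"] = "/"
cookie["cart_session"]["httponly"] = True
print(cookie.output())  # e.g. "Set-Cookie: cart_session=...; HttpOnly; Path=/"

# On a later request the browser echoes the cookie back in a header,
# and the server parses it to look up the same cart.
incoming = SimpleCookie()
incoming.load(f"cart_session={session_id}")
print(incoming["cart_session"].value == session_id)  # True

Targeted advertising grew out of the same mechanism once cookies began to be set and read across sites, which is the practice that GDPR and the CCPA later constrained.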

Summary

The insurance industry is notoriously opaque and hard to navigate. Max Cho found that fact frustrating enough that he decided to build a business around making policy selection more navigable. In this episode he shares his journey of data collection and analysis and the challenges of automating an intentionally manual industry.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack

This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold

As more people start using AI for projects, two things are clear: It's a rapidly advancing field, but it's tough to navigate. How can you get the best results for your use case? Instead of being subjected to a bunch of buzzword bingo, hear directly from pioneers in the developer and data science space on how they use graph tech to build AI-powered apps. Attend the dev and ML talks at NODES 2023, a free online conference on October 26 featuring some of the brightest minds in tech. Check out the agenda and register today at Neo4j.com/NODES.

You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It's the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it's real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!

Your host is Tobias Macey and today I'm interviewing Max Cho about the wild world of insurance companies and the challenges of collecting quality data for this opaque industry.

Interview

Introduction

How did you get involved in the area of data management? Can you describe what CoverageCat is and the story behind it? What are the different sources of data that you work with?

What are the most challenging aspects of collecting that data? Can you describe the formats and characteristics (3 Vs) of that data?

What are some of the ways that the operational model of insurance companies has contributed to the industry's opacity from a data perspective? Can you describe how you have architected your data platform?

How have the design and goals changed since you first started working on it? What are you optimizing for in your selection and implementation process?

What are the sharp edges/weak points that you worry about in your existing data flows?

How do you guard against those flaws in your day-to-day operations?

What are the

Every year we become increasingly aware of the urgency of the climate crisis, and with that, the need to usher in renewable energies and scale their adoption has never been more important. However, as we look at the ways to scale the adoption of renewable energy, data stands out as a key lever to accelerate a greener future.  Today’s guest is Jean-Pierre Pélicier, CDO at ENGIE. ENGIE is one of the largest energy producers in the world and definitely one of the largest in Europe. They operate in more than 48 countries and have committed to becoming carbon neutral by 2045. Data plays a crucial part in these plans. In the episode, Jean-Pierre shares his unique perspective on how data is not just transforming the renewable energy industry but also redefining the way we approach the climate crisis. From harnessing the power of data to optimize energy production and distribution to leveraging advanced analytics to predict and mitigate environmental impacts, Jean-Pierre highlights the ways data continues to be an invaluable tool in our quest for a sustainable future. Also discussed in the episode are the challenges of data collection and quality in the energy sector, the importance of fostering a data culture within an organization, and aligning data strategy with a company's strategic objectives.

Building machine learning systems with high predictive accuracy is inherently hard, and embedding these systems into great product experiences is doubly so. To build truly great machine learning products that reach millions of users, organizations need to marry great data science expertise, with strong attention to user experience, design thinking, and a deep consideration for the impacts of your prediction on users and stakeholders. So how do you do that? Today’s guest is Sam Stone, Director of Product Management, Pricing & Data at Opendoor, a real-estate technology company that leverages machine learning to streamline the home buying and selling process. Sam played an integral part in developing AI/ML products related to home pricing including the Opendoor Valuation Model (OVM), market liquidity forecasting, portfolio optimization, and resale decision tooling. Prior to Opendoor, he was a co-founder and product manager at Ansaro, a SaaS startup using data science and machine learning to help companies improve hiring decisions. Sam holds degrees in Math and International Relations from Stanford and an MBA from Harvard. Throughout the episode, we spoke about his principles for great ML product design, how to think about data collection for these types of products, how to package outputs from a model within a slick user interface, what interpretability means in the eyes of customers, how to be proactive about monitoring failure points, and much more.

With the increasing rate at which new data tools and platforms are being created, the modern data stack risks becoming just another buzzword data leaders use when talking about how they solve problems.

Alongside the arrival of new data tools is the need for leaders to see beyond just the modern data stack and think deeply about how their data work can align with business outcomes; otherwise, they risk falling behind, trying to create value from innovative but irrelevant technology.

In this episode, Yali Sassoon joins the show to explore what the modern data stack really means, how to rethink the modern data stack in terms of value creation, data collection versus data creation, and the right way businesses should approach data ingestion, and much more.

Yali is the Co-Founder and Chief Strategy Officer at Snowplow Analytics, a behavioral data platform that empowers data teams to solve complex data challenges. Yali is an expert in data with a background in both strategy and operations consulting, teaching companies how to use data properly to evolve their operations and improve their results.

In this episode, Jason Foster talks to George McCrea, Chief of Staff at Royal Engineers Geographic. They explore the fascinating role of geospatial data soldiers in the military, how the military's experience with data collection and analysis has developed over centuries and how it's used to improve the lives of civilians in a variety of ways and to support government bodies to make strategic decisions.