Python

Week 3 Focus: Intro to Modeling

2025-05-31 · Build & Learn: Data Science with Coffee

workshop

by Lindsey

Now that your data is cleaned and explored, it’s time to build your first model. This week is all about making something that works — it doesn’t have to be perfect. You’ll train a basic model (regression or classification) and evaluate performance using simple metrics, then start thinking about iteration and improvement.

Creating Tools and Working with Data in Finance (w/ Matt Dancho)

2025-05-30 · Mavens of Data Listen

podcast_episode

by Matt Dancho (Business Science and Quant Science)

AI/ML Analytics Data Science

What happens when a passion for data science meets the fast-paced world of stock trading? In this episode, we're joined by Matt Dancho, Founder of Business Science, Quant Science, and the creator of the popular tidyquant package, who shares his journey from data scientist to launching Business Science and the projects and packages he's built along the way. We explore how he leverages Python and R to trade stocks, as well as the lessons he's learned from building a business in the data. Whether you're a data professional, a stock market enthusiast, or an aspiring entrepreneur, this episode is packed with actionable insights to level up your skills and strategies. What You'll Learn: Taking your data skills and furthering your career to make money. Tips for working with stock data. How AI is changing analytics Register for free to be part of the next live session: https://bit.ly/3XB3A8b Follow us on Socials: LinkedIn YouTube Instagram (Mavens of Data) Instagram (Maven Analytics) TikTok Facebook Medium X/Twitter

Mathematics of Machine Learning

2025-05-30 · O'Reilly AI & ML Books O'Reilly Amazon

book

by Tivadar Danka

AI/ML ai-ml data machine-learning

In "Mathematics of Machine Learning," you will explore the foundational mathematics essential for understanding and advancing in machine learning. The book covers linear algebra, calculus, and probability theory, offering readers clear explanations and practical Python-based implementations. What this Book will help me do Master fundamental linear algebra concepts such as matrices, eigenvalues, and vector spaces. Understand and apply principles of calculus, including multivariable functions and optimization. Gain confidence in utilizing probability theory concepts like Bayes' theorem and random distributions. Learn to implement mathematical concepts in Python to solve machine learning problems. Bridge the gap between theoretical mathematics and the practical demands of modern machine learning. Author(s) Tivadar Danka is a PhD mathematician with a specialized focus on machine learning applications. Known for his clear and engaging teaching style, Tivadar has a deep understanding of both mathematical rigor and practical ML challenges. His ability to break down complex ideas into comprehensible concepts has helped him reach thousands of learners globally. Who is it for? The book is perfect for data scientists, aspiring machine learning engineers, software developers working with ML, and researchers interested in advanced ML methodologies. If you have a basic understanding of algebra and Python programming, alongside some familiarity with machine learning concepts, this book will help you deepen your mathematical insight and elevate your practical applications.

Scaling Data Operations With Platform Engineering

2025-05-29 · Data Engineering Podcast Listen

podcast_episode

by Chakri Kotaru , Tobias Macey

AI/ML Analytics AWS Dashboard Data Contracts Data Engineering Data Management Data Quality Datafold DevOps KPI NoSQL +1 more

Summary In this episode of the Data Engineering Podcast Chakravarthy Kotaru talks about scaling data operations through standardized platform offerings. From his roots as an Oracle developer to leading the data platform at a major online travel company, Chakravarthy shares insights on managing diverse database technologies and providing databases as a service to streamline operations. He explains how his team has transitioned from DevOps to a platform engineering approach, centralizing expertise and automating repetitive tasks with AWS Service Catalog. Join them as they discuss the challenges of migrating legacy systems, integrating AI and ML for automation, and the importance of organizational buy-in in driving data platform success.

Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementData migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.This is a pharmaceutical Ad for Soda Data Quality. Do you suffer from chronic dashboard distrust? Are broken pipelines and silent schema changes wreaking havoc on your analytics? You may be experiencing symptoms of Undiagnosed Data Quality Syndrome — also known as UDQS. Ask your data team about Soda. With Soda Metrics Observability, you can track the health of your KPIs and metrics across the business — automatically detecting anomalies before your CEO does. It’s 70% more accurate than industry benchmarks, and the fastest in the category, analyzing 1.1 billion rows in just 64 seconds. And with Collaborative Data Contracts, engineers and business can finally agree on what “done” looks like — so you can stop fighting over column names, and start trusting your data again.Whether you’re a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you. Side effects of implementing Soda may include: Increased trust in your metrics, reduced late-night Slack emergencies, spontaneous high-fives across departments, fewer meetings and less back-and-forth with business stakeholders, and in rare cases, a newfound love of data. Sign up today to get a chance to win a $1000+ custom mechanical keyboard. Visit dataengineeringpodcast.com/soda to sign up and follow Soda’s launch week. It starts June 9th.Your host is Tobias Macey and today I'm interviewing Chakri Kotaru about scaling successful data operations through standardized platform offeringsInterview IntroductionHow did you get involved in the area of data management?Can you start by outlining the different ways that you have seen teams you work with fail due to lack of structure and opinionated design?Why NoSQL?Pairing different styles of NoSQL for different problemsUseful patterns for each NoSQL style (document, column family, graph, etc.)Challenges in platform automation and scaling edge casesWhat challenges do you anticipate as a result of the new pressures as a result of AI applications?What are the most interesting, innovative, or unexpected ways that you have seen platform engineering practices applied to data systems?What are the most interesting, unexpected, or challenging lessons that you have learned while working on data platform engineering?When is NoSQL the wrong choice?What do you have planned for the future of platform principles for enabling data teams/data applications?Contact Info LinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.Links RiakDynamoDBSQL ServerCassandraScyllaDBCAP TheoremTerraformAWS Service CatalogBlog PostThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Data Without Labels

2025-05-26 · O'Reilly Data Science Books O'Reilly Amazon

book

by Vaibhav Verdhan

AI/ML Data Science GenAI Keras Matplotlib NumPy Pandas Seaborn TensorFlow data data-science data-science-tools

Discover all-practical implementations of the key algorithms and models for handling unlabeled data. Full of case studies demonstrating how to apply each technique to real-world problems. In Data Without Labels you’ll learn: Fundamental building blocks and concepts of machine learning and unsupervised learning Data cleaning for structured and unstructured data like text and images Clustering algorithms like K-means, hierarchical clustering, DBSCAN, Gaussian Mixture Models, and Spectral clustering Dimensionality reduction methods like Principal Component Analysis (PCA), SVD, Multidimensional scaling, and t-SNE Association rule algorithms like aPriori, ECLAT, SPADE Unsupervised time series clustering, Gaussian Mixture models, and statistical methods Building neural networks such as GANs and autoencoders Dimensionality reduction methods like Principal Component Analysis and multidimensional scaling Association rule algorithms like aPriori, ECLAT, and SPADE Working with Python tools and libraries like sci-kit learn, numpy, Pandas, matplotlib, Seaborn, Keras, TensorFlow, and Flask How to interpret the results of unsupervised learning Choosing the right algorithm for your problem Deploying unsupervised learning to production Maintenance and refresh of an ML solution Data Without Labels introduces mathematical techniques, key algorithms, and Python implementations that will help you build machine learning models for unannotated data. You’ll discover hands-off and unsupervised machine learning approaches that can still untangle raw, real-world datasets and support sound strategic decisions for your business. Don’t get bogged down in theory—the book bridges the gap between complex math and practical Python implementations, covering end-to-end model development all the way through to production deployment. You’ll discover the business use cases for machine learning and unsupervised learning, and access insightful research papers to complete your knowledge. About the Technology Generative AI, predictive algorithms, fraud detection, and many other analysis tasks rely on cheap and plentiful unlabeled data. Machine learning on data without labels—or unsupervised learning—turns raw text, images, and numbers into insights about your customers, accurate computer vision, and high-quality datasets for training AI models. This book will show you how. About the Book Data Without Labels is a comprehensive guide to unsupervised learning, offering a deep dive into its mathematical foundations, algorithms, and practical applications. It presents practical examples from retail, aviation, and banking using fully annotated Python code. You’ll explore core techniques like clustering and dimensionality reduction along with advanced topics like autoencoders and GANs. As you go, you’ll learn where to apply unsupervised learning in business applications and discover how to develop your own machine learning models end-to-end. What's Inside Master unsupervised learning algorithms Real-world business applications Curate AI training datasets Explore autoencoders and GANs applications About the Reader Intended for data science professionals. Assumes knowledge of Python and basic machine learning. About the Author Vaibhav Verdhan is a seasoned data science professional with extensive experience working on data science projects in a large pharmaceutical company. Quotes An invaluable resource for anyone navigating the complexities of unsupervised learning. A must-have. - Ganna Pogrebna, The Alan Turing Institute Empowers the reader to unlock the hidden potential within their data. - Sonny Shergill, Astra Zeneca A must-have for teams working with unstructured data. Cuts through the fog of theory ili Explains the theory and delivers practical solutions. - Leonardo Gomes da Silva, onGRID Sports Technology The Bible for unsupervised learning! Full of real-world applications, clear explanations, and excellent Python implementations. - Gary Bake, Falconhurst Technologies

From Data Discovery to AI: The Evolution of Semantic Layers

2025-05-21 · Data Engineering Podcast Listen

podcast_episode

by Shinji Kim (Select Star) , Tobias Macey

AI/ML BI Data Engineering Data Management Databricks Datafold DWH LLM Modern Data Stack Snowflake

Summary In this episode of the Data Engineering Podcast, host Tobias Macy welcomes back Shinji Kim to discuss the evolving role of semantic layers in the era of AI. As they explore the challenges of managing vast data ecosystems and providing context to data users, they delve into the significance of semantic layers for AI applications. They dive into the nuances of semantic modeling, the impact of AI on data accessibility, and the importance of business logic in semantic models. Shinji shares her insights on how SelectStar is helping teams navigate these complexities, and together they cover the future of semantic modeling as a native construct in data systems. Join them for an in-depth conversation on the evolving landscape of data engineering and its intersection with AI.

Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementData migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.Your host is Tobias Macey and today I'm interviewing Shinji Kim about the role of semantic layers in the era of AIInterview IntroductionHow did you get involved in the area of data management?Semantic modeling gained a lot of attention ~4-5 years ago in the context of the "modern data stack". What is your motivation for revisiting that topic today?There are several overlapping concepts – "semantic layer," "metrics layer," "headless BI." How do you define these terms, and what are the key distinctions and overlaps?Do you see these concepts converging, or do they serve distinct long-term purposes?Data warehousing and business intelligence have been around for decades now. What new value does semantic modeling beyond practices like star schemas, OLAP cubes, etc.?What benefits does a semantic model provide when integrating your data platform into AI use cases?How is it different between using AI as an interface to your analytical use cases vs. powering customer facing AI applications with your data?Putting in the effort to create and maintain a set of semantic models is non-zero. What role can LLMs play in helping to propose and construct those models?For teams who have already invested in building this capability, what additional context and metadata is necessary to provide guidance to LLMs when working with their models?What's the most effective way to create a semantic layer without turning it into a massive project? There are several technologies available for building and serving these models. What are the selection criteria that you recommend for teams who are starting down this path?What are the most interesting, innovative, or unexpected ways that you have seen semantic models used?What are the most interesting, unexpected, or challenging lessons that you have learned while working with semantic modeling?When is semantic modeling the wrong choice?What do you predict for the future of semantic modeling?Contact Info LinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.Links SelectStarSun MicrosystemsMarkov Chain Monte CarloSemantic ModelingSemantic LayerMetrics LayerHeadless BICubePodcast EpisodeAtScaleStar SchemaData VaultOLAP CubeRAG == Retrieval Augmented GenerationAI Engineering Podcast EpisodeKNN == K-Nearest NeighbersHNSW == Hierarchical Navigable Small Worlddbt Metrics LayerSoda DataLookMLHexPowerBITableauSemantic View (Snowflake)Databricks GenieSnowflake Cortex AnalystMalloyThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

The test automation toolbox: exploring frameworks built on WebDriver — the API for browser automation

2025-05-14 · The test automation toolbox: exploring frameworks built on WebDriver

talk

Java JavaScript Selenium c# ruby webdriver

This session explores the power and versatility of WebDriver, the standard API for browser automation, and its broad adoption across programming languages. We’ll dive into the WebDriver ecosystem, examining open-source frameworks built in Java, C#, Ruby, Python, and JavaScript. Attendees will gain insights into how these frameworks leverage WebDriver, the challenges of cross-language implementation, and best practices for choosing the right tool. We’ll also assess framework health using key GitHub metrics and discuss ways to contribute to the open-source automation community. Whether you’re a tester, developer, or QA engineer, this talk will help you navigate the test automation toolbox more effectively.

Balancing Off-the-Shelf and Custom Solutions in Data Engineering

2025-05-13 · Data Engineering Podcast Listen

podcast_episode

by Tulika Bhatt (Netflix) , Tobias Macey

AI/ML Flink Data Engineering Data Management Data Quality Datafold GenAI Iceberg Spark

Summary In this episode of the Data Engineering Podcast Tulika Bhatt, a senior software engineer at Netflix, talks about her experiences with large-scale data processing and the future of data engineering technologies. Tulika shares her journey into the data engineering field, discussing her work at BlackRock and Verizon before joining Netflix, and explains the challenges and innovations involved in managing Netflix's impression data for personalization and user experience. She highlights the importance of balancing off-the-shelf solutions with custom-built systems using technologies like Spark, Flink, and Iceberg, and delves into the complexities of ensuring data quality and observability in high-speed environments, including robust alerting strategies and semantic data auditing.

Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementData migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.Your host is Tobias Macey and today I'm interviewing Tulika Bhatt about her experiences working on large scale data processing and her insights on the future trajectory of the supporting technologiesInterview IntroductionHow did you get involved in the area of data management?Can you start by outlining the ways that operating at large scale change the ways that you need to think about the design of data systems?When dealing with small-scale data systems it can be feasible to have manual processes. What are the elements of large scal data systems that demand autopmation?How can those large-scale automation principles be down-scaled to the systems that the rest of the world are operating?A perennial problem in data engineering is that of data quality. The past 4 years has seen a significant growth in the number of tools and practices available for automating the validation and verification of data. In your experience working with high volume data flows, what are the elements of data validation that are still unsolved?Generative AI has taken the world by storm over the past couple years. How has that changed the ways that you approach your daily work?What do you see as the future realities of working with data across various axes of large scale, real-time, etc.?What are the most interesting, innovative, or unexpected ways that you have seen solutions to large-scale data management designed?What are the most interesting, unexpected, or challenging lessons that you have learned while working on data management across axes of scale?What are the ways that you are thinking about the future trajectory of your work??Contact Info LinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.Links BlackRockSparkFlinkKafkaCassandraRocksDBNetflix Maestro workflow orchestratorPagerdutyIcebergThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Structured Simplicity: Using Pydantic for Data Modeling and Type Safety in Python

2025-05-13 · May Members Talk Evening

talk

by Emin Mastizada (DeliveryHero)

Pydantic

Introduction to type safety and data modeling in Python, making querying and results more reliable and managed for LLM models.

ScaleDown

2025-05-13 · May Members Talk Evening

talk

by Archana Vaidheeswaran (Aleph Alpha)

chrome extension open source prompt compression

In this talk, we're excited to show you how we built ScaleDown, a Chrome extension that makes your AI interactions more efficient and environmentally sustainable using prompt compression! As more people use AI tools such as ChatGPT, Claude, and Gemini instead of Google Search, few realize the massive carbon footprint each interaction generates. We will talk about our journey from recognizing this hidden environmental cost to creating a solution that is helping users reduce their AI-related emissions by up to 80%. Finally, we'll share how developers can contribute to our open-source Python package powering ScaleDown's prompt compression. Whether you're interested in improving our compression algorithms, enhancing our emissions calculation methodology, or expanding compatibility with additional AI models, we'll show you how you can get involved!

I Built an AI Tool That Turns Any Python Code Into an Emotionally Engaging Blog Post

2025-05-09 · Data & AI with Mukundan | Learn AI by Building Listen

podcast_episode

by Mukundan Sankar

AI/ML LLM

Takeaways Code2Story Pro turns Python code into engaging blog posts. Traditional documentation methods are often insufficient. Effective communication of code is crucial for collaboration. The tool allows users to select tone and emotion for their writing. Mukund built the tool out of frustration with documentation. The technical setup involves Streamlit and OpenAI's GPT-4. Users can generate blog posts in under 30 seconds. Future updates will include file uploads and image generation. The tool is aimed at helping developers share their work easily. Storytelling in coding can enhance career opportunities.

Blog: https://medium.com/data-science-collective/i-built-an-ai-tool-that-turns-any-python-code-into-an-emotionally-engaging-blog-post-f6d14daeddbd

Website: Subscribe for free access to the code: https://mukundansankar.substack.com/

#300 End to End AI Application Development with Maxime Labonne, Head of Post-training at Liquid AI & Paul-Emil Iusztin, Founder at Decoding ML

2025-05-05 · DataFramed Listen

podcast_episode

by Maxime Labonne (Liquid AI) , Richie (DataCamp) , Paul-Emil Iusztin (Decoding ML)

AI/ML Data Quality GenAI GitHub LLM

The roles within AI engineering are as diverse as the challenges they tackle. From integrating models into larger systems to ensuring data quality, the day-to-day work of AI professionals is anything but routine. How do you navigate the complexities of deploying AI applications? What are the key steps from prototype to production? For those looking to refine their processes, understanding the full lifecycle of AI development is essential. Let's delve into the intricacies of AI engineering and the strategies that lead to successful implementation. Maxime Labonne is a Senior Staff Machine Learning Scientist at Liquid AI, serving as the head of post-training. He holds a Ph.D. in Machine Learning from the Polytechnic Institute of Paris and is recognized as a Google Developer Expert in AI/ML. An active blogger, he has made significant contributions to the open-source community, including the LLM Course on GitHub, tools such as LLM AutoEval, and several state-of-the-art models like NeuralBeagle and Phixtral. He is the author of the best-selling book “Hands-On Graph Neural Networks Using Python,” published by Packt. Paul-Emil Iusztin designs and implements modular, scalable, and production-ready ML systems for startups worldwide. He has extensive experience putting AI and generative AI into production. Previously, Paul was a Senior Machine Learning Engineer at Metaphysic.ai and a Machine Learning Lead at Core.ai. He is a co-author of The LLM Engineer's Handbook, a best seller in the GenAI space. In the episode, Richie, Maxime, and Paul explore misconceptions in AI application development, the intricacies of fine-tuning versus few-shot prompting, the limitations of current frameworks, the roles of AI engineers, the importance of planning and evaluation, the challenges of deployment, and the future of AI integration, and much more. Links Mentioned in the Show: Maxime’s LLM Course on HuggingFaceMaxime and Paul’s Code Alongs on DataCampDecoding ML on SubstackConnect with Maxime and PaulSkill Track: AI FundamentalsRelated Episode: Building Multi-Modal AI Applications with Russ d'Sa, CEO & Co-founder of LiveKitRewatch sessions from RADAR: Skills Edition New to DataCamp? Learn on the go using the DataCamp mobile appEmpower your business with world-class data and AI skills with DataCamp for business

StarRocks: Bridging Lakehouse and OLAP for High-Performance Analytics

2025-05-05 · Data Engineering Podcast Listen

podcast_episode

by Sida Shen (CelerData) , Tobias Macey

AI/ML Analytics ClickHouse Data Engineering Data Lakehouse Data Management Datafold Delta Dremio Hudi Iceberg Presto +1 more

Summary In this episode of the Data Engineering Podcast Sida Shen, product manager at CelerData, talks about StarRocks, a high-performance analytical database. Sida discusses the inception of StarRocks, which was forked from Apache Doris in 2020 and evolved into a high-performance Lakehouse query engine. He explains the architectural design of StarRocks, highlighting its capabilities in handling high concurrency and low latency queries, and its integration with open table formats like Apache Iceberg, Delta Lake, and Apache Hudi. Sida also discusses how StarRocks differentiates itself from other query engines by supporting on-the-fly joins and eliminating the need for denormalization pipelines, and shares insights into its use cases, such as customer-facing analytics and real-time data processing, as well as future directions for the platform.

Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementData migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.Your host is Tobias Macey and today I'm interviewing Sida Shen about StarRocks, a high performance analytical database supporting shared nothing and shared data patternsInterview IntroductionHow did you get involved in the area of data management?Can you describe what StarRocks is and the story behind it?There are numerous analytical databases on the market. What are the attributes of StarRocks that differentiate it from other options?Can you describe the architecture of StarRocks?What are the "-ilities" that are foundational to the design of the system?How have the design and focus of the project evolved since it was first created?What are the tradeoffs involved in separating the communication layer from the data layers?The tiered architecture enables the shared nothing and shared data behaviors, which allows for the implementation of lakehouse patterns. What are some of the patterns that are possible due to the single interface/dual pattern nature of StarRocks?The shared data implementation has cacheing built in to accelerate interaction with datasets. What are some of the limitations/edge cases that operators and consumers should be aware of?StarRocks supports management of lakehouse tables (Iceberg, Delta, Hudi, etc.), which overlaps with use cases for Trino/Presto/Dremio/etc. What are the cases where StarRocks acts as a replacement for those systems vs. a supplement to them?The other major category of engines that StarRocks overlaps with is OLAP databases (e.g. Clickhouse, Firebolt, etc.). Why might someone use StarRocks in addition to or in place of those techologies?We would be remiss if we ignored the dominating trend of AI and the systems that support it. What is the role of StarRocks in the context of an AI application?What are the most interesting, innovative, or unexpected ways that you have seen StarRocks used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on StarRocks?When is StarRocks the wrong choice?What do you have planned for the future of StarRocks?Contact Info LinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.Links StarRocksCelerDataApache DorisSIMD == Single Instruction Multiple DataApache IcebergClickHousePodcast EpisodeDruidFireboltPodcast EpisodeSnowflakeBigQueryTrinoDatabricksDremioData LakehouseDelta LakeApache HiveC++Cost-Based OptimizerIceberg Summit Tencent Games PresentationApache PaimonLancePodcast EpisodeDelta UniformApache ArrowStarRocks Python UDFDebeziumPodcast EpisodeThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Applied Machine Learning for Data Science Practitioners

2025-04-29 · O'Reilly Data Science Books O'Reilly Amazon

book

by Vidya Subramanian

AI/ML Data Science ai-ml data machine-learning

A single-volume reference on data science techniques for evaluating and solving business problems using Applied Machine Learning (ML). Applied Machine Learning for Data Science Practitioners offers a practical, step-by-step guide to building end-to-end ML solutions for real-world business challenges, empowering data science practitioners to make informed decisions and select the right techniques for any use case. Unlike many data science books that focus on popular algorithms and coding, this book takes a holistic approach. It equips you with the knowledge to evaluate a range of techniques and algorithms. The book balances theoretical concepts with practical examples to illustrate key concepts, derive insights, and demonstrate applications. In addition to code snippets and reviewing output, the book provides guidance on interpreting results. This book is an essential resource if you are looking to elevate your understanding of ML and your technical capabilities, combining theoretical and practical coding examples. A basic understanding of using data to solve business problems, high school-level math and statistics, and basic Python coding skills are assumed. Written by a recognized data science expert, Applied Machine Learning for Data Science Practitioners covers essential topics, including: Data Science Fundamentals that provide you with an overview of core concepts, laying the foundation for understanding ML. Data Preparation covers the process of framing ML problems and preparing data and features for modeling. ML Problem Solving introduces you to a range of ML algorithms, including Regression, Classification, Ranking, Clustering, Patterns, Time Series, and Anomaly Detection. Model Optimization explores frameworks, decision trees, and ensemble methods to enhance performance and guide the selection of the most effective model. ML Ethics addresses ethical considerations, including fairness, accountability, transparency, and ethics. Model Deployment and Monitoring focuses on production deployment, performance monitoring, and adapting to model drift.

Exploring NATS: A Multi-Paradigm Connectivity Layer for Distributed Applications

2025-04-28 · Data Engineering Podcast Listen

podcast_episode

by Derek Collison (Synadia) , Tobias Macey

AI/ML Cloud Computing Data Engineering Data Management Datafold Java Kafka TIBCO Spotfire

Summary In this episode of the Data Engineering Podcast Derek Collison, creator of NATS and CEO of Synadia, talks about the evolution and capabilities of NATS as a multi-paradigm connectivity layer for distributed applications. Derek discusses the challenges and solutions in building distributed systems, and highlights the unique features of NATS that differentiate it from other messaging systems. He delves into the architectural decisions behind NATS, including its ability to handle high-speed global microservices, support for edge computing, and integration with Jetstream for data persistence, and explores the role of NATS in modern data management and its use cases in industries like manufacturing and connected vehicles.

Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementData migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.Your host is Tobias Macey and today I'm interviewing Derek Collison about NATS, a multi-paradigm connectivity layer for distributed applications.Interview IntroductionHow did you get involved in the area of data management?Can you describe what NATS is and the story behind it?How have your experiences in past roles (cloud foundry, TIBCO messaging systems) informed the core principles of NATS?What other sources of inspiration have you drawn on in the design and evolution of NATS? (e.g. Kafka, RabbitMQ, etc.)There are several patterns and abstractions that NATS can support, many of which overlap with other well-regarded technologies. When designing a system or service, what are the heuristics that should be used to determine whether NATS should act as a replacement or addition to those capabilities? (e.g. considerations of scale, speed, ecosystem compatibility, etc.)There is often a divide in the technologies and architecture used between operational/user-facing applications and data systems. How does the unification of multiple messaging patterns in NATS shift the ways that teams think about the relationship between these use cases?How does the shared communication layer of NATS with multiple protocol and pattern adaptaters reduce the need to replicate data and logic across application and data layers?Can you describe how the core NATS system is architected?How have the design and goals of NATS evolved since you first started working on it?In the time since you first began writing NATS (~2012) there have been several evolutionary stages in both application and data implementation patterns. How have those shifts influenced the direction of the NATS project and its ecosystem?For teams who have an existing architecture, what are some of the patterns for adoption of NATS that allow them to augment or migrate their capabilities?What are some of the ecosystem investments that you and your team have made to ease the adoption and integration of NATS?What are the most interesting, innovative, or unexpected ways that you have seen NATS used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on NATS?When is NATS the wrong choice?What do you have planned for the future of NATS?Contact Info GitHubLinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.Links NATSNATS JetStreamSynadiaCloud FoundryTIBCOApplied Physics Lab - Johns Hopkins UniversityCray SupercomputerRVCM Certified MessagingTIBCO ZMSIBM MQJMS == Java Message ServiceRabbitMQMongoDBNodeJSRedisAMQP == Advanced Message Queueing ProtocolPub/Sub PatternCircuit Breaker PatternZero MQAkamaiFastlyCDN == Content Delivery NetworkAt Most OnceAt Least OnceExactly OnceAWS KinesisMemcachedSQSSegmentRudderstackPodcast EpisodeDLQ == Dead Letter QueueMQTT == Message Queueing Telemetry TransportNATS Kafka Bridge10BaseT NetworkWeb AssemblyRedPandaPodcast EpisodePulsar FunctionsmTLSAuthZ (Authorization)AuthN (Authentication)NATS Auth CalloutsOPA == Open Policy AgentRAG == Retrieval Augmented GenerationAI Engineering Podcast EpisodeHome AssistantPodcast.init EpisodeTailscaleOllamaCDC == Change Data CapturegRPCThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

AI Launchpad – Run Python data & AI apps w/o changes (Iceberg data storage)

2025-04-23 · AI Council 2025

launchpad_demo

AI/ML Iceberg

10-min startup demo in AI Launchpad.

Advanced Lakehouse Management With The LakeKeeper Iceberg REST Catalog

2025-04-21 · Data Engineering Podcast Listen

podcast_episode

by Viktor Kessler (Vakmo) , Tobias Macey

AI/ML Data Engineering Data Lake Data Lakehouse Data Management Datafold Iceberg Rust Cyber Security

Summary In this episode of the Data Engineering Podcast Viktor Kessler, co-founder of Vakmo, talks about the architectural patterns in the lake house enabled by a fast and feature-rich Iceberg catalog. Viktor shares his journey from data warehouses to developing the open-source project, Lakekeeper, an Apache Iceberg REST catalog written in Rust that facilitates building lake houses with essential components like storage, compute, and catalog management. He discusses the importance of metadata in making data actionable, the evolution of data catalogs, and the challenges and innovations in the space, including integration with OpenFGA for fine-grained access control and managing data across formats and compute engines.

Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementData migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.Your host is Tobias Macey and today I'm interviewing Viktor Kessler about architectural patterns in the lakehouse that are unlocked by a fast and feature-rich Iceberg catalogInterview IntroductionHow did you get involved in the area of data management?Can you describe what LakeKeeper is and the story behind it? What is the core of the problem that you are addressing?There has been a lot of activity in the catalog space recently. What are the driving forces that have highlighted the need for a better metadata catalog in the data lake/distributed data ecosystem?How would you characterize the feature sets/problem spaces that different entrants are focused on addressing?Iceberg as a table format has gained a lot of attention and adoption across the data ecosystem. The REST catalog format has opened the door for numerous implementations. What are the opportunities for innovation and improving user experience in that space?What is the role of the catalog in managing security and governance? (AuthZ, auditing, etc.)What are the channels for propagating identity and permissions to compute engines? (how do you avoid head-scratching about permission denied situations)Can you describe how LakeKeeper is implemented?How have the design and goals of the project changed since you first started working on it?For someone who has an existing set of Iceberg tables and catalog, what does the migration process look like?What new workflows or capabilities does LakeKeeper enable for data teams using Iceberg tables across one or more compute frameworks?What are the most interesting, innovative, or unexpected ways that you have seen LakeKeeper used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on LakeKeeper?When is LakeKeeper the wrong choice?What do you have planned for the future of LakeKeeper?Contact Info LinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.Links LakeKeeperSAPMicrosoft AccessMicrosoft ExcelApache IcebergPodcast EpisodeIceberg REST CatalogPyIcebergSparkTrinoDremioHive MetastoreHadoopNATSPolarsDuckDBPodcast EpisodeDataFusionAtlanPodcast EpisodeOpen MetadataPodcast EpisodeApache AtlasOpenFGAHudiPodcast EpisodeDelta LakePodcast EpisodeLance Table FormatPodcast EpisodeUnity CatalogPolaris CatalogApache GravitinoPodcast Episode KeycloakOpen Policy Agent (OPA)Apache RangerApache NiFiThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Building Agentic AI Systems

2025-04-21 · O'Reilly AI & ML Books O'Reilly Amazon

book

by Wrick Talukdar , Anjanava Biswas

AI/ML GenAI ai-ml artificial-intelligence-ai data generative-ai

In "Building Agentic AI Systems", you will explore how to design and create intelligent and autonomous AI agents that can reason, plan, and adapt. This book dives deep into the principles and practices necessary to unlock the potential of generative AI and agentic systems. From foundation to implementation, you'll gain valuable insights into cutting-edge AI architectures and functionalities. What this Book will help me do Understand the foundational concepts of generative AI and the principles of agentic systems. Develop skills to design AI agents capable of self-reflection, tool utilization, and adaptable planning. Explore strategies for ensuring ethical transparency and safety in autonomous AI systems. Learn practical techniques to build effective multi-agent AI collaborations with real-world applications. Gain insights into designing AI systems with scalability, adaptability, and minimal human intervention. Author(s) Anjanava Biswas and Wrick Talukdar are experts in AI development with years of experience working on generative AI frameworks and autonomous systems. They specialize in creating innovative AI solutions and contributing to AI best practices in the industry. Their dedication to teaching and clarity in writing make technical concepts accessible to developers at all levels. Who is it for? This book is ideal for AI developers, machine learning engineers, and software architects seeking to advance their understanding of designing and implementing intelligent autonomous AI systems. Readers should have a foundational understanding of machine learning principles and basic programming experience, particularly in Python, to follow the book effectively. Understanding of generative AI or large language models is helpful but not required. If you're aiming to build or refine your skills in agent-based AI systems and how they adapt, this book is for you.

156: I Built A Game That Simulates Your Data Career Journey

2025-04-15 · Data Career Podcast: Helping You Land a Data Analyst Job FAST Listen

podcast_episode

by Avery Smith

AI/ML Analytics Data Analytics SQL

YOU want to break into data analytics but not sure where to start? This interactive choose-your-own-adventure episode will help you! Get ready to make real-life decisions that will shape your data career. Play now and see where your choices take you. 💌 Join 10k+ aspiring data analysts & get my tips in your inbox weekly 👉 https://www.datacareerjumpstart.com/newsletter 🆘 Feeling stuck in your data journey? Come to my next free "How to Land Your First Data Job" training 👉 https://www.datacareerjumpstart.com/training 👩‍💻 Want to land a data job in less than 90 days? 👉 https://www.datacareerjumpstart.com/daa 👔 Ace The Interview with Confidence 👉 https://www.datacareerjumpstart.com/interviewsimulator ⌚ Control this audio using these timestamps: 1:54 - 1 - Data Scientist 3:48 - 2 - Data Analyst 5:42 - 3 - Python 7:36 - 4 - SQL 9:30 - 5 - Keep Learning 11:24 - 6 - Browse Some Jobs 13:18 - 7 - Move On 15:12 - 8 - Apply 17:06 - 9 - Try to Network 🔗 CONNECT WITH AVERY 🎥 YouTube Channel: https://www.youtube.com/@averysmith 🤝 LinkedIn: https://www.linkedin.com/in/averyjsmith/ 📸 Instagram: https://instagram.com/datacareerjumpstart 🎵 TikTok: https://www.tiktok.com/@verydata 💻 Website: https://www.datacareerjumpstart.com/ Mentioned in this episode: Join the last cohort of 2025! The LAST cohort of The Data Analytics Accelerator for 2025 kicks off on Monday, December 8th and enrollment is officially open!

To celebrate the end of the year, we’re running a special End-of-Year Sale, where you’ll get: ✅ A discount on your enrollment 🎁 6 bonus gifts, including job listings, interview prep, AI tools + more

If your goal is to land a data job in 2026, this is your chance to get ahead of the competition and start strong.

👉 Join the December Cohort & Claim Your Bonuses: https://DataCareerJumpstart.com/daa https://www.datacareerjumpstart.com/daa

#269: The Ins and Outs of Outliers with Brett Kennedy

2025-04-15 · The Analytics Power Hour Listen

podcast_episode

by Val Kroll , Brett Kennedy , Julie Hoyer , Tim Wilson (Analytics Power Hour - Columbus (OH) , Moe Kiss (Canva) , Michael Helbling (Search Discovery)

How is an outlier in the data like obscenity? A case could be made that they're both the sort of thing where we know it when we see it, but that can be awfully tricky to perfectly define and detect. Visualize many data sets, and some of the data points are obvious outliers, but just as many (or more) fall in a gray area—especially if they're sneaky inliers. z-score, MAD, modified z-score, interquartile range (IQR), time-series decomposition, smoothing, forecasting, and many other techniques are available to the analyst for detecting outliers. Depending on the data, though, the most appropriate method (or combination of methods) for identifying outliers can change! We sat down with Brett Kennedy, author of Outlier Detection in Python, to dig into the topic! For complete show notes, including links to items mentioned in this episode and a transcript of the show, visit the show page.

talk-data.com

Activity Trend

Top Events

Top Speakers

Week 3 Focus: Intro to Modeling

Creating Tools and Working with Data in Finance (w/ Matt Dancho)

Mathematics of Machine Learning

Scaling Data Operations With Platform Engineering

Data Without Labels

From Data Discovery to AI: The Evolution of Semantic Layers

The test automation toolbox: exploring frameworks built on WebDriver — the API for browser automation

Balancing Off-the-Shelf and Custom Solutions in Data Engineering

Structured Simplicity: Using Pydantic for Data Modeling and Type Safety in Python

ScaleDown

I Built an AI Tool That Turns Any Python Code Into an Emotionally Engaging Blog Post

#300 End to End AI Application Development with Maxime Labonne, Head of Post-training at Liquid AI & Paul-Emil Iusztin, Founder at Decoding ML

StarRocks: Bridging Lakehouse and OLAP for High-Performance Analytics

Applied Machine Learning for Data Science Practitioners

Exploring NATS: A Multi-Paradigm Connectivity Layer for Distributed Applications

AI Launchpad – Run Python data & AI apps w/o changes (Iceberg data storage)

Advanced Lakehouse Management With The LakeKeeper Iceberg REST Catalog

Building Agentic AI Systems

156: I Built A Game That Simulates Your Data Career Journey

#269: The Ins and Outs of Outliers with Brett Kennedy