talk-data.com

Topic: AI/ML (Artificial Intelligence/Machine Learning)

Tags: data_science, algorithms, predictive_analytics

Activity Trend: 1532 peak/qtr (2020-Q1 to 2026-Q1)

Activities

9014 activities · Newest first

ThredUp’s Journey with Databricks: Modernizing Our Data Infrastructure

Building an AI-ready data platform requires strong governance, performance optimization, and seamless adoption of new technologies. At ThredUp, our Databricks journey began with a need for better data management and evolved into a full-scale transformation powering analytics, machine learning, and real-time decision-making. In this session, we’ll cover:

- Key inflection points: moving from legacy systems to a modernized Delta Lake foundation
- Unity Catalog’s impact: improving governance, access control, and data discovery
- Best practices for onboarding: ensuring smooth adoption for engineering and analytics teams
- What’s next: Serverless SQL and conversational analytics with Genie

Whether you’re new to Databricks or scaling an existing platform, you’ll gain practical insights on navigating the transition, avoiding pitfalls, and maximizing AI and data intelligence.

Transforming Government With Data and AI: Singapore GovTech's Journey With Databricks

GovTech is an agency in the Singapore Government focused on tech for good. The GovTech Chief Data Office (CDO) has built the GovTech Data Platform with Databricks at the core. As the government tech agency, we safeguard national-level government and citizen data. A comprehensive data strategy is essential to uplifting data maturity. GovTech has adopted the service model approach, where data services are offered to stakeholders based on their data maturity. Stakeholders' maturity is uplifted through partnership, readying them for more advanced data analytics. CDO offers a plethora of data assets in a “data restaurant” ranging from raw data to data products, all delivered via Databricks and enabled through fine-grained access control, underpinned by data management best practices such as data quality, security and governance. Within our first year on Databricks, CDO was able to save 8,000 man-hours, democratize data across 50% of the agency and achieve six-figure savings through BI consolidation.

Ursa: Augment Your Lakehouse With Kafka-Compatible Data Streaming Capabilities

As data architectures evolve to meet the demands of real-time GenAI applications, organizations increasingly need systems that unify streaming and batch processing while maintaining compatibility with existing tools. The Ursa Engine offers a Kafka-API-compatible data streaming engine built on Lakehouse (Iceberg and Delta Lake). Designed to seamlessly integrate with data lakehouse architectures, Ursa extends your lakehouse capabilities by enabling streaming ingestion, transformation and processing — using a Kafka-compatible interface. In this session, we will explore how Ursa Engine augments your existing lakehouses with Kafka-compatible capabilities. Attendees will gain insights into Ursa Engine architecture and real-world use cases of Ursa Engine. Whether you're modernizing legacy systems or building cutting-edge AI-driven applications, discover how Ursa can help you unlock the full potential of your data.

Validating Clinical Trial Platforms on Databricks

Clinical Trial Data is undergoing a renaissance with new insights and data sources being added daily. The speed of new innovations and modalities that are found within trials poses an existential dilemma for 21CFR Part 11 compliance. In these validated environments, new components and methods need to be tested for reproducibility and restricted data access. In classical systems, this validation process would often have taken three months or more due to the manual validation process via validation scripts like Installation Qualification (IQ) and Operational Qualification (OQ) scripts. In conjunction with Databricks, Purgo AI has developed a new technology leveraging generative AI to automate the execution of IQ and OQ scripts and has drastically reduced the amount of time for validating Databricks from three months to less than a day. This drastic speedup of validation will enable the continuous flow of new ideas and implementations for clinical trials.

Today, we’re joined by Todd Olson, co-founder and CEO of Pendo, the world’s first software experience management platform. We talk about:

- Offloading work from employees to digital workers
- When most people will opt to chat with an AI agent over a human
- The need for SaaS apps to transform themselves into agentic apps
- Advice for serial SaaS entrepreneurs, including a big cautionary tale for startups
- AI-generated and AI-maintained code and the ease of prototyping

Today, I’m responding to a listener's question about what it takes to succeed as a data or AI product manager, especially if you’re coming from roles like design/BI/data visualization, data science/engineering, or traditional software product management. This listener correctly observed that most of my content “seems more targeted at senior leadership” — and had asked if I could address this more IC-oriented topic on the show. I’ll break down why technical chops alone aren’t enough, and how user-centered thinking, business impact, and outcome-focused mindsets are key to real success — and where each of these prior roles brings strengths and/or weaknesses. I’ll also get into the evolving nature of PM roles in the age of AI, and what I think the super-powered AI product manager will look like.

Highlights/ Skip to:

- Who can transition into an AI and data product management role? What does it take? (5:29)
- Software product managers moving into AI product management (10:05)
- Designers moving into data/AI product management (13:32)
- Moving into the AI PM role from the engineering side (21:47)
- Why the challenge of user adoption and trust is often the blocker to the business value (29:56)
- Designing change management into AI/data products as a skill (31:26)
- The challenge of value creation vs. delivery work — and how incentives are aligned for ICs (35:17)
- Quantifying the financial value of data and AI product work (40:23)

Quotes from Today’s Episode

“Who can transition into this type of role, and what is this role? I’m combining these two things. AI product management often seems closely tied to software companies that are primarily leveraging AI, or trying to, and therefore, they tend to utilize this AI product management role. I’m seeing less of that in internal data teams, where you tend to see data product management more, which, for me, feels like an umbrella term that may include traditional analytics work, data platforms, and often AI and machine learning. I’m going to frame this more in the AI space, primarily because I think AI tends to capture the end-to-end product more frequently than data product management does.” — Brian (2:55)

“There are three disciplines I’m going to talk about moving into this role. Coming into AI and data PM from design and UX, coming into it from data engineering (or just broadly technical spaces), and then coming into it from software product management. I think software product management and moving into the AI product management - as long as you’re not someone that has two years of experience, and then 18 years of repeating the second year of experience over and over again - and you’ve had a robust product management background across some different types of products; you can show that the domain doesn’t necessarily stop you from producing value. I think you will have the easiest time moving into AI product management because you’ve shown that you can adapt across different industries.” - Brian (9:45)

“Let’s talk about designers next. I’m going to include data visualization, user experience research, user experience design, product design, all those types of broad design, category roles. Moving into data and/or AI product management, first of all, you don’t see too many—I don’t hear about too many designers wanting to move into DPM roles, because oftentimes I don’t think there’s a lot of heavy UI and UX all the time in that space. Or at least the teams that are doing that work feel that’s somebody else’s job because they’re not doing end-to-end product thinking the way I talk about it, so therefore, a lot of times they don’t see the application, the user experience, the human adoption, the change management, they’re just not looking at the world that way, even though I think they should be.” - Brian (13:32)

“Coming at this from the data and engineering side, this is the classic track for data product management. At least that is the way I tend to see it. I believe most companies prefer to develop this role in-house. My biggest concern is that you end up with job title changes, but not necessarily the benefits that are supposed to come with this. I do like learning by doing, but having a coach and someone senior who can coach your other PMs is important because there’s a lot of information that you won’t necessarily get in a class or a course. It’s going to come from experience doing the work.” - Brian (22:26)

“This value piece is the most important thing, and I want to focus on that. This is something I frequently discuss in my training seminar: how do we attach financial value to the work we’re doing? This is both art and science, but it’s a language that anyone in a product management role needs to be comfortable with. If you’re finding it very hard to figure out how your data product contributes financial value because it’s based on this waterfalling of “We own the model, and it’s deployed on a platform.” The platform then powers these other things, which in turn power an application. How do we determine the value of our tool? These things are challenging, and if it’s challenging for you, guess how hard it will be for stakeholders downstream if you haven’t had the practice and the skills required to understand how to estimate value, both before we build something as well as after?” - Brian (31:51)

“If you don’t want to spend your time getting to know how your business makes money or creates value, then [AI and data product management work] is not for you. It’s just not. I would stay doing what you’re doing already or find a different thing because a lot of your time is going to be spent “managing up” for half the time, and then managing the product stuff “down.” Then, sitting in this middle layer, trying to explain to the business what’s going to come out and what the impact is going to be, in language that they care about and understand. You can't be talking about models, model accuracy, data pipelines, and all that stuff. They’re not going to care about any of that.” - Brian (34:08)

Matthew Scullion (CEO, Co-Founder of Matillion) joins me to chat about the future of data engineering, namely agentic data engineering teams.

What does this new world look like? Matthew shares some ideas of what he's building at Matillion, and the broader context of what agentic AI means for the data ecosystem, teams, and workflows.

The course is designed to cover advanced concepts and workflows in machine learning operations. It starts by introducing participants to continuous integration (CI) and continuous delivery (CD) workflows within machine learning projects, guiding them through the deployment of a sample CI/CD workflow using Databricks in the first section. Moving on to the second part, participants delve into data and model testing, where they actively create tests and automate CI/CD workflows. Finally, the course concludes with an exploration of model monitoring concepts, demonstrating the use of Lakehouse Monitoring to oversee machine learning models in production settings. Pre-requisites: Familiarity with Databricks workspace and notebooks; knowledge of machine learning model development and deployment with MLflow (e.g. intermediate-level knowledge of traditional ML concepts, development with CI/CD, the use of Python and Git for ML projects with popular platforms like GitHub) Labs: Yes Certification Path: Databricks Certified Machine Learning Professional
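A core building block of the CI workflows this course describes is an automated model-quality gate: a test that fails the pipeline when a candidate model underperforms. Here is a minimal, framework-free sketch of the idea; the metric, threshold, and data are illustrative assumptions, not taken from the course, and a real setup would load the model from MLflow inside the CI job.

```python
# Minimal sketch of a CI quality gate for an ML model: compare candidate
# predictions against labels and fail the build below a threshold.
# The threshold and data are illustrative; a real pipeline would pull the
# registered model from MLflow and run this check in the CI job.

def accuracy(predictions, labels):
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def quality_gate(predictions, labels, threshold=0.8):
    """Raise (failing the CI job) if accuracy falls below the threshold."""
    score = accuracy(predictions, labels)
    if score < threshold:
        raise AssertionError(f"model accuracy {score:.2f} below {threshold}")
    return score

print(quality_gate([1, 0, 1, 1, 0], [1, 0, 1, 0, 0]))  # 0.8 passes the gate
```

The same pattern extends to data tests (schema and null checks on training data) run in the same CI job before promotion.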

In this course, you’ll learn how to use the features Databricks provides for business intelligence needs: AI/BI Dashboards and AI/BI Genie. As a Databricks Data Analyst, you will be tasked with creating AI/BI Dashboards and AI/BI Genie Spaces within the platform, managing the access to these assets by stakeholders and necessary parties, and maintaining these assets as they are edited, refreshed, or decommissioned over the course of their lifespan. This course intends to instruct participants on how to design dashboards for business insights, share those with collaborators and stakeholders, and maintain those assets within the platform. Participants will also learn how to utilize AI/BI Genie Spaces to support self-service analytics through the creation and maintenance of these environments powered by the Databricks Data Intelligence Engine. Pre-requisites: The content was developed for participants with these skills/knowledge/abilities: A basic understanding of SQL for querying existing data tables in Databricks. Prior experience or basic familiarity with the Databricks Workspace UI. A basic understanding of the purpose and use of statistical analysis results. Familiarity with the concepts around dashboards used for business intelligence. Labs: Yes

In this course, you’ll learn how to define and schedule data pipelines that incrementally ingest and process data through multiple tables on the Data Intelligence Platform, using Lakeflow Declarative Pipelines in Spark SQL and Python. We’ll cover topics like how to get started with Lakeflow Declarative Pipelines, how Lakeflow Declarative Pipelines tracks data dependencies in data pipelines, how to configure and run data pipelines using the Lakeflow Declarative Pipelines UI, how to use Python or Spark SQL to define data pipelines that ingest and process data through multiple tables on the Data Intelligence Platform, using Auto Loader and Lakeflow Declarative Pipelines, how to use APPLY CHANGES INTO syntax to process Change Data Capture feeds, and how to review event logs and data artifacts created by pipelines and troubleshoot syntax. By streamlining and automating reliable data ingestion and transformation workflows, this course equips you with the foundational data engineering skills needed to help kickstart AI use cases. Whether you're preparing high-quality training data or enabling real-time AI-driven insights, this course is a key step in advancing your AI journey. Pre-requisites: Beginner familiarity with the Databricks Data Intelligence Platform (selecting clusters, navigating the Workspace, executing notebooks), cloud computing concepts (virtual machines, object storage, etc.), production experience working with data warehouses and data lakes, intermediate experience with basic SQL concepts (select, filter, group by, join, etc.), beginner programming experience with Python (syntax, conditions, loops, functions), beginner programming experience with the Spark DataFrame API (configure DataFrameReader and DataFrameWriter to read and write data, express query transformations using DataFrame methods and Column expressions, etc.) Labs: No Certification Path: Databricks Certified Data Engineer Associate
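To make the APPLY CHANGES INTO concept concrete, here is a plain-Python sketch of the merge logic it automates: applying an ordered Change Data Capture feed of upserts and deletes to a keyed table. This is not the Lakeflow/Databricks API, and the field names (`id`, `op`, `sequence`, `data`) are made up for illustration.

```python
# Plain-Python sketch of the change-data-capture (CDC) merge logic that
# APPLY CHANGES INTO automates. Not the Lakeflow/Databricks API; the
# event field names here are hypothetical.

def apply_changes(target, feed):
    """Apply an ordered CDC feed of upserts/deletes to a dict keyed by id."""
    # Process events in sequence order so later changes win.
    for event in sorted(feed, key=lambda e: e["sequence"]):
        if event["op"] == "delete":
            target.pop(event["id"], None)
        else:  # upsert: insert a new row or overwrite the existing one
            target[event["id"]] = event["data"]
    return target

table = {1: {"name": "alice"}}
feed = [
    {"id": 2, "op": "upsert", "sequence": 1, "data": {"name": "bob"}},
    {"id": 1, "op": "delete", "sequence": 2},
    {"id": 2, "op": "upsert", "sequence": 3, "data": {"name": "bobby"}},
]
print(apply_changes(table, feed))  # {2: {'name': 'bobby'}}
```

In Lakeflow Declarative Pipelines, the same sequencing and upsert/delete semantics are expressed declaratively, with the engine handling ordering and state.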

This course provides participants with information and practical experience in building advanced LLM (Large Language Model) applications using multi-stage reasoning LLM chains and agents. In the initial section, participants will learn how to decompose a problem into its components and select the most suitable model for each step to enhance business use cases. Following this, participants will construct a multi-stage reasoning chain utilizing LangChain and HuggingFace transformers. Finally, participants will be introduced to agents and will design an autonomous agent using generative models on Databricks. Pre-requisites: Solid understanding of natural language processing (NLP) concepts, familiarity with prompt engineering and prompt engineering best practices, experience with the Databricks Data Intelligence Platform, experience with retrieval-augmented generation (RAG) techniques including data preparation, building RAG architectures, and concepts like embeddings, vectors, and vector databases Labs: Yes Certification Path: Databricks Certified Generative AI Engineer Associate

In this course, you’ll learn how to develop traditional machine learning models on Databricks. We’ll cover topics like using popular ML libraries, executing common tasks efficiently with AutoML and MLflow, harnessing Databricks' capabilities to track model training, leveraging feature stores for model development, and implementing hyperparameter tuning. Additionally, the course covers AutoML for rapid and low-code model training, ensuring that participants gain practical, real-world skills for streamlined and effective machine learning model development in the Databricks environment. Pre-requisites: Familiarity with Databricks workspace and notebooks, familiarity with Delta Lake and Lakehouse, intermediate level knowledge of Python (e.g. common Python libraries for DS/ML like Scikit-Learn, fundamental ML algorithms like regression and classification, model evaluation with common metrics) Labs: Yes Certification Path: Databricks Certified Machine Learning Associate

AI/BI for Self-Service Analytics

In this course, you will learn how to self-serve business insights from your company’s Databricks Data Intelligence Platform using AI/BI. After a tour of the fundamental components of the platform, you’ll learn how to interact with pre-created AI/BI Dashboards to explore your company’s data through existing charts and visualizations. You’ll also learn how to use AI/BI Genie to go beyond dashboards by asking follow-up questions in natural language to self-serve new insights, create visualizations, and share them with your colleagues. Pre-requisites: A working understanding of your organization’s business and key performance indicators. Labs: No Certification Path: N/A

Data Ingestion with Lakeflow Connect

In this course, you’ll learn how to ingest data efficiently with Lakeflow Connect and manage that data. Topics include ingestion with built-in connectors for SaaS applications, databases and file sources, as well as ingestion from cloud object storage, and batch and streaming ingestion. We'll cover the new connector components, setting up the pipeline, validating the source and mapping to the destination for each type of connector. We'll also cover how to ingest data in batch and streaming modes into Delta tables, using the UI with Auto Loader, automating ETL with Lakeflow Declarative Pipelines, or using the API. This will prepare you to deliver the high-quality, timely data required for AI-driven applications by enabling scalable, reliable, and real-time data ingestion pipelines. Whether you're supporting ML model training or powering real-time AI insights, these ingestion workflows form a critical foundation for successful AI implementation. Pre-requisites: Beginner familiarity with the Databricks Data Intelligence Platform (selecting clusters, navigating the Workspace, executing notebooks), cloud computing concepts (virtual machines, object storage, etc.), production experience working with data warehouses and data lakes, intermediate experience with basic SQL concepts (select, filter, group by, join, etc.), beginner programming experience with Python (syntax, conditions, loops, functions), beginner programming experience with the Spark DataFrame API (configure DataFrameReader and DataFrameWriter to read and write data, express query transformations using DataFrame methods and Column expressions, etc.) Labs: No Certification Path: Databricks Certified Data Engineer Associate
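The incremental-ingestion idea behind Auto Loader can be sketched in plain Python: track which source files have already been processed and load only the new ones on each run. This is an illustration of the pattern only, not the Databricks API; the function and file names are hypothetical.

```python
# Sketch of incremental, load-each-file-once ingestion, the pattern that
# Auto Loader implements at scale with checkpointed file discovery.
# Not the Databricks API; names are hypothetical for illustration.

def ingest_new_files(available_files, processed, load):
    """Ingest only files not seen before; record them as processed."""
    for path in sorted(available_files):
        if path not in processed:
            load(path)           # e.g. parse the file and append to a Delta table
            processed.add(path)  # checkpoint so reruns skip this file

loaded = []
processed = set()

# First run sees two files; the second run sees one additional file.
ingest_new_files(["a.json", "b.json"], processed, loaded.append)
ingest_new_files(["a.json", "b.json", "c.json"], processed, loaded.append)

print(loaded)  # ['a.json', 'b.json', 'c.json'] — each file loaded exactly once
```

Auto Loader adds scalable file discovery, schema inference, and fault-tolerant checkpointing on top of this same contract.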

In this course, you’ll learn the fundamentals of preparing data for machine learning using Databricks. We’ll cover topics like exploring, cleaning, and organizing data tailored for traditional machine learning applications. We’ll also cover data visualization, feature engineering, and optimal feature storage strategies. By building a strong foundation in data preparation, this course equips you with the essential skills to create high-quality datasets that can power accurate and reliable machine learning and AI models. Whether you're developing predictive models or enabling downstream AI applications, these capabilities are critical for delivering impactful, data-driven solutions. Pre-requisites: Familiarity with the Databricks workspace, notebooks, and Unity Catalog; intermediate-level knowledge of Python (scikit-learn, Matplotlib), Pandas, and PySpark; and familiarity with the concepts of exploratory data analysis, feature engineering, standardization, and imputation methods. Labs: Yes Certification Path: Databricks Certified Machine Learning Associate
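Two of the preparation steps named above, imputation and standardization, can be sketched with the standard library alone. This is an illustrative toy under simple assumptions (mean imputation, z-score scaling), not the course's Databricks workflow, which would use scikit-learn or PySpark.

```python
# Toy sketch of two data-preparation steps: mean imputation of missing
# values, then z-score standardization. Illustrative only; real pipelines
# would use scikit-learn / PySpark as the course describes.
from statistics import mean, pstdev

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def standardize(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

raw = [1.0, None, 3.0, 5.0]
clean = impute_mean(raw)   # [1.0, 3.0, 3.0, 5.0]
print(standardize(clean))  # zero-mean, unit-variance version of clean
```

The same fit-then-transform discipline matters at scale: imputation and scaling statistics must be computed on training data only and reused on new data.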

This course is designed to introduce participants to contextual GenAI (generative artificial intelligence) solutions using the retrieval-augmented generation (RAG) method. Firstly, participants will be introduced to the RAG architecture and the significance of contextual information using Mosaic AI Playground. Next, the course will demonstrate how to prepare data for GenAI solutions and connect this process with building an RAG architecture. Finally, participants will explore concepts related to context embedding, vectors, vector databases, and the utilization of the Mosaic AI Vector Search product. Pre-requisites: Familiarity with embeddings, prompt engineering best practices, and experience with the Databricks Data Intelligence Platform Labs: Yes Certification Path: Databricks Certified Generative AI Engineer Associate
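The retrieval step at the heart of RAG can be sketched in a few lines of plain Python: rank stored chunks by cosine similarity to a query vector and pass the top hits to the model as context. The vectors below are hand-made stand-ins for real embeddings; a production system would use an embedding model and a vector database such as Mosaic AI Vector Search.

```python
# Toy sketch of the retrieval step in RAG: rank document chunks by cosine
# similarity to a query vector. Vectors here are hand-made stand-ins for
# real embeddings; production systems use an embedding model + vector DB.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# (chunk text, pretend embedding)
chunks = [
    ("Delta Lake stores table data", [0.9, 0.1]),
    ("Vector search finds similar embeddings", [0.1, 0.9]),
]

def retrieve(query_vec, chunks, k=1):
    """Return the k chunk texts most similar to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

print(retrieve([0.2, 0.8], chunks))  # ['Vector search finds similar embeddings']
```

The retrieved text is then prepended to the prompt, which is what grounds the generation step in context.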

The course intends to equip professional-level machine learning practitioners with knowledge and hands-on experience in utilizing Apache Spark™ for machine learning purposes, including model fine-tuning. Additionally, the course covers using the Pandas library for scalable machine learning tasks. The initial section of the course focuses on comprehending the fundamentals of Apache Spark™ along with its machine learning capabilities. Subsequently, the second section delves into fine-tuning models using the hyperopt library. The final segment involves learning the implementation of the Pandas API within Apache Spark™, encompassing guidance on Pandas UDFs (User-Defined Functions) and the Functions API for model inference. Pre-requisites: Familiarity with Databricks workspace and notebooks; knowledge of machine learning model development and deployment with MLflow (e.g. basic understanding of DS/ML concepts, common model metrics and python libraries as well as a basic understanding of scaling workloads with Spark) Labs: Yes Certification Path: Databricks Certified Machine Learning Professional

This in-person, full-day hackathon focuses on the development of innovative AI Agents using the Databricks Data Intelligence Platform. Collaborating in teams of up to four, participants will utilize Databricks' specialized agent authoring and evaluation tools to build, test, and refine intelligent agent systems. Diverse datasets from the Databricks Marketplace are available to enhance agent capabilities. The objective is to produce a compelling proof-of-concept agent showcasing creativity, intelligent data utilization, and effective tool-calling in a novel and useful manner. This event provides a platform for demonstrating technical quality with Databricks tools, creativity in agent design or application, and clarity of purpose. The hackathon promotes hands-on experience with cutting-edge agent development tools and concludes with short team demonstrations of proofs of concept created during the event. Three finalist teams will be selected, and the winners will be announced at the end of the Hackathon. Cash prizes will be awarded to the top teams, with $10,000 for first place, $5,000 for second place, and $2,500 for third place. Complete details regarding eligibility and the rules governing this hackathon are available in the official rules, available at http://bit.ly/44HRyxz. In the event of any discrepancies between the official rules and other hackathon materials, the official rules govern.

Agenda:
- 7:30am Registration/Breakfast
- 8:15am Opening Ceremony
- 8:30am Hacking Begins
- 12:00pm-1:30pm Lunch
- 2:30pm Hacking Ends
- 2:30pm-3:45pm Expo/Judging
- 3:45pm Closing Ceremony/Winners Announced
- 4:00pm Hackathon Ends

Retrieval Augmented Generation (RAG) continues to be a foundational approach in AI despite claims of its demise. While some marketing narratives suggest RAG is being replaced by fine-tuning or long context windows, these technologies are actually complementary rather than competitive. But how do you build a truly effective RAG system that delivers accurate results in high-stakes environments? What separates a basic RAG implementation from an enterprise-grade solution that can handle complex queries across disparate data sources? And with the rise of AI agents, how will RAG evolve to support more dynamic reasoning capabilities? Douwe Kiela is the CEO and co-founder of Contextual AI, a company at the forefront of next-generation language model development. He also serves as an Adjunct Professor in Symbolic Systems at Stanford University, where he contributes to advancing the theoretical and practical understanding of AI systems. Before founding Contextual AI, Douwe was the Head of Research at Hugging Face, where he led groundbreaking efforts in natural language processing and machine learning. Prior to that, he was a Research Scientist and Research Lead at Meta’s FAIR (Fundamental AI Research) team, where he played a pivotal role in developing Retrieval-Augmented Generation (RAG)—a paradigm-shifting innovation in AI that combines retrieval systems with generative models for more grounded and contextually aware responses. In the episode, Richie and Douwe explore the misconceptions around the death of Retrieval Augmented Generation (RAG), the evolution to RAG 2.0, its applications in high-stakes industries, the importance of metadata and entitlements in data governance, the potential of agentic systems in enterprise settings, and much more. 
Links Mentioned in the Show:
- Contextual AI
- Connect with Douwe
- Course: Retrieval Augmented Generation (RAG) with LangChain
- Related Episode: High Performance Generative AI Applications with Ram Sriharsha, CTO at Pinecone
- Register for RADAR AI - June 26
- New to DataCamp? Learn on the go using the DataCamp mobile app
- Empower your business with world-class data and AI skills with DataCamp for business

Scaling AI workloads with Ray & Airflow

Ray is an open-source framework for scaling Python applications, particularly machine learning and AI workloads. It provides the layer for parallel processing and distributed computing. Many large language models (LLMs), including OpenAI's GPT models, are trained using Ray.
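Ray's core pattern is fan-out/fan-in: declare a function as a remote task, launch many copies in parallel, then gather the results. As a rough illustration of that pattern on a single machine, here is a stdlib sketch using `concurrent.futures` rather than the Ray API (Ray itself would use `@ray.remote` tasks gathered with `ray.get`, and scale the same shape across a cluster).

```python
# Illustration of the fan-out/fan-in task parallelism that Ray scales
# across a cluster, here with the stdlib on one machine. Not the Ray API:
# Ray would declare @ray.remote tasks and gather them with ray.get.
from concurrent.futures import ThreadPoolExecutor

def score(x):
    # Stand-in for an expensive per-item step (preprocessing, inference, ...).
    return x * x

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(score, range(5)))  # fan out the tasks, then gather

print(results)  # [0, 1, 4, 9, 16]
```

For CPU-bound ML workloads the win comes from running such tasks in separate processes or on separate machines, which is exactly the layer Ray provides.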

On the other hand, Apache Airflow is an established data orchestration framework, downloaded more than 20 million times monthly.

This talk presents the Airflow Ray provider package that allows users to interact with Ray from an Airflow workflow. In this talk, I'll show how to use the package to create Ray clusters and how Airflow can trigger Ray pipelines in those clusters.