talk-data.com

Topic

MLOps

machine_learning devops ai

233 tagged

Activity Trend: 26 peak/qtr, 2020-Q1 to 2026-Q1

Activities

233 activities · Newest first

Summary

When you build a machine learning model, the first step is always to load your data. Typically this means downloading files from object storage or querying a database. To speed up the process, why not build the model inside the database so that you don’t have to move the information? In this episode Paige Roberts explains the benefits of pushing the machine learning processing into the database layer and the approach that Vertica has taken for their implementation. If you are looking for a way to speed up your experimentation, or an easy way to apply AutoML, then this conversation is for you.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.

We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.

Your host is Tobias Macey and today I’m interviewing Paige Roberts about machine learning workflows inside the database.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of the current state of the market for databases that support in-process machine learning?

What are the motivating factors for running a machine learning workflow inside the database?

What styles of ML are feasible to do inside the database? (e.g. Bayesian inference, deep learning, etc.)
What are the performance implications of running a model training pipeline within the database runtime? (both in terms of training performance boosts, and database performance impacts)
Can you describe the architecture of how the machine learning process is managed by the database engine?
How do you manage interacting with Python/R/Jupyter/etc. when working within the database?
What is the impact on data pipeline and MLOps architectures when using the database to manage the machine learning workflow?
What are the most interesting, innovative, or unexpected ways that you have seen in-database ML used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on machine learning inside the database?
When is in-database ML the wrong choice?
What are the recent trends/

We talked about:

Demetrios’ background and starting the MLOps community
Growing MLOps community
Community moderation and dealing with problems
Becoming a community and connecting with people
Feeling a sense of belonging
Managing a community as an introvert
Keeping communities active
Doing custdev and talking to users
Random coffee and meeting with community members
Organizing community activities
Is community a business?
Five steps for starting a community in 2021
Shameless plug from Demetrios

Links:

https://mlops.community/

Join DataTalks.Club: https://datatalks.club/slack.html

We talked about:

Lars’ career
Doing DataOps before it existed
What is DataOps
Data platform
Main components of the data platform and tools to implement it
Books about functional programming principles
Batch vs Streaming
Maturity levels
Building self-service tools
MLOps vs DataOps
Data Mesh
Keeping track of transformations
Lake house

Links:

https://www.scling.com/reading-list/
https://www.scling.com/presentations/

Join DataTalks.Club: https://datatalks.club/slack.html

Summary

The process of building and deploying machine learning projects requires a staggering number of systems and stakeholders to work in concert. In this episode Yaron Haviv, co-founder of Iguazio, discusses the complexities inherent to the process, as well as how he has worked to democratize the technologies necessary to make machine learning operations maintainable.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.

RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.

Your host is Tobias Macey and today I’m interviewing Yaron Haviv about Iguazio, a platform for end to end automation of machine learning applications using MLOps principles.

Interview

Introduction
How did you get involved in the area of data science & analytics?
Can you start by giving an overview of what Iguazio is and the story of how it got started?
How would you characterize your target or typical customer?
What are the biggest challenges that you see around building production grade workflows for machine learning?

How does Iguazio help to address those complexities?

For customers who have already invested in the technical and organizational capacity for data science and data engineering, how does Iguazio integrate with their environments?
What are the responsibilities of a data engineer throughout the different stages of the lifecycle for a machine learning application?
Can you describe how the Iguazio platform is architected?

How has the design of the platform evolved since you first began working on it?
How have the industry best practices around bringing machine learning to production changed?

How do you approach testing/validation of machine learning applications and releasing them to production environments? (e.g. CI/CD)
Once a model is in

We covered:

What is MLOps
The difference between MLOps and ML Engineering
Getting into MLOps
Kubeflow and its components, ML Platforms
Learning Kubeflow
DataOps

And other things

Links:

Microsoft MLOps maturity model: https://docs.microsoft.com/en-us/azure/architecture/example-scenario/mlops/mlops-maturity-model
Google MLOps maturity levels: https://cloud.google.com/solutions/machine-learning/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning
MLOps roadmap 2020-2025: https://github.com/cdfoundation/sig-mlops/blob/master/roadmap/2020/MLOpsRoadmap2020.md
Kubeflow website: https://www.kubeflow.org/
TFX Paper: https://research.google/pubs/pub46484/

Join DataTalks.Club: https://datatalks.club

Summary

As more organizations are gaining experience with data management and incorporating analytics into their decision making, their next move is to adopt machine learning. In order to make those efforts sustainable, the core capability they need is for data scientists and analysts to be able to build and deploy features in a self-service manner. As a result, the feature store is becoming a required piece of the data platform. To fill that need, Kevin Stumpf and the team at Tecton are building an enterprise feature store as a service. In this episode he explains how his experience building the Michelangelo platform at Uber has informed the design and architecture of Tecton, how it integrates with your existing data systems, and the elements that are required for a well-engineered feature store.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Do you want to get better at Python? Now is an excellent time to take an online course. Whether you’re just learning Python or you’re looking for deep dives on topics like APIs, memory management, async and await, and more, our friends at Talk Python Training have a top-notch course for you. If you’re just getting started, be sure to check out the Python for Absolute Beginners course. It’s like the first year of computer science that you never took compressed into 10 fun hours of Python coding and problem solving. Go to dataengineeringpodcast.com/talkpython today and get 10% off the course that will help you find your next level. That’s dataengineeringpodcast.com/talkpython, and don’t forget to thank them for supporting the show.

You invest so much in your data infrastructure – you simply can’t afford to settle for unreliable data. Fortunately, there’s hope: in the same way that New Relic, DataDog, and other Application Performance Management solutions ensure reliable software and keep application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo’s end-to-end Data Observability Platform monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence. The platform uses machine learning to infer and learn your data, proactively identify data issues, assess its impact through lineage, and notify those who need to know before it impacts the business. By empowering data teams with end-to-end data reliability, Monte Carlo helps organizations save time, increase revenue, and restore trust in their data. Visit dataengineeringpodcast.com/montecarlo today to request a demo and see how Monte Carlo delivers data observability across your data infrastructure. The first 25 will receive a free, limited edition Monte Carlo hat!

Your host is Tobias Macey and today I’m interviewing Kevin Stumpf about Tecton and the role that the feature store plays in a modern MLOps platform.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by describing what you are building at Tecton and your motivation for starting the business?
For anyone who isn’t familiar with the concept, what is an example of a feature?
How do you define what a feature store is?
What role does a feature store play in the overall lifecycle of a machine learning p

ML Ops: Operationalizing Data Science

More than half of the analytics and machine learning (ML) models created by organizations today never make it into production. Instead, many of these ML models do nothing more than provide static insights in a slideshow. If they aren’t truly operational, these models can’t possibly do what you’ve trained them to do. This report introduces practical concepts to help data scientists and application engineers operationalize ML models to drive real business change. Through lessons based on numerous projects around the world, six experts in data analytics provide an applied four-step approach (Build, Manage, Deploy and Integrate, and Monitor) for creating ML-infused applications within your organization.

You’ll learn how to:
Fulfill data science value by reducing friction throughout ML pipelines and workflows
Constantly refine ML models through retraining, periodic tuning, and even complete remodeling to ensure long-term accuracy
Design the ML Ops lifecycle to ensure that people-facing models are unbiased, fair, and explainable
Operationalize ML models not only for pipeline deployment but also for external business systems that are more complex and less standardized
Put the four-step Build, Manage, Deploy and Integrate, and Monitor approach into action

From theory to reality: How AI transforms network operations

AI is reshaping NetOps from scripted automation to intelligent, data-driven workflows. We will show use cases (incident triage, knowledge retrieval, traffic analysis, and prediction) and contrast legacy monitoring with ML, NLP, and LLMs. See how RAG, text-to-SQL, and agent workflows enable real-time insights across hybrid data. We will outline data pipelines and MLOps; address accuracy, reliability, cost, and compliance; and weigh build vs. buy. We will also cover API integration and human-in-the-loop guardrails.