PyTorch

LLMs in the Trenches: What Breaks and Why (w/ Andriy Burkov)

2025-09-05 · Mavens of Data

podcast_episode

by Andriy Burkov (TalentNeuron)

AI/ML Analytics LLM

LLMs seem like a hot solution now, until you try deploying one. In this episode, Andriy Burkov, machine learning expert and author of The Hundred-Page Machine Learning Book, joins us for a grounded, sometimes blunt conversation about why many LLM applications fail. We talk about sentiment analysis, difficulty with taxonomy, agents getting tripped up on formatting, and why MCP might not solve your problems. If you're tired of the hype and want to understand the real state of applied LLMs, this episode delivers.

What You'll Learn:

What is often misunderstood about LLMs

The reliability of sentiment analysis

How can we make agents more resilient?

📚 Check out Andriy's books on machine learning and LLMs: The Hundred-Page Machine Learning Book and The Hundred-Page Language Models Book: hands-on with PyTorch

🤝 Follow Andriy on LinkedIn!

Register for free to be part of the next live session: https://bit.ly/3XB3A8b

Follow us on Socials: LinkedIn, YouTube, Instagram (Mavens of Data), Instagram (Maven Analytics), TikTok, Facebook, Medium, X/Twitter

#302 Making AI Applications like Greased Lightning with William Falcon, CEO at Lightning AI

2025-05-19 · DataFramed

podcast_episode

by Richie (DataCamp), William Falcon (Lightning AI)

AI/ML

AI tooling continues to expand with specialized solutions for every step of the development process. For data scientists and engineers, this creates a paradox: more options but potentially more complexity and integration challenges. How do you determine which tools actually improve productivity versus adding unnecessary overhead? Should you prioritize flexibility with individual best-of-breed components or streamline with integrated platforms? What's the most effective way to bridge the gap between experimentation and production-ready AI applications?

William Falcon is an AI researcher and the CEO of Lightning AI. He is the creator of PyTorch Lightning, a lightweight framework designed for training models of any size. As the founder of Lightning AI, he leads the development of Lightning AI Studios and the AI Hub. Falcon also shares his expertise in AI research and machine learning engineering through educational content on YouTube and X (formerly Twitter). He is passionate about leveraging AI for social impact.

In the episode, Richie and William explore the NY AI hub, the journey from AI idea to production, diverse perspectives in AI development, how Lightning AI simplifies AI workflows, the significance of open-source models, and much more.

Links Mentioned in the Show:

Lightning AI

PyTorch Lightning

Connect with William

Course: Introduction to Deep Learning in PyTorch

Related Episode: Building Multi-Modal AI Applications with Russ d'Sa, CEO & Co-founder of LiveKit

Rewatch sessions from RADAR: Skills Edition

New to DataCamp? Learn on the go using the DataCamp mobile app

Empower your business with world-class data and AI skills with DataCamp for business

#241 Getting Generative AI Into Production with Lin Qiao, CEO and Co-Founder of Fireworks AI

2024-09-05 · DataFramed

podcast_episode

by Lin Qiao (Fireworks AI)

AI/ML GenAI IBM LLM

Lots of AI use cases can start with big ideas and exciting possibilities, but turning those ideas into real results is where the challenge lies. How do you take a powerful model and make it work effectively in a specific business context? What steps are necessary to fine-tune and optimize your AI tools to deliver both performance and cost efficiency? And as AI continues to evolve, how do you stay ahead of the curve while ensuring that your solutions are scalable and sustainable?

Lin Qiao is the CEO and Co-Founder of Fireworks AI. She previously worked at Meta as a Senior Director of Engineering and as head of Meta's PyTorch, served as a Tech Lead at LinkedIn, and worked as a Researcher and Software Engineer at IBM.

In the episode, Richie and Lin explore generative AI use cases, getting AI into products, foundational models, the effort required and benefits of fine-tuning models, trade-offs between model sizes, use cases for smaller models, cost-effective AI deployment, the infrastructure and team required for AI product development, metrics for AI success, open vs closed-source models, excitement for the future of AI development and much more.

Links Mentioned in the Show:

Fireworks.ai

Hugging Face - Preference Tuning LLMs with Direct Preference Optimization Methods

Connect with Lin

Course - Artificial Intelligence (AI) Strategy

Related Episode: Creating Custom LLMs with Vincent Granville, Founder, CEO & Chief AI Scientist at GenAItechLab.com

Rewatch sessions from RADAR: AI Edition

New to DataCamp? Learn on the go using the DataCamp mobile app

Empower your business with world-class data and AI skills with DataCamp for business

Cloud Native Data Orchestration For Machine Learning And Data Engineering With Flyte

2022-05-23 · Data Engineering Podcast

podcast_episode

by Ketan Umare (Union), Haytham Abuelfutuh (Union), Tobias Macey

AI/ML Airflow API Arrow AWS BI CDP CI/CD Cloud Computing Dagster Data Contracts Data Engineering +15 more

Summary

Machine learning has become a meaningful target for data applications, bringing with it an increase in the complexity of orchestrating the entire data flow. Flyte is a project that was started at Lyft to address their internal needs for machine learning and integrated closely with Kubernetes as the execution manager. In this episode Ketan Umare and Haytham Abuelfutuh share the story of the Flyte project and how their work at Union is focused on supporting and scaling the code and community that has made Flyte successful.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl

RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.

Data lake architectures provide the best combination of massive scalability and cost reduction, but they aren’t always the most performant option. That’s why Kyligence has built on top of the leading open source OLAP engine for data lakes, Apache Kylin. With their AI augmented engine they detect patterns from your critical queries, automatically build data marts with optimized table structures, and provide a unified SQL interface across your lake, cubes, and indexes. Their cost-based query router will give you interactive speeds across petabyte scale data sets for BI dashboards and ad-hoc data exploration. Stop struggling to speed up your data lake. Get started with Kyligence today at dataengineeringpodcast.com/kyligence

Your host is Tobias Macey and today I’m interviewing Ketan Umare and Haytham Abuelfutuh about Flyte, the open source and Kubernetes-native orchestration engine for your data systems.

Interview

Introduction How did you get involved in the area of data management? Can you describe what Flyte is and the story behind it? What was missing in the ecosystem of available tools that made it necessary/worthwhile to create Flyte? Workflow orchestrators have been around for several years and have gone through a number of generational shifts. How would you characterize Flyte’s position in the ecosystem?

What do you see as the closest alternatives? What are the core differentiators that might lead someone to choose Flyte over e.g. Airflow/Prefect/Dagster?

What are the core primitives that Flyte exposes for building up complex workflows?

Machine learning use cases have been a core focus since the project’s inception. What are some of the ways that that manifests in the design and feature set?

Can you describe the architecture of Flyte?

How have the design and goals of the platform changed/evolved since you first started working on it?

What are the changes in the data ecosystem that have had the most substantial impact on the Flyte project? (e.g. roadmap, integrations, pushing people toward adoption, etc.) What is the process for setting up a Flyte deployment? What are the user personas that you prioritize in the design and feature development for Flyte? What is the workflow for someone building a new pipeline in Flyte?

What are the patterns that you and the community have established to encourage discovery and reuse of granular task definitions? Beyond code reuse, how can teams scale usage of Flyte at the company/organization level?

What are the affordances that you have created to facilitate local development and testing of workflows while ensuring a smooth transition to production?

What are the patterns that are available for CI/CD of workflows using Flyte?

How have you approached the design of data contracts/type definitions to provide a consistent/portable API for defining inter-task dependencies across languages? What are the available interfaces for extending Flyte and building integrations with other components across the data ecosystem? Data orchestration engines are a natural point for generating and taking advantage of rich metadata. How do you manage creation and propagation of metadata within and across the framework boundaries? Last year you founded Union to offer a managed version of Flyte. What are the features that you are offering beyond what is available in the open source?

What are the opportunities that you see for the Flyte ecosystem with a corporate entity to invest in expanding adoption?

What are the most interesting, innovative, or unexpected ways that you have seen Flyte used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Flyte? When is Flyte the wrong choice? What do you have planned for the future of Flyte?

Contact Info

Ketan Umare Haytham Abuelfutuh

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.init to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers.

Links

Flyte

Slack Channel

Union.ai Kubeflow Airflow AWS Step Functions Protocol Buffers XGBoost MLflow Dagster

Podcast Episode

Prefect

Podcast Episode

Arrow Parquet Metaflow PyTorch

Podcast.init Episode

dbt FastAPI

Podcast.init Interview

Python Type Annotations Modin

Podcast.init Interview

Monad DataHub

Podcast Episode

OpenMetadata

Podcast Episode

Hudi

Podcast Episode

Iceberg

Podcast Episode

Great Expectations

Podcast Episode

Pandera Union ML Weights and Biases Whylogs

Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Data Orchestration For Hybrid Cloud Analytics

2019-10-22 · Data Engineering Podcast

podcast_episode

by Dipti Borkar (Microsoft), Tobias Macey

AI/ML Analytics AWS Big Data Cloud Computing Data Engineering Data Lake Data Management Datacoral DWH Hadoop HDFS +14 more

Summary

The scale and complexity of the systems that we build to satisfy business requirements is increasing as the available tools become more sophisticated. In order to bridge the gap between legacy infrastructure and evolving use cases it is necessary to create a unifying set of components. In this episode Dipti Borkar explains how the emerging category of data orchestration tools fills this need, some of the existing projects that fit in this space, and some of the ways that they can work together to simplify projects such as cloud migration and hybrid cloud environments. It is always useful to get a broad view of new trends in the industry and this was a helpful perspective on the need to provide mechanisms to decouple physical storage from computing capacity.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!

This week’s episode is also sponsored by Datacoral, an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more.

You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.

Your host is Tobias Macey and today I’m interviewing Dipti Borkar about data orchestration and how it helps in migrating data workloads to the cloud.

Interview

Introduction How did you get involved in the area of data management? Can you start by describing what you mean by the term "Data Orchestration"?

How does it compare to the concept of "Data Virtualization"? What are some of the tools and platforms that fit under that umbrella?

What are some of the motivations for organizations to use the cloud for their data oriented workloads?

What are they giving up by using cloud resources in place of on-premises compute?

For businesses that have invested heavily in their own datacenters, what are some ways that they can begin to replicate some of the benefits of cloud environments? What are some of the common patterns for cloud migration projects and what challenges do they present?

Do you have advice on useful metrics to track for determining project completion or success criteria?

How do businesses approach employee education for designing and implementing effective systems for achieving their migration goals? Can you talk through some of the ways that different data orchestration tools can be composed together for a cloud migration effort?

What are some of the common pain points that organizations encounter when working on hybrid implementations?

What are some of the missing pieces in the data orchestration landscape?

Are there any efforts that you are aware of that are aiming to fill those gaps?

Where is the data orchestration market heading, and what are some industry trends that are driving it?

What projects are you most interested in or excited by?

For someone who wants to learn more about data orchestration and the benefits the technologies can provide, what are some resources that you would recommend?

Contact Info

LinkedIn @dborkar on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.init to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

Alluxio

Podcast Episode

UC San Diego Couchbase Presto

Podcast Episode

Spark SQL Data Orchestration Data Virtualization PyTorch

Podcast.init Episode

Rook storage orchestration PySpark MinIO

Podcast Episode

Kubernetes OpenStack Hadoop HDFS Parquet Files

Podcast Episode

ORC Files Hive Metastore Iceberg Table Format

Podcast Episode

Data Orchestration Summit Star Schema Snowflake Schema Data Warehouse Data Lake Teradata

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Machine Learning In The Enterprise

2019-02-11 · Data Engineering Podcast

podcast_episode

by Kevin Dewalt (Prolego), Tobias Macey

AI/ML Airflow CI/CD Data Engineering Data Management Data Science DevOps Git Jenkins Keras Pandas SQL +1 more

Summary

Machine learning is a class of technologies that promise to revolutionize business. Unfortunately, it can be difficult to identify and execute on ways that it can be used in large companies. Kevin Dewalt founded Prolego to help Fortune 500 companies build, launch, and maintain their first machine learning projects so that they can remain competitive in our landscape of constant change. In this episode he discusses why machine learning projects require a new set of capabilities, how to build a team from internal and external candidates, and how an example project progressed through each phase of maturity. This was a great conversation for anyone who wants to understand the benefits and tradeoffs of machine learning for their own projects and how to put it into practice.

Introduction

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.

Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Your host is Tobias Macey and today I’m interviewing Kevin Dewalt about his experiences at Prolego, building machine learning projects for Fortune 500 companies.

Interview

Introduction How did you get involved in the area of data management? For the benefit of software engineers and team leaders who are new to machine learning, can you briefly describe what machine learning is and why is it relevant to them? What is your primary mission at Prolego and how did you identify, execute on, and establish a presence in your particular market?

How much of your sales process is spent on educating your clients about what AI or ML are and the benefits that these technologies can provide?

What have you found to be the technical skills and capacity necessary for being successful in building and deploying a machine learning project?

When engaging with a client, what have you found to be the most common areas of technical capacity or knowledge that are needed?

Everyone talks about a talent shortage in machine learning. Can you suggest a recruiting or skills development process for companies which need to build out their data engineering practice? What challenges will teams typically encounter when creating an efficient working relationship between data scientists and data engineers? Can you briefly describe a successful project of developing a first ML model and putting it into production?

What is the breakdown of how much time was spent on different activities such as data wrangling, model development, and data engineering pipeline development? When releasing to production, can you share the types of metrics that you track to ensure the health and proper functioning of the models? What does a deployable artifact for a machine learning/deep learning application look like?

What basic technology stack is necessary for putting the first ML models into production?

How does the build vs. buy debate break down in this space and what products do you typically recommend to your clients?

What are the major risks associated with deploying ML models and how can a team mitigate them? Suppose a software engineer wants to break into ML. What data engineering skills would you suggest they learn? How should they position themselves for the right opportunity?

Contact Info

Email: Kevin Dewalt [email protected] and Russ Rands [email protected]

Connect on LinkedIn: Kevin Dewalt and Russ Rands

Twitter: @kevindewalt

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Prolego Download our book: Become an AI Company in 90 Days Google Rules Of ML AI Winter Machine Learning Supervised Learning O’Reilly Strata Conference GE Rebranding Commercials Jez Humble: Stop Hiring Devops Experts (And Start Growing Them) SQL ORM Django RoR TensorFlow PyTorch Keras Data Engineering Podcast Episode About Data Teams DevOps For Data Teams – DevOps Days Boston Presentation by Tobias Jupyter Notebook Data Engineering Podcast: Notebooks at Netflix Pandas

Podcast Interview

Joel Grus

JupyterCon Presentation Data Science From Scratch

Expensify Airflow

James Meickle Interview

Git Jenkins Continuous Integration Practical Deep Learning For Coders Course by Jeremy Howard Data Carpentry

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Snorkel: Extracting Value From Dark Data with Alex Ratner - Episode 15

2018-01-22 · Data Engineering Podcast

podcast_episode

by Alex Ratner (Snorkel), Tobias Macey

AI/ML Big Data Data Collection Data Engineering Data Management GitHub Linux TensorFlow

Summary

The majority of the conversation around machine learning and big data pertains to well-structured and cleaned data sets. Unfortunately, that is just a small percentage of the information that is available, so the rest of the sources of knowledge in a company are housed in so-called “Dark Data” sets. In this episode Alex Ratner explains how the work that he and his fellow researchers are doing on Snorkel can be used to extract value by leveraging labeling functions written by domain experts to generate training sets for machine learning models. He also explains how this approach can be used to democratize machine learning by making it feasible for organizations with smaller data sets than those required by most tooling.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure.

When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.

Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers.

Your host is Tobias Macey and today I’m interviewing Alex Ratner about Snorkel and Dark Data.

Interview

Introduction How did you get involved in the area of data management? Can you start by sharing your definition of dark data and how Snorkel helps to extract value from it? What are some of the most challenging aspects of building labelling functions and what tools or techniques are available to verify their validity and effectiveness in producing accurate outcomes? Can you provide some examples of how Snorkel can be used to build useful models in production contexts for companies or problem domains where data collection is difficult to do at large scale? For someone who wants to use Snorkel, what are the steps involved in processing the source data and what tooling or systems are necessary to analyse the outputs for generating usable insights? How is Snorkel architected and how has the design evolved over its lifetime? What are some situations where Snorkel would be poorly suited for use? What are some of the most interesting applications of Snorkel that you are aware of? What are some of the other projects that you and your group are working on that interact with Snorkel? What are some of the features or improvements that you have planned for future releases of Snorkel?

Contact Info

Website ajratner on GitHub @ajratner on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Stanford DAWN HazyResearch Snorkel Christopher Ré Dark Data DARPA Memex Training Data FDA ImageNet National Library of Medicine Empirical Studies of Conflict Data Augmentation PyTorch TensorFlow Generative Model Discriminative Model Weak Supervision

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
