As generative and agentic AI systems move from prototypes to production, builders must balance innovation with trust, safety, and compliance. This talk explores the unique evaluation and monitoring challenges of next-generation AI, with healthcare, one of the most heavily regulated domains, as a case study:
Evaluation gaps: why conventional benchmarks miss multi-step reasoning, tool use, and domain-specific workflows, and how contamination and fragile metrics distort results.
Bias and safety: demographic bias, hallucinations, and unsafe autonomy that trigger regulatory, legal, and contractual obligations for fairness and safety assessments.
Continuous monitoring: practical MLOps strategies for drift detection, risk scoring, and compliance auditing in deployed systems (a sketch of one such check follows below).
Tools and standards: open-source libraries like LangTest and HELM, new stress-test and red-teaming datasets, and emerging guidance from NIST, CHAI, and ISO.
While the examples draw heavily from healthcare, the lessons apply broadly to anyone building or deploying generative or agentic AI systems in highly regulated industries where safety, fairness, and compliance are paramount.
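The continuous-monitoring point is the most concrete to illustrate in code. Below is a minimal sketch, not taken from the talk itself, of a drift check that compares a live window of a numeric model input against a reference sample using a two-sample Kolmogorov-Smirnov test; the feature, threshold, and alerting action are all hypothetical.

```python
# Minimal drift-detection sketch (illustrative; the talk's actual tooling is
# not specified here). Flags when a production feature window diverges from
# the reference distribution captured at training/validation time.
import numpy as np
from scipy.stats import ks_2samp

def drift_detected(reference: np.ndarray, live: np.ndarray, alpha: float = 0.01) -> bool:
    """Two-sample KS test: True if the live window has likely drifted."""
    result = ks_2samp(reference, live)
    return result.pvalue < alpha  # small p-value => distributions diverged

# Hypothetical usage with synthetic data standing in for a real feature
# (e.g., clinical note length, a lab value, or prompt token count).
rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=5_000)  # reference sample
live = rng.normal(0.4, 1.0, size=1_000)       # shifted production window
if drift_detected(reference, live):
    print("Drift detected: flag for review / retraining")
```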
Speaker
David Talby
4 talks
David Talby is the CEO of John Snow Labs and Pacific AI, helping organizations apply artificial intelligence to solve real-world problems in healthcare and life sciences. He has extensive experience building and running web-scale software platforms and teams at startups, on open-source projects, at Microsoft’s Bing in the US and Europe, and scaling Amazon’s financial systems in Seattle and the UK. Talby holds a Ph.D. in Computer Science and master’s degrees in Computer Science and Business Administration. He was named USA CTO of the Year by the Global 100 Awards in 2022, the Game Changers Awards in 2023, and the ACQ5 Global Awards in 2025.
Bio from: Databricks DATA + AI Summit 2023
Talks & appearances
4 activities · Newest first
Large language models provide a leap in capabilities for understanding medical language and context, from passing the US medical licensing exam to summarizing clinical notes. They also suffer from a wide range of issues (hallucinations, robustness, privacy, bias) that block many use cases. This session shares currently deployed software, lessons learned, and best practices from John Snow Labs’ work enabling academic medical centers, pharmaceutical companies, and health IT companies to build LLM-based solutions.
First, we cover benchmarks for new healthcare-specific large language models, showing how tuning LLMs specifically on medical data and tasks yields higher accuracy on use cases such as question answering, information extraction, and summarization than general-purpose LLMs like GPT-4. Second, we share an architecture for medical chatbots that tackles hallucinations, outdated content, privacy, and building a longitudinal view of patients. Third, we present a comprehensive solution for testing LLMs beyond accuracy (for bias, fairness, representation, robustness, and toxicity) using the open-source nlptest library, sketched below.
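To give a flavor of that third part, here is a minimal sketch of the Harness pattern used by nlptest (since renamed LangTest). The chained generate/run/report calls follow the library's documented quickstart, but the task, model choice, and test coverage shown here are illustrative assumptions rather than the configuration used in the talk, and exact arguments vary by version.

```python
# Illustrative LangTest/nlptest sketch; the model and test selection are
# assumptions, not John Snow Labs' actual setup.
from langtest import Harness

harness = Harness(
    task="text-classification",
    model={"model": "lvwerra/distilbert-imdb", "hub": "huggingface"},
)

# Generate perturbed test cases (typos, casing changes, bias probes, ...),
# run the model on them, and report pass rates per test category.
harness.generate().run().report()
```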
Talk by: David Talby
Ask questions of a panel of data science experts who have deployed LLMs and AI models into production.
Talk by: David Talby, Conor Murphy, Cheng Yin Eng, Sam Raymond, and Colton Peltier
As generative and agentic AI systems move from prototypes to production, builders must balance innovation with trust, safety, and compliance. This talk covers evaluation gaps (multi-step reasoning, tool use, domain-specific workflows; contamination and fragile metrics), bias and safety (demographic bias, hallucinations, unsafe autonomy with regulatory and legal obligations), continuous monitoring (MLOps strategies for drift detection, risk scoring, and compliance auditing in deployed systems), and tools and standards (open-source libraries like LangTest and HELM, stress-test and red-teaming datasets, and guidance from NIST, CHAI, and ISO).