talk-data.com

Natural Language Processing (NLP)

ai machine_learning text_analysis

Activities

252 activities · Newest first

Explainable Data Drift for NLP

Detecting data drift, although far from a solved problem for tabular data, has become a common way to monitor ML models in production. For Natural Language Processing (NLP), on the other hand, the question remains mostly open. In this session, we will present and compare two approaches. In the first approach, we will demonstrate how extracting a wide range of explainable properties per document (such as topics, language, sentiment, named entities, and keywords) lets us explore potential sources of drift. We will show how these properties can be tracked consistently over time, how they can be used to detect meaningful data drift as soon as it occurs, and how they can be used to explain and fix the root cause.
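As a rough illustration of tracking one explainable property over time, the sketch below compares the distribution of a per-document "language" property between a reference window and a current window. The Population Stability Index used here is a stand-in chosen for the example, not necessarily the method presented in the talk, and the sample data and function names are hypothetical.

```python
import math
from collections import Counter


def category_shares(values, categories, eps=1e-6):
    """Share of documents per category, floored at eps to avoid log(0)."""
    counts = Counter(values)
    total = len(values)
    return {c: max(counts.get(c, 0) / total, eps) for c in categories}


def psi(reference, current):
    """Population Stability Index between two samples of a categorical property."""
    categories = set(reference) | set(current)
    ref = category_shares(reference, categories)
    cur = category_shares(current, categories)
    return sum((cur[c] - ref[c]) * math.log(cur[c] / ref[c]) for c in categories)


# Hypothetical "language" property, assumed to be extracted per document upstream.
reference_langs = ["en"] * 90 + ["de"] * 10
current_langs = ["en"] * 60 + ["de"] * 10 + ["es"] * 30

unchanged_score = psi(reference_langs, reference_langs)
shifted_score = psi(reference_langs, current_langs)
print(f"unchanged: {unchanged_score:.3f}, shifted: {shifted_score:.3f}")
```

Because the score is computed per property, a high value points directly at the property that moved (here, the appearance of Spanish documents), which is what makes this style of drift detection explainable.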

The second approach detects drift by using the embeddings of common foundation models (such as GPT-3 in the OpenAI model family) to identify areas in the embedding space in which significant drift has occurred. These areas should then be characterized in a human-readable way to enable root cause analysis of the detected drift. We will compare the performance and explainability of these two methods and explore the pros and cons of each approach.
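A minimal sketch of embedding-based drift scoring, assuming the embeddings have already been computed by some model. It compares only the centroids of two embedding sets by cosine distance, which is a much coarser signal than locating specific drifted regions of the embedding space as described above; the vectors and function names are made up for illustration.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def embedding_drift(reference, current):
    """Coarse drift score: cosine distance between the two centroids."""
    return cosine_distance(centroid(reference), centroid(current))


# Toy 2-dimensional "embeddings" (real ones have hundreds of dimensions).
reference = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]]
stable = [[0.95, 0.05], [1.0, 0.0]]
drifted = [[0.1, 1.0], [0.0, 0.9]]

print(embedding_drift(reference, stable))
print(embedding_drift(reference, drifted))
```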

Talk by: Noam Bressler

Here’s more to explore: LLM Compact Guide: https://dbricks.co/43WuQyb Big Book of MLOps: https://dbricks.co/3r0Pqiz

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Scaling AI Applications with Databricks, HuggingFace and Pinecone

Producing and managing large-scale vector embeddings can be challenging. The integration of Databricks, Hugging Face and Pinecone offers a powerful solution. Vector embeddings have become an essential tool in the development of AI-powered applications. Embeddings are representations of data learned by machine learning models. High-quality embeddings are unlocking use cases like semantic search, recommendation engines, and anomaly detection. Databricks' Apache Spark™ ecosystem, together with Hugging Face's Transformers library, enables large-scale production of vector embeddings using GPU processing. Pinecone's vector database provides ultra-low-latency querying and upserting of billions of embeddings, allowing for high-quality embeddings at scale for real-time AI apps.

In this session, we will present a concrete use case of this integration in a natural language processing application. We will demonstrate how Pinecone's vector database can be integrated with Databricks and Hugging Face to produce large-scale vector embeddings of text data, and how these embeddings can be used to improve the performance of various AI applications. By leveraging the GPU processing capabilities of Databricks and the ultra-low-latency querying capabilities of Pinecone, we can significantly improve the performance of NLP tasks while reducing the cost and complexity of managing large-scale vector embeddings. You will learn the technical details of this integration, how it can be implemented in your own AI projects, and what it offers in terms of speed, scalability, and cost efficiency.
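To make the upsert and query operations concrete, here is a toy in-memory index that ranks stored vectors by cosine similarity. It is only a sketch of the idea: the class and method names are hypothetical, it does not use or mirror the Pinecone client API, and a real vector database relies on approximate nearest-neighbor indexes rather than the brute-force scan shown here.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


class InMemoryVectorIndex:
    """Hypothetical stand-in for a vector database, for illustration only."""

    def __init__(self):
        self._vectors = {}

    def upsert(self, items):
        """Insert new (id, vector) pairs or overwrite existing ids."""
        for item_id, vector in items:
            self._vectors[item_id] = vector

    def query(self, vector, top_k=3):
        """Return the top_k (id, score) pairs by cosine similarity."""
        scored = [
            (item_id, cosine_similarity(vector, stored))
            for item_id, stored in self._vectors.items()
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


index = InMemoryVectorIndex()
index.upsert([("doc-a", [1.0, 0.0]), ("doc-b", [0.0, 1.0]), ("doc-c", [0.7, 0.7])])
matches = index.query([1.0, 0.1], top_k=2)
print(matches)
```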

Talk by: Roie Schwaber-Cohen

Demonstrate-Search-Predict: Composing Retrieval and Language Models for Knowledge-Intensive NLP

In this talk, you will learn how retrieval-augmented in-context learning has emerged as a powerful approach for addressing knowledge-intensive tasks using frozen language models (LMs) and retrieval models (RMs). Existing work has combined these in simple “retrieve-then-read” pipelines in which the RM retrieves passages that are inserted into the LM prompt.
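The baseline "retrieve-then-read" pattern can be sketched as follows. This is an illustration under simplifying assumptions: the retriever is a toy word-overlap ranker rather than a trained RM, the language model is passed in as a plain callable, and `echo_lm` is a hypothetical stub that returns the prompt so the example runs without any model.

```python
import re


def tokenize(text):
    """Lowercased word tokens, punctuation dropped."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question, passages, k=2):
    """Toy retriever: rank passages by word overlap with the question."""
    question_tokens = tokenize(question)
    ranked = sorted(
        passages,
        key=lambda passage: len(question_tokens & tokenize(passage)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, retrieved):
    """Insert the retrieved passages into the LM prompt ahead of the question."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(retrieved))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def retrieve_then_read(question, passages, lm, k=2):
    """Retrieve passages, then let the (frozen) LM read them in the prompt."""
    return lm(build_prompt(question, retrieve(question, passages, k)))


def echo_lm(prompt):
    """Hypothetical stub standing in for a real language model call."""
    return prompt


passages = [
    "Paris is the capital of France.",
    "The Nile is a river in Africa.",
    "Berlin is the capital of Germany.",
]
question = "What is the capital of France?"
print(retrieve_then_read(question, passages, echo_lm, k=1))
```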

To begin to fully realize the potential of frozen LMs and RMs, we propose Demonstrate–Search–Predict (DSP), a framework that relies on passing natural language texts in sophisticated pipelines between an LM and an RM. DSP can express high-level programs that bootstrap pipeline-aware demonstrations, search for relevant passages, and generate grounded predictions, systematically breaking down problems into small transformations that the LM and RM can handle more reliably.

We have written novel DSP programs for answering questions in open-domain, multi-hop, and conversational settings, establishing in early evaluations new state-of-the-art in-context learning results and delivering 37–125%, 8–40%, and 80–290% relative gains against vanilla LMs, a standard retrieve-then-read pipeline, and a contemporaneous self-ask pipeline, respectively.

Talk by: Keshav Santhanam

Here’s more to explore: State of Data + AI Report: https://dbricks.co/44i2HBp Databricks named a Leader in 2022 Gartner® Magic Quadrant™ CDBMS: https://dbricks.co/3phw20d

How We Made a Unified Talent Solution Using Databricks Machine Learning, Fine-Tuned LLM & Dolly 2.0

Using Databricks, we built a “Unified Talent Solution” backed by a robust data and AI engine. It analyzes the skills of a combined pool of permanent employees, contractors, part-time employees and vendors, and infers skill gaps, future trends and recommended priority areas for bridging talent gaps. This ultimately greatly improved our client's operational efficiency, transparency, commercial model, and talent experience. We leveraged a variety of ML algorithms, such as boosting, neural networks and NLP transformers, to provide better AI-driven insights.

One inevitable part of developing these models within a typical DS workflow is iteration. Databricks' end-to-end ML/DS workflow service, MLflow, helped streamline this process by organizing iterations into experiments that tracked the data used for training and testing, model artifacts, lineage and the corresponding results and metrics. For checking the health of our models using drift detection, bias and explainability techniques, MLflow's deployment and monitoring services were leveraged extensively.

Our solution, built on the Databricks platform, simplified ML by defining a data-centric workflow that unified best practices from DevOps, DataOps, and ModelOps. Databricks Feature Store allowed us to productionize our models and features jointly. Insights were presented as visually appealing charts and graphs, built with Power BI, Plotly and Matplotlib, that answer the business questions most relevant to clients. We built our own advanced custom analytics platform on top of Delta Lake, as Delta’s ACID guarantees allow us to build a real-time reporting app that displays consistent and reliable data. The app uses React for the front end and Structured Streaming to ingest data from Delta tables, with live query analytics on real-time data and ML predictions based on the analytics data.

Talk by: Nitu Nivedita

Using NLP to Evaluate 100 Million Global Webpages Daily to Contextually Target Consumers

This session covers the challenges The Trade Desk (TTD) faced, and the solution it built, in scaling its NLP models to 100 million web pages per day.

TTD's contextual targeting team needs to analyze 100 million web pages per day. Fifty percent of the web pages are non-English, so half of the content was not being properly analyzed and targeted intelligently. TTD attempted to build a model using Spark NLP; however, the package could not scale: GPU utilization was low and the solution was cost-prohibitive. TTD engaged with Databricks in early 2022 to build an NLP model on Databricks, and our teams partnered closely. We were able to build a solution using distributed inference (150-200 GPUs running at 80%+ utilization). Each day, Databricks translates 50 million web pages in more than 35 languages, two hundred times faster and at a fraction of the cost. This solution enables TTD teams to standardize on English for contextual targeting ML models. TTD can now be a one-stop shop for their customers' global advertising needs.

The Trade Desk is headquartered in Ventura, California. It is the largest independent demand-side platform in the world, competing against Google, Facebook, and others. Unlike traditional marketing, programmatic marketing is operated by real-time, split-second decisions based on user identity, device information, and other data points. It enables highly personalized consumer experiences and improves return-on-investment for companies and advertisers.

Talk by: Xuefu Wang and Mark Lee

Maddie is a Sr. ML / Research Engineer in industry, published author and seasoned open-source AI leader, with 6+ years of experience in ML R&D. Her areas of interest include generative models, NLP and Human <> AI interactions. She was also a 2x startup founder, a Blockchain educator/researcher, Founder of Women Who Code - Data Science, and technical advisor to various startups and Di…

Learning Journey - Session 4 of the Certification Study Group for Google Cloud Professional Machine Learning Engineer certification. Speakers: Roya Kandalan (Senior Research Scientist, Aware, Inc.); Maddie Shang (Sr. AI Research Engineer, OpenMined); Linda Kovacs (Software Engineer, Accenture).

Presentation of the beta version of Code Interpreter and a demonstration of its capabilities, including: mathematical calculations (algebra, trigonometry, statistics), data manipulation and analysis, data visualization, running Python scripts, training and evaluating machine learning models, and text and natural language processing (tokenization, stemming, word frequency, etc.). Note that the tool is restricted by security rules (no Internet access, no requests to external APIs, and no downloading files from the web).
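As a small example of the kind of text processing mentioned above, the following sketch tokenizes a string and counts word frequencies using only the Python standard library. The regular-expression tokenizer is a simplification chosen for the example; stemming is left out because it would require a dedicated library or algorithm.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercased word tokens, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


def word_frequencies(text):
    """Count how often each token occurs in the text."""
    return Counter(tokenize(text))


frequencies = word_frequencies("The cat saw the other cat.")
print(frequencies)
```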

How would you model the mental hops that lead from one word to the next? And what about when the starting point is not a word but concepts grounded explicitly or implicitly in an image? These questions, and more, were the topic of my latest research project. Working to automatically generate image-term pairs for an image-grounded, collaborative Wordle game, I looked for combinations that spark the desired type of dialogue, one that illuminates the participants' decision-making. The project fits the broader efforts toward natural language explainability that Prof. Schlangen’s research group at the University of Potsdam is undertaking. We will look at the method I developed from an engineering perspective, going over all the NLP concepts composing it, and touch upon a bit of linguistics theory too. Level: Beginner to the domain (already familiar with Python)

Maddie Shang - OpenMined (Sr. AI Research Engineer)

Maddie is a Sr. ML / Research Engineer in industry, published author and seasoned open-source AI leader, with 6+ years of experience in ML R&D. Her areas of interest include generative models, NLP and Human <> AI interactions. She was also a 2x startup founder, a Blockchain educator/researcher, Founder of Women Who Code - Data Science, and technical advisor to various startups and Di…

The first is recording patient-doctor consultations: transcription using audio-to-text models, followed by summarising and reformatting into professional clinical documentation using LLMs. The second is using NLP and vector databases to search clinical guideline documents efficiently, along with the ability to interact with the guidelines using NLP.

There's a rush from many countries to regulate AI. Murielle Popa-Fabre is an NLP and ML expert currently working for the Council of Europe, building an international Framework Convention on AI that will touch a wider number of countries (46) than the EU AI Act (only for EU countries). We chat about her path from academia to working in regulation, and the upcoming EU, the Council of Europe, and G7 regulations on AI. These regulations will have a historical impact on what happens next with AI, and it will be very interesting to see where things go from here.

Murielle's LinkedIn: https://www.linkedin.com/in/murielle-popa-fabre-b563187b/


If you like this show, give it a 5-star rating on your favorite podcast platform.

Purchase Fundamentals of Data Engineering at your favorite bookseller.

Subscribe to my Substack: https://joereis.substack.com/

Continuous Data Pipeline for Real time Benchmarking & Data Set Augmentation | Teleskope

ABOUT THE TALK: Building and curating representative datasets is crucial for accurate ML systems. Monitoring metrics post-deployment helps improve the model. Unstructured language models may face data shifts, leading to unpredictable inferences. Open-source APIs and annotation tools streamline annotation and reduce analyst workload.

This talk discusses generating datasets and real-time precision/recall splits to detect data shifts, prioritize data collection, and retrain models.
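One way to picture the precision/recall splits described above is to compute both metrics separately for each time window of labeled predictions, so that a drop in a later window can flag a possible data shift. The record layout and function name below are assumptions made for this sketch, not the speaker's implementation.

```python
from collections import defaultdict


def precision_recall_by_window(records):
    """Compute (precision, recall) per window.

    records: iterable of (window, predicted, actual) with boolean labels.
    A metric is None when its denominator is zero.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for window, predicted, actual in records:
        c = counts[window]  # also registers windows with only true negatives
        if predicted and actual:
            c["tp"] += 1
        elif predicted:
            c["fp"] += 1
        elif actual:
            c["fn"] += 1

    metrics = {}
    for window, c in counts.items():
        predicted_positive = c["tp"] + c["fp"]
        actual_positive = c["tp"] + c["fn"]
        precision = c["tp"] / predicted_positive if predicted_positive else None
        recall = c["tp"] / actual_positive if actual_positive else None
        metrics[window] = (precision, recall)
    return metrics


# Hypothetical labeled predictions from two consecutive windows.
records = [
    ("week-1", True, True),
    ("week-1", True, True),
    ("week-1", False, False),
    ("week-2", True, False),
    ("week-2", True, True),
    ("week-2", False, True),
]
metrics = precision_recall_by_window(records)
print(metrics)
```

In this toy data, both metrics fall from 1.0 in the first window to 0.5 in the second, which is the kind of signal that could be used to prioritize data collection and retraining.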

ABOUT THE SPEAKER: Ivan Aguilar is a data scientist at Teleskope focused on building scalable models for detecting PII/PHI/Secrets and other compliance-related entities within customers' clouds. Prior to joining Teleskope, Ivan was an ML Engineer at Forge.AI, a Boston-based shop working on information extraction, content extraction, and other NLP-related tasks.

ABOUT DATA COUNCIL: Data Council (https://www.datacouncil.ai/) is a community and conference series that provides data professionals with the learning and networking opportunities they need to grow their careers.

Make sure to subscribe to our channel for the most up-to-date talks from technical professionals on data related topics including data infrastructure, data engineering, ML systems, analytics and AI from top startups and tech companies.

FOLLOW DATA COUNCIL: Twitter: https://twitter.com/DataCouncilAI LinkedIn: https://www.linkedin.com/company/datacouncil-ai/

Everything I Know About Data Science I Learned from Model Railroading | Near

ABOUT THE TALK: Data scientists build models of the real world using 1s and 0s. Model Railroaders build models of the real world using plastic and metal. In the end, they’re both models and Model Railroaders have been at it way longer than we DS have. Let’s look at parallel concepts like overfitting versus the 10 foot rule, synthetic data versus prototype freelancing, or assumptions versus modeler’s license and see what lessons from other realms of model building we can bring home to DS.

ABOUT THE SPEAKER: Peter Lenz (he, him) is a Geographer and Data Scientist who combines a deep domain expertise in geoinformatics and economic geography with technical skills in programming, machine learning, NLP, among others. Peter is working to create 'Big Social Science'.

Generative AI for Search | Tonita

ABOUT THE TALK: D. Sivakumar discusses the evolving -- and immensely powerful -- role that generative AI methods, especially in NLP and Vision, play in Search, broadly construed. Through a number of anecdotes and organizing principles, he highlights a handful of key challenges and promising directions.

ABOUT THE SPEAKER: D. Sivakumar (Siva) is co-founder and CEO of Tonita.co, whose mission is to bring fluent natural-language search to every search box on the Web. Prior to founding Tonita in 2021, he worked in the research organizations at Google, Yahoo!, and IBM. His research has spanned algorithms and complexity, web search, and deep learning.

How ML, NLP, & 5 mins of Playtime Help Parents, Caregivers, & Children Enjoy Life Together

ABOUT THE TALK: One in five kids has a mental or behavioral disorder, but only 15% have access to care, and the current supply of trained therapists barely covers that demand. Happypillar is a digital therapeutic app that provides evidence-proven behavioral intervention to all at scale. Learn how we combine ML, ASR, NLP, and other technologies with the expertise of our founding clinical play therapist to offer accurate and real-time personalized feedback, all with compliant security processes and the strictest privacy controls.

ABOUT THE SPEAKER: Mady Mantha is a Product and ML Engineering Leader and the Co-Founder & CTO at Happypillar. As a Director of Conversational AI at Sirius, Mady led the team that built Walmart’s conversational AI.

Building a better world with AI, one architectural drawing at a time | mbue

ABOUT THE TALK: Mbue uses advanced computer vision and NLP technologies to read and understand architectural and technical drawings, catch flaws and other mistakes that cause delays and costly fixes, and ultimately automate the quality control process. Learn how they approach this complex problem, what they are doing to solve it, and where they are going next.

ABOUT THE SPEAKERS: Jean-Pierre Trou is the CEO and co-founder of mbue, a SaaS AI-First company focused on saving time, money and reducing liability for Architecture, Engineering and Construction (AEC) companies with automated quality control tools. mbue's web-based application utilizes Artificial Intelligence to instantly review technical drawings. Think “autocorrect” for construction documents. Jean-Pierre is also the Founding Principal of Runa Workshop, an architecture and interior design firm, and Founding Partner at Vaast, a real estate company, all based in Austin, Texas.

Ron Green is a serial tech entrepreneur and expert in artificial intelligence. Ron is co-founder and CTO of mbue and also co-founded KUNGFU.AI, an AI consultancy that helps companies build and deploy AI and machine learning solutions. Prior to KUNGFU.AI, Ron was CEO and founder of Thrive Technologies (acquired by CLOUD), ran software development at Ziften Technologies, Powered (acquired by Dachis Group), and Visible Genetics (acquired by Bayer).

Building an ML Experimentation Platform for Easy Reproducibility | Treeverse

ABOUT THE TALK: Quality ML at scale is only possible when we can reproduce a specific iteration of the ML experiment–and this is where data is key.

In this talk, you will learn how to use a data versioning engine to intuitively and easily version your ML experiments and reproduce any specific iteration of the experiment.

This talk will demo through a live code example:
- Creating a basic ML experimentation framework with lakeFS (on Jupyter notebook)
- Reproducing ML components from a specific iteration of an experiment
- Building intuitive, zero-maintenance experiments infrastructure
- All with common data engineering stacks & open source tooling.

ABOUT THE SPEAKER: Vino Duraisamy is a developer advocate at lakeFS, an open-source platform that delivers a Git-like experience to object-store-based data lakes. She has previously worked at NetApp (on data management applications for NetApp data centers) and on data teams at Nike and Apple, where she worked mainly on batch processing workloads as a data engineer, built custom NLP models as an ML engineer and even touched upon MLOps a bit for model deployments.

We talked about:

- Johannes’s background
- Johannes’s Open Source Spotlight demos – Refinery and Bricks
- The difficulties of working with natural language processing (NLP)
- Incorporating ChatGPT into a process as a heuristic
- What is Bricks?
- The process of starting a startup – Kern
- Making the decision to go with open source
- Pros and cons of launching as open source
- Kern’s business model
- Working with enterprises
- Johannes as a salesperson
- The team at Kern
- Johannes’s role at Kern
- How Johannes and Henrik separate responsibilities at Kern
- Working with very niche use cases
- The short story of how Kern got its funding
- Johannes’s resource recommendation

Links:

- Refinery's GitHub repo: https://github.com/code-kern-ai/refinery
- Bricks' GitHub repo: https://github.com/code-kern-ai/bricks
- Bricks Open Source Spotlight demo: https://www.youtube.com/watch?v=r3rXzoLQy2U
- Refinery Open Source Spotlight demo: https://www.youtube.com/watch?v=LlMhN2f7YDg
- Discord: https://discord.com/invite/qf4rGCEphW
- Kern's website: https://www.kern.ai

Free data engineering course: https://github.com/DataTalksClub/data-engineering-zoomcamp

Join DataTalks.Club: https://datatalks.club/slack.html

Our events: https://datatalks.club/events.html