By introducing a range of AI-enhanced products that amplify creativity and interactivity across our platforms, Buzzfeed has been able to connect with the largest global audience of young people online to cement its role as the defining digital media company of the AI era. Notably, some of Buzzfeed's most successful tools and content experiences thrive on the power of small, focused datasets. Still wondering how Shrek fits into the picture? You'll have to watch!
Video from: https://smalldatasf.com/
📓 Resources
Big Data is Dead: https://motherduck.com/blog/big-data-...
Small Data Manifesto: https://motherduck.com/blog/small-dat...
Why Small Data?: https://benn.substack.com/p/is-excel-...
Small Data SF: https://www.smalldatasf.com/
➡️ Follow Us
LinkedIn: / motherduck
X/Twitter : / motherduck
Bluesky: motherduck.com
Blog: https://motherduck.com/blog/
Discover how BuzzFeed's Data team, led by Gilad Cohen, harnesses AI for creative purposes, leveraging large language models (LLMs) and generative image capabilities to enhance content creation. This video explores how machine learning teams build tools to create new interactive media experiences, focusing on augmenting creative workflows rather than replacing jobs, allowing readers to participate more deeply in the content they consume.
We dive into the core data science problem of understanding what a piece of content is about, a crucial step for improving content recommendation systems. Learn why traditional methods fall short and how the team is constantly seeking smaller, faster, and more performant models. This exploration covers the evolution from earlier architectures like DistilBERT to modern, more efficient approaches for better content representation, clustering, and user personalization.
A key technique explored is the use of text embeddings, which are dense, low-dimensional vector representations of data. This video provides an accessible explanation of embeddings as a form of compressed knowledge, showing how BuzzFeed creates a unique vector for each article. This allows for simple vector math to find semantically similar content, forming a foundational infrastructure for powerful ranking and recommender systems.
Explore how BuzzFeed leverages generative image capabilities to create new interactive formats. The journey began with Midjourney experiments and evolved to building custom tools by fine-tuning a Stable Diffusion XL model using LORA (Low-Rank Approximation). This advanced technique provides greater control over image output, enabling the rapid creation of viral AI generators that respond to trending topics and allow for massive user engagement.
Finally, see a practical application of machine learning for content optimization. BuzzFeed uses its vast historical dataset from Bayesian A/B testing to train a model that predicts headline performance. By generating multiple headline candidates with an LLM like Claude and running them through this predictive model, they can identify the winning headline. This showcases how to use unique, in-house data to build powerful tools that improve click-through rates and drive engagement, pointing to a significant transformation in how media is created and consumed.