talk-data.com

Topic

Big Data

data_processing analytics large_datasets

1217 tagged

Activity Trend

28 peak/qtr (2020-Q1 to 2026-Q1)

Activities

1217 activities · Newest first

Data leaders today face a familiar challenge: complex pipelines, duplicated systems, and spiraling infrastructure costs. Standardizing around Kafka for real-time and Iceberg for large-scale analytics has gone some way towards addressing this but still requires separate stacks, leaving teams to stitch them together at high expense and risk.

This talk will explore how Kafka and Iceberg together form a new foundation for data infrastructure. One that unifies streaming and analytics into a single, cost-efficient layer. By standardizing on these open technologies, organizations can reduce data duplication, simplify governance, and unlock both instant insights and long-term value from the same platform.

You will come away with a clear understanding of why this convergence is reshaping the industry, how it lowers operational risk, and the advantages it offers for building durable, future-proof data capabilities.
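As a rough illustration of the pattern described above, the hedged sketch below uses Spark Structured Streaming to read events from a Kafka topic and append them to an Iceberg table. The broker address, topic, catalog, and table names are hypothetical placeholders, not details from the talk.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Assumes a Spark session configured with the Iceberg runtime and a catalog
# named "lake"; all names below are illustrative.
spark = SparkSession.builder.appName("kafka-to-iceberg-sketch").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "orders")                     # hypothetical topic
    .load()
    .select(col("key").cast("string"), col("value").cast("string"), "timestamp")
)

query = (
    events.writeStream.format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints/orders")
    .toTable("lake.analytics.orders_raw")              # hypothetical table
)
query.awaitTermination()
```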

In an era where data complexity and scale challenge every organization, manual intervention can no longer keep pace. Prizm by DQLabs redefines the paradigm—offering a no-touch, agentic data platform that seamlessly integrates Data Quality, Observability, and Semantic Intelligence into one self-learning, self-optimizing ecosystem.

Unlike legacy systems, Prizm is AI-native and agentic by design, built from the ground up around a network of intelligent, role-driven agents that observe, recommend, act, and learn in concert to deliver continuous, autonomous data trust.

Join us at Big Data London to discover how Prizm’s agent-driven anomaly detection, data quality enforcement, and deep semantic analysis set a new industry standard—shifting data and AI trust from an operational burden to a competitive advantage that powers actionable, insight-driven outcomes.

Face To Face
by Aaron Baker (Multiverse), Jane Crowe (UK Ministry of Defence), Kash Nejad (Multiverse)

Multiverse is proud to host the Ministry of Defence (MOD) on stage at Big Data LDN to discuss their pioneering partnership focused on building data skills and capabilities across the defence sector. As organisations worldwide navigate the transformative potential of AI and advanced analytics, investing in staff development has become a strategic imperative. This partnership is already making tangible impact: over 250 MOD employees are currently enrolled in upskilling programmes designed to strengthen data literacy, enhance analytical capabilities, and embed a culture of continuous learning. The initiative equips personnel to leverage data effectively, driving smarter decision-making and supporting the MOD’s ongoing Strategic Defence Reform agenda.

Speakers will share insights into how targeted learning interventions and personalised development pathways can accelerate organisational capability while delivering measurable outcomes. Attendees will hear first-hand how the collaboration between Multiverse and the MOD has delivered early successes, fostered a growth mindset among staff, and positioned the MOD to scale these programmes far beyond their current reach. This session offers a unique opportunity for leaders and practitioners alike to explore the intersection of talent investment, AI adoption, and data-driven transformation, demonstrating how strategic upskilling can future-proof organisations in an increasingly complex data landscape.

In this short presentation, Big Data LDN Conference Chairman and Europe’s leading IT Industry Analyst in Data Management and Analytics, Mike Ferguson, will welcome everyone to Big Data LDN 2025. He will also summarise where companies are in data, analytics and AI in 2025, what the key challenges and trends are, how these trends are impacting how companies build a data-driven enterprise, and where you can find out more about these topics at the show.

It’s now over six years since the publication of the paper “How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh” by Zhamak Dehghani, which had a major impact on the data and analytics industry.

It highlighted major data architecture failures and called for a rethink of data architecture and data provisioning: creating a data supply chain and democratising data engineering so that business domains can build reusable data products and make them available as self-governing services.
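To make the idea of a domain-owned, self-governing data product more concrete, here is a minimal descriptor sketched in Python. The fields and values are illustrative assumptions, not part of the original paper or this session.

```python
# A hypothetical, minimal "data product" descriptor: the owning domain team
# publishes it alongside the data, so consumers can discover the contract,
# owner, and access policy without going through a central data team.
order_events_product = {
    "name": "order_events",                      # illustrative product name
    "domain": "e-commerce",                      # owning business domain
    "owner": "orders-team@example.com",          # accountable domain team
    "output_ports": [
        {"type": "table", "location": "lake.orders.order_events_v1"},
    ],
    "schema": {"order_id": "string", "amount": "decimal(10,2)", "ts": "timestamp"},
    "slo": {"freshness_minutes": 15, "completeness_pct": 99.5},
    "access_policy": "pii-masked, row-filtered by region",
}
```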

Since then, we have seen many companies adopt Data Mesh strategies, the repositioning of some software products, and the emergence of new ones that emphasise democratisation. But does what has happened since fully address the problems Data Mesh was intended to solve? And what new problems are arising as organisations try to make data safely available to AI projects at machine scale?

In this unmissable session, Big Data LDN Chair Mike Ferguson sits down with Zhamak Dehghani to talk about what has happened since Data Mesh emerged. The session will look at:

● The drivers behind Data Mesh

● Revisiting Data Mesh to clarify what a data product is and what Data Mesh is intended to solve

● Did data architecture really change or are companies still using existing architecture to implement this?

● What about technology to support this: is Data Fabric the answer, or best-of-breed tools?

● How critical is organisation to successful Data Mesh implementation?

● Roadblocks in the way of success, e.g. a lack of metadata standards

● How does Data Mesh impact AI?

● What’s next on the horizon?

Face To Face
by Mike Ferguson (Big Data LDN)

A welcome address to the first edition of Data Driven LDN, from Big Data LDN Chair, Mike Ferguson.

In this brief introduction, Mike will outline the agenda for the day, and introduce the key topics that will be discussed.

Advances in Artificial Intelligence Applications in Industrial and Systems Engineering

Comprehensive guide offering actionable strategies for enhancing human-centered AI, efficiency, and productivity in industrial and systems engineering through the power of AI.

Advances in Artificial Intelligence Applications in Industrial and Systems Engineering is the first book in the Advances in Industrial and Systems Engineering series, offering insights into AI techniques, challenges, and applications across various industrial and systems engineering (ISE) domains. Not only does the book chart current AI trends and tools for effective integration, but it also raises pivotal ethical concerns and explores the latest methodologies, tools, and real-world examples relevant to today’s dynamic ISE landscape. Readers will gain a practical toolkit for effective integration and utilization of AI in system design and operation. The book also presents the current state of AI across big data analytics, machine learning, artificial intelligence tools, cloud-based AI applications, neural-based technologies, modeling and simulation in the metaverse, intelligent systems engineering, and more, and discusses future trends.

Written by renowned international contributors for an international audience, Advances in Artificial Intelligence Applications in Industrial and Systems Engineering includes information on:

● Reinforcement learning, computer vision and perception, and safety considerations for autonomous systems (AS)

● Natural language processing (NLP) topics including language understanding and generation, sentiment analysis and text classification, and machine translation

● AI in healthcare, covering medical imaging and diagnostics, drug discovery and personalized medicine, and patient monitoring and predictive analysis

● Cybersecurity, covering threat detection and intrusion prevention, fraud detection and risk management, and network security

● Social good applications including poverty alleviation and education, environmental sustainability, and disaster response and humanitarian aid

Advances in Artificial Intelligence Applications in Industrial and Systems Engineering is a timely, essential reference for engineering, computer science, and business professionals worldwide.

Bribus develops, produces and installs sustainable, affordable kitchens for the business market, delivering more than 60,000 kitchens a year. At the Big Data Expo, CEO Wim Diersen will share how Bribus uses data for innovation and customer focus. With the Domo platform and partner Metriqzz, Bribus quickly gained a grip on its operational and commercial data. A presentation full of practical examples, challenges, insights and concrete results.

The Definitive Guide to OpenSearch

Learn how to harness the power of OpenSearch effectively with 'The Definitive Guide to OpenSearch'. This book explores installation, configuration, query building, and visualization, guiding readers through practical use cases and real-world implementations. Whether you're building search experiences or analyzing data patterns, this guide equips you thoroughly.

What this book will help me do

● Understand core OpenSearch principles, architecture, and the mechanics of its search and analytics capabilities.
● Learn how to perform data ingestion, execute advanced queries, and produce insightful visualizations on OpenSearch Dashboards.
● Implement scaling strategies and optimum configurations for high-performance OpenSearch clusters.
● Explore real-world case studies that demonstrate OpenSearch applications in diverse industries.
● Gain hands-on experience through practical exercises and tutorials for mastering OpenSearch functionality.

Author(s)

Jon Handler, Soujanya Konka, and Prashant Agrawal, celebrated experts in search technologies and big data analysis, bring their years of experience at AWS and other domains to this book. Their collective expertise ensures that readers receive both core theoretical knowledge and practical applications to implement directly.

Who is it for?

This book is aimed at developers, data professionals, engineers, and systems operators who work with search systems or analytics platforms. It is especially suitable for individuals in roles handling large-scale data who want to improve their skills or deploy OpenSearch in production environments. Early learners and seasoned experts alike will find valuable insights.
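As a taste of the kind of query building the book covers, here is a minimal sketch using the opensearch-py client. The host, index name, and document fields are assumptions made for illustration, not examples taken from the book.

```python
from opensearchpy import OpenSearch

# Connect to a local, hypothetical single-node cluster (adjust host/auth as needed).
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}], use_ssl=False)

# Index a document into an illustrative "talks" index.
client.index(
    index="talks",
    id="1",
    body={"title": "Scaling search", "topic": "big data"},
    refresh=True,
)

# Run a simple full-text match query and print the hits.
response = client.search(index="talks", body={"query": {"match": {"title": "search"}}})
for hit in response["hits"]["hits"]:
    print(hit["_source"]["title"], hit["_score"])
```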

podcast_episode
by Jack Van Horn (University of Virginia), Lulu Jiang (University of Virginia), Thomas Hartung (Johns Hopkins University)

In this episode, we’re diving into a fascinating intersection of cutting-edge science and data innovation. As technology continues to evolve, researchers are increasingly turning to brain organoids (miniature, lab-grown models of the human brain) to unravel some of the most complex mysteries of neuroscience. We’re joined by three brain organoid experts: Thomas Hartung, Professor of Environmental Health and Engineering at Johns Hopkins University; Jack Van Horn, Professor of Data Science and Psychology at the University of Virginia; and Lulu Jiang, Assistant Professor of Neuroscience, also at the University of Virginia. Together, they’ll shed light on how brain organoid technology is reshaping our understanding of the brain, and how data science is playing a crucial role in unlocking its secrets.

Chapters
(00:00:51) - Brain Organizations
(00:05:54) - Brain Organoids for drug discovery and immunology
(00:13:53) - Alzheimer's disease in the organoid system
(00:15:49) - What are the standards in the field of brain organoids?
(00:22:44) - Big Data and Intelligence in the Brain
(00:26:50) - Alzheimer's disease, the human brain
(00:30:39) - The computational twin of the brain
(00:37:23) - The quest for precision medicine in the brain
(00:42:17) - The human brain in an organoid
(00:43:21) - Will Brain Derived Organoids Replace Animal Models in Neurodegener

Summary: In this episode of the Data Engineering Podcast, host Tobias Macey talks with Kacper Łukawski from Qdrant about integrating MCP servers with vector databases to process unstructured data. Kacper shares his experience in data engineering, from building big data pipelines in the automotive industry to leveraging large language models (LLMs) for transforming unstructured datasets into valuable assets. He discusses the challenges of building data pipelines for unstructured data and how vector databases facilitate semantic search and retrieval-augmented generation (RAG) applications. Kacper delves into the intricacies of vector storage and search, including metadata and contextual elements, and explores the evolution of vector engines beyond RAG to applications like semantic search and anomaly detection. The conversation covers the role of Model Context Protocol (MCP) servers in simplifying data integration and retrieval processes, highlighting the need for experimentation and evaluation when adopting LLMs, and offering practical advice on optimizing vector search costs and fine-tuning embedding models for improved search quality.
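To ground the vector-database side of the conversation, here is a minimal, hedged sketch of storing and searching embeddings with the qdrant-client Python library. The collection name, vector size, and embedding values are toy placeholders, not anything discussed in the episode.

```python
from qdrant_client import QdrantClient, models

# In-memory instance for experimentation; a real deployment would point at a
# Qdrant server or cluster instead.
client = QdrantClient(":memory:")

client.create_collection(
    collection_name="docs",  # illustrative collection name
    vectors_config=models.VectorParams(size=4, distance=models.Distance.COSINE),
)

# Toy 4-dimensional "embeddings"; in practice these come from an embedding model.
client.upsert(
    collection_name="docs",
    points=[
        models.PointStruct(id=1, vector=[0.1, 0.2, 0.3, 0.4], payload={"text": "kafka pipelines"}),
        models.PointStruct(id=2, vector=[0.9, 0.1, 0.0, 0.2], payload={"text": "vector search"}),
    ],
)

# Nearest-neighbour search against the stored points.
hits = client.search(collection_name="docs", query_vector=[0.85, 0.1, 0.05, 0.2], limit=1)
print(hits[0].payload["text"], hits[0].score)
```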

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.

Your host is Tobias Macey and today I'm interviewing Kacper Łukawski about how MCP servers can be paired with vector databases to streamline processing of unstructured data.

Interview

● Introduction
● How did you get involved in the area of data management?
● LLMs are enabling the derivation of useful data assets from unstructured sources. What are the challenges that teams face in building the pipelines to support that work?
● How has the role of vector engines grown or evolved in the past ~2 years as LLMs have gained broader adoption?
● Beyond its role as a store of context for agents, RAG, etc., what other applications are common for vector databases?
● In the ecosystem of vector engines, what are the distinctive elements of Qdrant?
● How has the MCP specification simplified the work of processing unstructured data?
● Can you describe the toolchain and workflow involved in building a data pipeline that leverages an MCP for generating embeddings?
● Helping data engineers gain confidence in non-deterministic workflows
● Bringing application/ML/data teams into collaboration for determining the impact of e.g. chunking strategies, embedding model selection, etc.
● What are the most interesting, innovative, or unexpected ways that you have seen MCP and Qdrant used?
● What are the most interesting, unexpected, or challenging lessons that you have learned while working on vector use cases?
● When is MCP and/or Qdrant the wrong choice?
● What do you have planned for the future of MCP with Qdrant?

Contact Info

● LinkedIn
● Twitter/X
● Personal website

Parting Question

● From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links

● Qdrant
● Kafka
● Apache Oozie
● Named Entity Recognition
● GraphRAG
● pgvector
● Elasticsearch
● Apache Lucene
● OpenSearch
● BM25
● Semantic Search
● MCP == Model Context Protocol
● Anthropic Contextualized Chunking
● Cohere

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

This tutorial will explore GPU-accelerated clustering techniques using RAPIDS cuML, optimizing algorithms like K-Means, DBSCAN, and HDBSCAN for large datasets. Traditional clustering methods struggle with scalability, but GPU acceleration significantly enhances performance and efficiency.

Participants will learn to leverage dimensionality reduction techniques (PCA, T-SNE, UMAP) for better data visualization and apply hyperparameter tuning with Optuna and cuML. The session also includes real-world applications like topic modeling in NLP and customer segmentation. By the end, attendees will be equipped to implement, optimize, and scale clustering algorithms effectively, unlocking faster and more powerful insights in machine learning workflows.
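By way of illustration, the sketch below clusters a synthetic GPU-resident dataset with cuML's K-Means after a PCA projection. The dataset shape, cluster count, and other parameters are arbitrary assumptions, not values from the tutorial.

```python
import cupy as cp
from cuml.decomposition import PCA
from cuml.cluster import KMeans

# Synthetic data generated directly on the GPU (100k rows, 32 features).
X = cp.random.random((100_000, 32)).astype(cp.float32)

# Reduce to 8 dimensions before clustering (purely illustrative choice).
X_reduced = PCA(n_components=8).fit_transform(X)

# GPU-accelerated K-Means; n_clusters is a placeholder a real workflow would tune.
kmeans = KMeans(n_clusters=10, random_state=0).fit(X_reduced)
print(kmeans.labels_[:10], kmeans.cluster_centers_.shape)
```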

Ed Digby will share lessons from building a data platform that processes 17 billion rows of transaction data across 12,000 UK convenience stores, focusing on delivering fast, simple, actionable insights while managing startup constraints and costs. The talk covers cutting through complexity, focusing on what matters, and building data systems that deliver real value.

Daft and Unity Catalog: A Multimodal/AI-Native Lakehouse

Modern data organizations have moved beyond big data analytics to also incorporate advanced AI/ML data workloads. These workflows often involve multimodal datasets containing documents, images, long-form text, embeddings, URLs and more. Unity Catalog is an ideal solution for organizing and governing this data at scale. When paired with the Daft open source data engine, you can build a truly multimodal, AI-ready data lakehouse. In this session, we’ll explore how Daft integrates with Unity Catalog’s core features (such as volumes and functions) to enable efficient, AI-driven data lakehouses. You will learn how to ingest and process multimodal data (images, text and videos), run AI/ML transformations and feature extractions at scale, and maintain full control and visibility over your data with Unity Catalog’s fine-grained governance.
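As a rough sketch of what such a pipeline can look like with the Daft Python API, the example below reads an illustrative Parquet table of image URLs, downloads and decodes the images, and previews the result. The path and column names are assumptions; in a governed setup the plain read_parquet call would be replaced by loading the table through Daft's Unity Catalog integration.

```python
import daft

# Hypothetical table of metadata plus image URLs; column names are assumed.
df = daft.read_parquet("s3://example-bucket/product_images.parquet")

df = (
    df.with_column("image_bytes", daft.col("image_url").url.download())  # fetch each URL
      .with_column("image", daft.col("image_bytes").image.decode())      # decode into an image column
)

# Preview a few rows of the multimodal DataFrame.
df.show(5)
```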

Scaling Data Engineering Pipelines: Preparing Credit Card Transactions Data for Machine Learning

We discuss two real-world use cases in big data engineering, focusing on constructing stable pipelines and managing storage at a petabyte scale. The first use case highlights the implementation of Delta Lake to optimize data pipelines, resulting in an 80% reduction in query time and a 70% reduction in storage space. The second use case demonstrates the effectiveness of the Workflows ‘ForEach’ operator in executing compute-intensive pipelines across multiple clusters, significantly reducing processing time from months to days. This approach involves a reusable design pattern that isolates notebooks into units of work, enabling data scientists to independently test and develop.
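For context on the Delta Lake piece, here is a hedged PySpark sketch of writing a transactions table as Delta and then compacting and Z-ordering it. The paths, column names, and partitioning are illustrative assumptions, not the presenters' actual pipeline.

```python
from pyspark.sql import SparkSession

# Assumes a Spark session with the Delta Lake extensions configured.
spark = SparkSession.builder.appName("delta-sketch").getOrCreate()

# Hypothetical source with assumed columns tx_date and card_id.
transactions = spark.read.parquet("/data/raw/transactions")

# Land the data as a Delta table (partitioning choice is illustrative).
(transactions.write.format("delta")
    .mode("overwrite")
    .partitionBy("tx_date")
    .save("/data/delta/transactions"))

# Compact small files and cluster by a frequently filtered column
# (OPTIMIZE ... ZORDER BY requires Delta Lake 2.0+ or Databricks).
spark.sql("OPTIMIZE delta.`/data/delta/transactions` ZORDER BY (card_id)")
```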

lightning_talk
by DB Tsai (Databricks), Jules S. Damji (Anyscale Inc), Allison Wang (Databricks)

Join us for an interactive Ask Me Anything (AMA) session on the latest advancements in Apache Spark 4, including Spark Connect — the new client-server architecture enabling seamless integration with IDEs, notebooks and custom applications. Learn about performance improvements, enhanced APIs and best practices for leveraging Spark’s next-generation features. Whether you're a data engineer, Spark developer or big data enthusiast, bring your questions on architecture, real-world use cases and how these innovations can optimize your workflows. Don’t miss this chance to dive deep into the future of distributed computing with Spark!
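If you want to try Spark Connect before the session, a minimal sketch looks like the following; the connect endpoint is a placeholder and assumes a Spark 4 Connect server is already running.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Connect to a remote Spark Connect server (placeholder endpoint).
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# The DataFrame API is unchanged; work executes on the remote cluster.
df = (spark.range(1_000_000)
      .withColumn("bucket", F.col("id") % 10)
      .groupBy("bucket")
      .count())
df.show()
```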

How FedEx Achieved Self-Serve Analytics and Data Democratization on Databricks

FedEx, a global leader in transportation and logistics, faced a common challenge in the era of big data: how to democratize data and foster data-driven decision making with thousands of data practitioners at FedEx wanting to build models, get real-time insights, explore enterprise data, and build enterprise-grade solutions to run the business. This breakout session will highlight how FedEx overcame challenges in data governance and security using Unity Catalog, ensuring that sensitive information remains protected while still allowing appropriate access across the organization. We'll share their approach to building intuitive self-service interfaces, including the use of natural-language processing to enable non-technical users to query data effortlessly. The tangible outcomes of this initiative are numerous, but chiefly: increased data literacy across the company, faster time-to-insight for business decisions, and significant cost-savings through improved operational efficiency.
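As a flavour of the Unity Catalog governance model mentioned here, the hedged sketch below grants a group read access to one table while leaving everything else locked down. The catalog, schema, table, and group names are invented for illustration, not FedEx's actual setup.

```python
# Runs where `spark` is already defined (e.g., a Databricks notebook or job)
# and Unity Catalog is enabled; all object names below are hypothetical.
spark.sql("GRANT USE CATALOG ON CATALOG main TO `data_analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.logistics TO `data_analysts`")
spark.sql("GRANT SELECT ON TABLE main.logistics.shipments TO `data_analysts`")
```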

Bridging Big Data and AI: Empowering PySpark With Lance Format for Multi-Modal AI Data Pipelines

PySpark has long been a cornerstone of big data processing, excelling in data preparation, analytics and machine learning tasks within traditional data lakes. However, the rise of multimodal AI and vector search introduces challenges beyond its capabilities. Spark’s new Python data source API enables integration with emerging AI data lakes built on the multi-modal Lance format. Lance delivers unparalleled value with its zero-copy schema evolution capability and robust support for large record-size data (e.g., images, tensors, embeddings, etc), simplifying multimodal data storage. Its advanced indexing for semantic and full-text search, combined with rapid random access, enables high-performance AI data analytics to the level of SQL. By unifying PySpark's robust processing capabilities with Lance's AI-optimized storage, data engineers and scientists can efficiently manage and analyze the diverse data types required for cutting-edge AI applications within a familiar big data framework.
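The sketch below shows the rough shape of such an integration using Spark 4's Python Data Source API together with the pylance package. The format name, schema, and dataset path are hypothetical, and a production connector would handle partitioning and richer multimodal types.

```python
import lance
from pyspark.sql.datasource import DataSource, DataSourceReader

class LanceSketchDataSource(DataSource):
    """Minimal, read-only Python data source over a Lance dataset (illustrative)."""

    @classmethod
    def name(cls):
        return "lance_sketch"             # hypothetical format name

    def schema(self):
        return "id int, caption string"   # assumed columns for this sketch

    def reader(self, schema):
        return LanceSketchReader(self.options["path"])

class LanceSketchReader(DataSourceReader):
    def __init__(self, path):
        self.path = path

    def read(self, partition):
        # Single-partition read for simplicity; yields one tuple per row.
        table = lance.dataset(self.path).to_table()
        for row in table.to_pylist():
            yield (row["id"], row["caption"])

# Assumes an existing SparkSession named `spark` (e.g., in a notebook).
spark.dataSource.register(LanceSketchDataSource)
df = spark.read.format("lance_sketch").option("path", "/data/captions.lance").load()
df.show(5)
```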

Leveraging GenAI for Synthetic Data Generation to Improve Spark Testing and Performance in Big Data

Testing Spark jobs in local environments is often difficult due to the lack of suitable datasets, especially under tight timelines. This creates challenges when jobs work in development clusters but fail in production, or when they run locally but encounter issues in staging clusters due to inadequate documentation or checks. In this session, we’ll discuss how these challenges can be overcome by leveraging Generative AI to create custom synthetic datasets for local testing. By incorporating variations and sampling, a testing framework can be introduced to solve some of these challenges, allowing for the generation of realistic data to aid in performance and load testing. We’ll show how this approach helps identify performance bottlenecks early, optimize job performance and recognize scalability issues while keeping costs low. This methodology fosters better deployment practices and enhances the reliability of Spark jobs across environments.
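A stripped-down version of the idea might look like the sketch below, where a simple generator function stands in for the GenAI step that would propose schema-aware values. The schema, row count, and aggregation are assumptions made purely for illustration.

```python
import random
import string
from pyspark.sql import SparkSession

def synthetic_transactions(n):
    # Stand-in for an LLM-driven, schema-aware sampler: in the approach described
    # above, a GenAI model would propose realistic values and distributions
    # instead of uniform random ones.
    for i in range(n):
        yield (
            i,
            "".join(random.choices(string.ascii_uppercase, k=6)),  # merchant code
            round(random.uniform(1.0, 500.0), 2),                   # amount
        )

spark = SparkSession.builder.master("local[*]").appName("spark-test-sketch").getOrCreate()

df = spark.createDataFrame(
    list(synthetic_transactions(100_000)),
    schema="tx_id long, merchant string, amount double",
)

# Exercise the job logic locally against the synthetic data.
df.groupBy("merchant").sum("amount").show(5)
```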