talk-data.com

Topic: Big Data

Tags: data_processing · analytics · large_datasets

1217 tagged activities

Activity Trend: 2020-Q1 to 2026-Q1, peaking at 28 activities per quarter

Activities

1217 activities · Newest first

Think Inside the Box: Constraints Drive Data Warehousing Innovation

As a Head of Data or a one-person data team, keeping the lights on for the business while running all things data-related as efficiently as possible is no small feat. This talk will focus on tactics and strategies to manage within and around constraints, including monetary costs, time and resources, and data volumes.

📓 Resources
Big Data is Dead: https://motherduck.com/blog/big-data-...
Small Data Manifesto: https://motherduck.com/blog/small-dat...
Why Small Data?: https://benn.substack.com/p/is-excel-...
Small Data SF: https://www.smalldatasf.com/


Learn how your data team can drive innovation and maximize ROI by embracing constraints, drawing inspiration from SpaceX's revolutionary cost-effective approach. This video challenges the "abundance mindset" prevalent in the modern data stack, where easily scalable cloud data warehouses and a surplus of tools often lead to unmanageable data models and underutilized dashboards. We explore a focused data strategy for extracting maximum value from small data, shifting the paradigm from "more data" to more impact.

To maximize value, data teams must move beyond being order-takers and practice strategic stakeholder management. Discover how to use frameworks like the stakeholder engagement matrix to prioritize high-impact business leaders and align your work with core business goals. This involves speaking the language of business growth models, not technical jargon about data pipelines or orchestration, ensuring your data engineering efforts resonate with key decision-makers and directly contribute to revenue-generating activities.

Embracing constraints is key to innovation and effective data project management. We introduce the Iron Triangle—a fundamental engineering concept balancing scope, cost, and time—as a powerful tool for planning data projects and having transparent conversations with the business. By treating constraints not as limitations but as opportunities, data engineers and analysts can deliver higher-quality data products without succumbing to scope creep or uncontrolled costs.

A critical component of this strategy is understanding the Total Cost of Ownership (TCO), which goes far beyond initial compute costs to include ongoing maintenance, downtime, and the risk of vendor pricing changes. Learn how modern, efficient tools like DuckDB and MotherDuck are designed for cost containment from the ground up, enabling teams to build scalable, cost-effective data platforms. By making the true cost of data requests visible, you can foster accountability and make smarter architectural choices. Ultimately, this guide provides a blueprint for resisting data stack bloat and turning cost and constraints into your greatest assets for innovation.
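
As a rough illustration of what "making the true cost of data requests visible" can look like, the back-of-the-envelope sketch below tallies a few monthly cost components for a hypothetical workload. Every line item and figure is an assumption chosen for illustration, not MotherDuck or any other vendor's pricing.

```python
# Illustrative only: a back-of-the-envelope TCO model for one monthly analytics workload.
# All line items and figures are hypothetical assumptions, not vendor pricing.

monthly_costs = {
    "compute": 1_200.00,               # warehouse credits actually consumed
    "storage": 150.00,                 # compressed data at rest
    "orchestration": 300.00,           # scheduler / pipeline tooling
    "maintenance": 20 * 85.00,         # 20 engineer-hours of upkeep at an $85 loaded rate
    "downtime_risk": 0.02 * 10_000.00, # expected outage cost (probability x impact)
}

total = sum(monthly_costs.values())
for item, cost in monthly_costs.items():
    print(f"{item:>13}: ${cost:>8,.2f}  ({cost / total:.0%} of TCO)")
print(f"{'total':>13}: ${total:>8,.2f}")
```

Even with invented numbers, the exercise makes the talk's point concrete: once maintenance effort and downtime risk are counted, the compute bill is only part of the total cost.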

Is BI Too Big for Small Data?

This is a talk about how we thought we had Big Data, and we built everything planning for Big Data, but then it turns out we didn't have Big Data, and while that's nice and fun and seems more chill, it's actually ruining everything, and I am here asking you to please help us figure out what we are supposed to do now.

📓 Resources
Big Data is Dead: https://motherduck.com/blog/big-data-...
Small Data Manifesto: https://motherduck.com/blog/small-dat...
Is Excel Immortal?: https://benn.substack.com/p/is-excel-immortal
Small Data SF: https://www.smalldatasf.com/


Mode founder David Wheeler challenges the data industry's obsession with "big data," arguing that most companies are actually working with "small data," and our tools are failing us. This talk deconstructs the common sales narrative for BI tools, exposing why the promise of finding game-changing insights through data exploration often falls flat. If you've ever built dashboards nobody uses or wondered why your analytics platform doesn't deliver on its promises, this is a must-watch reality check on the modern data stack.

We explore the standard BI demo, where an analyst uncovers a critical insight by drilling into event data. This story sells tools like Tableau and Power BI, but it rarely reflects reality, leading to a "revolving door of BI" as companies swap tools every few years. Discover why the narrative of the intrepid analyst finding a needle in the haystack only works in movies and how this disconnect creates a cycle of failed data initiatives and unused "trashboards."

The presentation traces our belief that "data is the new oil" back to the early 2010s, with examples from Target's predictive analytics and Facebook's growth hacking. However, these successes were built on truly massive datasets. For most businesses, analyzing small data results in noisy charts that offer vague "directional vibes" rather than clear, actionable insights. We contrast the promise of big data analytics with the practical challenges of small data interpretation.

Finally, learn actionable strategies for extracting real value from the data you actually have. We argue that BI tools should shift focus from data exploration to data interpretation, helping users understand what their charts actually mean. Learn why "doing things that don't scale," like manually analyzing individual customer journeys, can be more effective than complex models for small datasets. This talk offers a new perspective for data scientists, analysts, and developers looking for better data analysis techniques beyond the big data hype.

Big Data is Dead: Long Live Hot Data 🔥

Over the last decade, Big Data was everywhere. Let's set the record straight on what is and isn't Big Data. We have been consumed by a conversation about data volumes when we should focus more on the immediate task at hand: Simplifying our work.

Some of us may have Big Data, but our quest to derive insights from it is measured in small slices of work that fit on your laptop or in your hand. Easy data is here— let's make the most of it.

📓 Resources
Big Data is Dead: https://motherduck.com/blog/big-data-is-dead/
Small Data Manifesto: https://motherduck.com/blog/small-data-manifesto/
Small Data SF: https://www.smalldatasf.com/


Explore the "Small Data" movement, a counter-narrative to the prevailing big data conference hype. This talk challenges the assumption that data scale is the most important feature of every workload, defining big data as any dataset too large for a single machine. We'll unpack why this distinction is crucial for modern data engineering and analytics, setting the stage for a new perspective on data architecture.

Delve into the history of big data systems, starting with the non-linear hardware costs that plagued early data practitioners. Discover how Google's foundational papers on GFS, MapReduce, and Bigtable led to the creation of Hadoop, fundamentally changing how we scale data processing. We'll break down the "big data tax"—the inherent latency and system complexity overhead required for distributed systems to function, a critical concept for anyone evaluating data platforms.

Learn about the architectural cornerstone of the modern cloud data warehouse: the separation of storage and compute. This design, popularized by systems like Snowflake and Google BigQuery, allows storage to scale almost infinitely while compute resources are provisioned on-demand. Understand how this model paved the way for massive data lakes but also introduced new complexities and cost considerations that are often overlooked.

We examine the cracks appearing in the big data paradigm, especially for OLAP workloads. While systems like Snowflake are still dominant, the rise of powerful alternatives like DuckDB signals a shift. We reveal the hidden costs of big data analytics, exemplified by a petabyte-scale query costing nearly $6,000, and argue that for most use cases, it's too expensive to run computations over massive datasets.
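
The arithmetic behind a figure like that is straightforward: warehouses that bill per byte scanned charge on the order of a few dollars per terabyte, so a single full scan over a petabyte adds up quickly. The snippet below is a rough sketch using an assumed rate, not any vendor's published price list.

```python
# Rough illustration of full-scan cost under per-TB on-demand pricing.
# The $/TB rate is an assumption for illustration, not a quoted vendor price.

price_per_tb_scanned = 6.00   # assumed on-demand rate, USD per TB scanned
table_size_tb = 1_000         # roughly 1 PB expressed in TB

cost = price_per_tb_scanned * table_size_tb
print(f"Scanning {table_size_tb:,} TB at ${price_per_tb_scanned:.2f}/TB ≈ ${cost:,.0f}")
# -> Scanning 1,000 TB at $6.00/TB ≈ $6,000
```

Run a query like that a few times a day and the motivation for keeping routine analytics off the full dataset becomes obvious.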

The key to efficient data processing isn't your total data size, but the size of your "hot data" or working set. This talk argues that the revenge of the single node is here, as modern hardware can often handle the actual data queried without the overhead of the big data tax. This is a crucial optimization technique for reducing cost and improving performance in any data warehouse.

Discover the core principles for designing systems in a post-big data world. We'll show that since only 1 in 500 users run true big data queries, prioritizing simplicity over premature scaling is key. For low latency, process data close to the user with tools like DuckDB and SQLite. This local-first approach offers a compelling alternative to cloud-centric models, enabling faster, more cost-effective, and innovative data architectures.
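
As a minimal sketch of that local-first, hot-data pattern, the example below uses DuckDB from Python to aggregate only the recent working set of a larger Parquet dataset. The file paths, column names, and 90-day window are hypothetical stand-ins.

```python
# Minimal sketch of a "hot data" workflow with DuckDB (paths and schema are hypothetical).
# Rather than shipping everything to a distributed warehouse, query the recent working
# set in-process; DuckDB reads only the Parquet data the query actually touches.
import duckdb

con = duckdb.connect()  # in-process engine, no cluster to provision

hot = con.execute("""
    SELECT customer_id,
           count(*)        AS orders,
           sum(amount_usd) AS revenue
    FROM read_parquet('data/orders/*.parquet')          -- hypothetical local or synced files
    WHERE order_date >= current_date - INTERVAL 90 DAY  -- the "hot" slice, not all history
    GROUP BY customer_id
    ORDER BY revenue DESC
    LIMIT 20
""").df()

print(hot)
```

Because DuckDB runs in-process, the same query works on a laptop, in a notebook, or embedded in an application, which is exactly the kind of simplicity-first design the talk advocates.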

Intelligent Data Analytics for Bioinformatics and Biomedical Systems

In the fast-changing field of bioinformatics and biomedical systems, combining intelligent data analytics with the intricacies of biological data has become a crucial factor for innovation and growth. Intelligent Data Analytics for Bioinformatics and Biomedical Systems delves into the transformative role of data analytics in bioinformatics and biomedical research. It offers a thorough examination of advanced techniques, methodologies, and applications that use computational intelligence to improve results in the healthcare sector. With the exponential growth of data in these domains, the book explores how computational intelligence and advanced analytic techniques can be harnessed to extract insights, drive informed decisions, and unlock hidden patterns from vast datasets. From genomic analysis to disease diagnostics and personalized medicine, the book showcases intelligent approaches that enable researchers, clinicians, and data scientists to unravel complex biological processes and make significant strides in understanding human health and diseases.

The book is divided into three sections, each focusing on computational intelligence and datasets in biomedical systems. The first section discusses the fundamental concepts of computational intelligence and big data in the context of bioinformatics, emphasizing data mining, pattern recognition, and knowledge discovery for bioinformatics applications. The second section addresses computational intelligence and big data in biomedical systems, discussing how these advanced techniques enable personalized medicine and precision healthcare, with treatment based on individual data and genetic profiles. The final section investigates the challenges and future directions of computational intelligence and big data in bioinformatics and biomedical systems, concluding with discussions of their potential impact on global healthcare challenges.

Audience: The book is primarily targeted at professionals and researchers in bioinformatics, genetics, molecular biology, biomedical engineering, and healthcare. It will also suit academics, students, and professionals working in pharmaceuticals and interpreting biomedical data.

As AI continually changes how businesses operate, new questions emerge around ethics and privacy. Nowadays, algorithms can set prices and personalize offers, but how do companies ensure they’re doing this responsibly? What does it mean to be transparent with customers about data use, and how can businesses avoid unintended bias? Balancing innovation with trust is key, but achieving this balance isn’t always straightforward.

Dr. Jose Mendoza is Academic Director and Clinical Associate Professor in Integrated Marketing at NYU, and was formerly an Associate Professor of Practice at The University of Arizona in Tucson, Arizona. His focus is on consumer pricing, digital retailing, intelligent retail stores, neuromarketing, big data, artificial intelligence, and machine learning. Previously, he taught marketing courses at Sacred Heart University and Western Michigan University. He is also an experienced senior global marketing executive with over 18 years of experience in global marketing alone, plus a career as an Engineer in Information Sciences. Dr. Mendoza is also a Doctoral Researcher in Strategic and Global pricing, Consumer Behavior, and Pricing Research methodologies. He had international roles in Latin America, Europe, and the USA with scope in over 50 countries.

In the episode, Richie and Jose explore AI-driven pricing, consumer perceptions and ethical pricing, the complexity of dynamic pricing models, explainable AI, data privacy and customer trust, legal and ethical guardrails, innovations in dynamic pricing, and much more.

Links Mentioned in the Show:
NYU
Connect with Jose
Amazon Dynamic Pricing Strategy in 2024
Course: AI Ethics
Related Episode: The Future of Marketing Analytics with Cory Munchbach, CEO at BlueConic
Sign up to RADAR: Forward Edition

Apache Spark for Machine Learning

Dive into the power of Apache Spark as a tool for handling and processing the big data required for machine learning. With this book, you will explore how to configure, execute, and deploy machine learning algorithms using Spark's scalable architecture and learn best practices for implementing real-world big data solutions.

What this book will help me do:
• Understand the integration of Apache Spark with large-scale infrastructures for machine learning applications.
• Employ data processing techniques for preprocessing and feature engineering efficiently with Spark.
• Master the implementation of advanced supervised and unsupervised learning algorithms using Spark.
• Learn to deploy machine learning models within Spark ecosystems for optimized performance.
• Discover methods for analyzing big data trends and tuning machine learning models for improved accuracy.

Author(s): The author, Deepak Gowda, is an experienced data scientist with over ten years of expertise in machine learning and big data. His career spans industries such as supply chain and cybersecurity, where he has utilized Apache Spark extensively. Deepak's teaching style is marked by clarity and practicality, making complex concepts approachable.

Who is it for? Apache Spark for Machine Learning is tailored for data engineers, machine learning practitioners, and computer science students looking to advance their ability to process, analyze, and model using large datasets. If you're already familiar with basic machine learning and want to scale your solutions using Spark, this book is ideal for your studies and professional growth.
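
To give a flavor of the kind of workflow the book covers, here is a minimal PySpark pipeline sketch that combines feature engineering with a classifier. The dataset path, column names, and parameters are hypothetical, and the code is our own illustration rather than an excerpt from the book.

```python
# Minimal sketch of a Spark ML pipeline: feature engineering + logistic regression.
# Dataset path, column names, and parameters are hypothetical illustrations.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("spark-ml-sketch").getOrCreate()

df = spark.read.parquet("s3://example-bucket/events.parquet")  # hypothetical dataset
train, test = df.randomSplit([0.8, 0.2], seed=42)

pipeline = Pipeline(stages=[
    StringIndexer(inputCol="country", outputCol="country_idx", handleInvalid="keep"),
    VectorAssembler(inputCols=["country_idx", "sessions", "spend"], outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="churned"),
])

model = pipeline.fit(train)        # fits indexer, assembler, and model together
predictions = model.transform(test)
predictions.select("churned", "prediction", "probability").show(5)

spark.stop()
```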

Building a robust data infrastructure is crucial for any organization looking to leverage AI and data-driven insights. But as your data ecosystem grows, so do the challenges of managing, securing, and scaling it. How do you ensure that your data infrastructure not only meets today’s needs but is also prepared for the rapid changes in technology tomorrow? What strategies can you adopt to keep your organization agile, while ensuring that your data investments continue to deliver value and support business goals?

Saad Siddiqui is a venture capitalist at Titanium Ventures, which focuses on enterprise technology investments, particularly next-generation enterprise infrastructure and applications. In his career, Saad has deployed over $100M in venture capital across more than a dozen companies. In previous roles as a corporate development executive, he executed M&A transactions valued at over $7 billion in aggregate. Prior to Titanium Ventures, he was in corporate development at Informatica and was a member of Cisco's venture investing and acquisitions team covering cloud, big data, and virtualization.

In the episode, Richie and Saad explore the business impacts of data infrastructure, getting started with data infrastructure, the roles and teams you need to get started, scalability and future-proofing, implementation challenges, continuous education and flexibility, automation and modernization, trends in data infrastructure, and much more.

Links Mentioned in the Show:
Titanium Ventures
Connect with Saad
Course - Artificial Intelligence (AI) Strategy
Related Episode: How are Businesses Really Using AI? With Tathagat Varma, Global TechOps Leader at Walmart Global Tech
Rewatch sessions from RADAR: AI Edition

Businesses are collecting more data than ever before. But is bigger always better? Many companies are starting to question whether massive datasets and complex infrastructure are truly delivering results or just adding unnecessary costs and complications. How can you make sure your data strategy is aligned with your actual needs? What if focusing on smaller, more manageable datasets could improve your efficiency and save resources, all while delivering the same insights?

Ryan Boyd is the Co-Founder & VP, Marketing + DevRel at MotherDuck. Ryan started his career as a software engineer but has since led DevRel teams for 15+ years at Google, Databricks, and Neo4j, where he developed and executed numerous marketing and DevRel programs. Prior to MotherDuck, Ryan worked at Databricks, where he focused the team on building an online community during the pandemic: helping to organize the content and experience for an online Data + AI Summit, establishing a regular cadence of video and blog content, launching the Databricks Beacons ambassador program, improving the time to an “aha” moment in the online trial, and launching a University Alliance program to help professors teach the latest in data science, machine learning, and data engineering.

In the episode, Richie and Ryan explore data growth and computation, the data 1%, the small data movement, data storage and usage, the shift to local and hybrid computing, modern data tools, the challenges of big data, transactional vs analytical databases, SQL language enhancements, simple and ergonomic data solutions, and much more.

Links Mentioned in the Show:
MotherDuck
The Small Data Manifesto
Connect with Ryan
Small Data SF conference
Related Episode: Effective Data Engineering with Liya Aizenberg, Director of Data Engineering at Away
Rewatch sessions from RADAR: AI Edition

Will AI completely revolutionize the way we work as data professionals? Or is it overhyped? In this episode, Lindsay Murphy and Colleen Tartow will take opposing viewpoints and help us understand whether or not AI can really live up to all the hype. You'll leave with a deeper understanding of the current state of AI in data, the tech stack needed to run AI, and where things are heading in the future.

What You'll Learn:
• The tech stack required to run AI and how it differs from prior "big data" stacks
• Will AI change everything in data? Or is it overhyped?
• How you should be thinking about AI and its impact on your career

Register for free to be part of the next live session: https://bit.ly/3XB3A8b

About our guests: Lindsay Murphy is the host of the Women Lead Data podcast as well as the Head of Data at Hiive. Follow Lindsay on LinkedIn

Colleen Tartow is an engineering and data leader, author, speaker, advisor, mentor, and DEI advocate.

Data Mesh for Dummies E-Book
Follow Colleen on LinkedIn

Data Engineering Best Practices

Unlock the secrets to building scalable and efficient data architectures with 'Data Engineering Best Practices.' This book provides in-depth guidance on designing, implementing, and optimizing cloud-based data pipelines. You will gain valuable insights into best practices, agile workflows, and future-proof designs.

What this book will help me do:
• Effectively plan and architect scalable data solutions leveraging cloud-first strategies.
• Master agile processes tailored to data engineering for improved project outcomes.
• Implement secure, efficient, and reliable data pipelines optimized for analytics and AI.
• Apply real-world design patterns and avoid common pitfalls in data flow and processing.
• Create future-ready data engineering solutions following industry-proven frameworks.

Author(s): Richard J. Schiller and David Larochelle are seasoned data engineering experts with decades of experience crafting efficient and secure cloud-based infrastructures. Their collaborative writing distills years of real-world expertise into practical advice aimed at helping engineers succeed in a rapidly evolving field.

Who is it for? This book is ideal for data engineers, ETL specialists, and big data professionals seeking to enhance their knowledge of cloud-based solutions. Some familiarity with data engineering, ETL pipelines, and big data technologies is helpful. It suits those keen on mastering advanced practices, improving agility, and developing efficient data pipelines. Perfect for anyone looking to future-proof their skills in data engineering.

The Data Product Management In Action podcast, brought to you by Soda and executive producer Scott Hirleman, is a platform for data product management practitioners to share insights and experiences. In Season 01, Episode 19, host Nadiem von Heydebrand interviews Pradeep Fernando, who leads the data and metadata management initiative at Swisscom. They explore key topics in data product management, including the definition and categorization of data products, the role of AI, prioritization strategies, and the application of product management principles. Pradeep shares valuable insights and experiences on successfully implementing data product management within organizations.

About our host Nadiem von Heydebrand: Nadiem is the CEO and Co-Founder of Mindfuel. In 2019, he merged his passion for data science with product management, becoming a thought leader in data product management. Nadiem is dedicated to demonstrating the true value contribution of data. With over a decade of experience in the data industry, Nadiem leverages his expertise to scale data platforms, implement data mesh concepts, and transform AI performance into business performance, delighting consumers at global organizations that include Volkswagen, Munich Re, Allianz, Red Bull, and Vorwerk. Connect with Nadiem on LinkedIn.

About our guest Pradeep Fernando: Pradeep is a seasoned data product leader with over 6 years of data product leadership experience and over 10 years of product management experience. He leads or is a key contributor to several company-wide data & analytics initiatives at Swisscom, such as Data as a Product (Data Mesh), One Data Platform, Machine Learning (Factory), metadata management, self-service data & analytics, BI tooling strategy, cloud transformation, big data platforms, and data warehousing. Previously, he was a product manager in both Swisscom's B2B and Innovation units, building new products and optimizing mature products (profitability) in the domains of enterprise mobile fleet management and cyber- and mobile-device security. Pradeep is also passionate about and experienced in leading the development of data products and transforming IT delivery teams into empowered, agile product teams. And he is always happy to engage in a conversation about lean product management or "heavier" topics such as humanity's future or our past. Connect with Pradeep on LinkedIn.

All views and opinions expressed are those of the individuals and do not necessarily reflect their employers or anyone else.

Join the conversation on LinkedIn. Apply to be a guest or nominate someone that you know. Do you love what you're listening to? Please rate and review the podcast, and share it with fellow practitioners you know. Your support helps us reach more listeners and continue providing valuable insights!

If you are working in or trying to break into data and want to learn how to fast-track your career, this one is for you! In this episode, Jess Ramos (180k+ followers on LinkedIn!) shares her best tips and practical advice to help take your career to the next level.

What You'll Learn:
• How specializing and building niche skills can lead to big opportunities
• The importance of a personal brand if you want to accelerate your career
• Jess' top tips for those looking to break into data and move up quickly

Register for free to be part of the next live session: https://bit.ly/3XB3A8b

About our guest: Jess Ramos is the founder of Big Data Energy, a Senior Data Analyst at Crunchbase, a LinkedIn Learning Instructor, and a content creator in the data space. She loves to empower people to grow their careers in data while breaking industry stereotypes!

Jess' Newsletter
Follow Jess on LinkedIn


Every organization today is exploring generative AI to drive value and push their business forward. But a common pitfall is that AI strategies often don’t align with business objectives, leading companies to chase flashy tools rather than focusing on what truly matters. How can you avoid these traps and ensure your AI efforts are not only innovative but also aligned with real business value?

Leon Gordon is a leader in data analytics and AI, a Microsoft Data Platform MVP based in the UK, and the founder of Onyx Data. During the last decade, he has helped organizations improve their business performance, use data more intelligently, and understand the implications of new technologies such as artificial intelligence and big data. Leon is an Executive Contributor to Brainz Magazine, a Thought Leader in Data Science for the Global AI Hub, and chair of the Microsoft Power BI – UK community group and the DataDNA data visualization community, as well as an international speaker and advisor.

In the episode, Adel and Leon explore aligning AI with business strategy, building AI use cases, enterprise AI agents, AI and data governance, data-driven decision making, key skills for cross-functional teams, AI for automation and augmentation, privacy and AI, and much more.

Links Mentioned in the Show:
Onyx Data
Connect with Leon
Leon’s LinkedIn
Course - How to Build and Execute a Successful Data Strategy
Skill Track: AI Business Fundamentals
Related Episode: Generative AI in the Enterprise with Steve Holden, Senior Vice President and Head of Single-Family Analytics at Fannie Mae
Rewatch sessions from RADAR: AI Edition

Our Keynote Panel brings together three Gold Medal Olympians to discuss how they overcame personal challenges and use data to achieve success at the highest levels of sport.

Moderated by Clare Balding, the conversation will delve into how data analytics has transformed their training and competition strategies. They’ll share insights on how data is used across different sports to optimize performance and gain a competitive edge. The discussion will highlight the balance between analytical approaches and the instinctive, experiential aspects of competition.

Attendees will hear inspiring stories of triumph over adversity and gain a deeper understanding of how data is driving success in elite sports today. 

This session offers valuable perspectives on the future of sports analytics and its impact on athletic performance.

Big data means big implications for infrastructure cost and the complexity of running in production. LiveEO sells a data product based on satellite imagery, a dataset that's notoriously huge. Because this product is used in critical conditions such as disaster response and supply chain compliance, efficiency is key to the product's success.

Every data request has serious cost implications, from fetching the raw image from space to actually processing it on thousands of GPUs. To maximize their first-run success rate and therefore efficiency, LiveEO built a composable data platform to minimize failure, expose new data products faster to dependent teams, and scale efficiently.

Big data has moved beyond being just a buzzword; it's now at the heart of modern business strategies. When used effectively and efficiently, data can open up new revenue opportunities, provide deep insights, and even drive social impact. As digital transformation accelerates, data is no longer just a tool—it's woven into the fabric of every part of an organization. Designing and maintaining a tier 1 data platform has become essential to staying ahead of the competition. 

Especially with AI-driven applications on the rise, the convergence of DevSecOps and DataOps is becoming increasingly critical. The recent global disruption caused by a security company's mistake was a wake-up call—highlighting just how high the stakes can be. Building and scaling data platforms isn't enough; security and scalability need to be integral to the entire data lifecycle. 

Bringing more than a decade of SRE experience to maintaining and managing top enterprise software, we will discuss how to tear down silos and encourage collaboration among development, security, operations, and data teams. By doing so, organizations can achieve unprecedented levels of reliability and security. Integrating DevSecOps with DataOps doesn't just automate and protect data operations—it also safeguards data integrity, privacy, and compliance, even as data environments expand in size and complexity. In today's competitive market, this proactive stance is what will set the leaders apart from the rest.

Main Actionable Takeaways:

• Cultivate a Collaborative Culture

• Prioritize Resilience and Recovery

• Integrate Security Seamlessly into the Data Pipeline