talk-data.com

Topic: Big Data

Tags: data_processing, analytics, large_datasets (1217 activities tagged)

Activity Trend: peak of 28 activities per quarter, 2020-Q1 to 2026-Q1

Activities

1217 activities · Newest first

Send us a text. Want to be featured as a guest on Making Data Simple? Reach out to us at [[email protected]] and tell us why you should be next.

Abstract Hosted by Al Martin, VP, Data and AI Expert Services and Learning at IBM, Making Data Simple provides the latest thinking on big data, A.I., and the implications for the enterprise from a range of experts. This week on Making Data Simple, we have Nancy Hensley. Nancy is currently the Chief Marketing and Product Officer for Stats Perform and was previously the Chief Digital Officer at IBM.

Show Notes 1:37 – Nancy’s bio 3:10 – Are we talking Moneyball? 5:52 – On-base percentage 7:08 – Analysis examples 10:02 – Do you control the data? 11:24 – Out-there statistics 14:12 – Can analytics go too far? 17:35 – Real-time analysis 18:45 – COVID and sports 21:15 – Your role in sports betting 22:50 – What’s the most fascinating thing you’ve learned? 25:23 – What’s the future?

Website - Stats Perform. Moneyball. Stats Perform - Twitter. Bill James – Baseball Abstract. The Analyst. Connect with the Team: Producer Kate Brown - LinkedIn. Producer Steve Templeton - LinkedIn. Host Al Martin - LinkedIn and Twitter. Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

Data Analysis with Python and PySpark

Think big about your data! PySpark brings the powerful Spark big data processing engine to the Python ecosystem, letting you seamlessly scale up your data tasks and create lightning-fast pipelines. In Data Analysis with Python and PySpark you will learn how to: Manage your data as it scales across multiple machines Scale up your data programs with full confidence Read and write data to and from a variety of sources and formats Deal with messy data with PySpark’s data manipulation functionality Discover new data sets and perform exploratory data analysis Build automated data pipelines that transform, summarize, and get insights from data Troubleshoot common PySpark errors Create reliable long-running jobs Data Analysis with Python and PySpark is your guide to delivering successful Python-driven data projects. Packed with relevant examples and essential techniques, this practical book teaches you to build pipelines for reporting, machine learning, and other data-centric tasks. Quick exercises in every chapter help you practice what you’ve learned, and rapidly start implementing PySpark into your data systems. No previous knowledge of Spark is required. About the Technology The Spark data processing engine is an amazing analytics factory: raw data comes in, insight comes out. PySpark wraps Spark’s core engine with a Python-based API. It helps simplify Spark’s steep learning curve and makes this powerful tool available to anyone working in the Python data ecosystem. About the Book Data Analysis with Python and PySpark helps you solve the daily challenges of data science with PySpark. You’ll learn how to scale your processing capabilities across multiple machines while ingesting data from any source—whether that’s Hadoop clusters, cloud data storage, or local data files. Once you’ve covered the fundamentals, you’ll explore the full versatility of PySpark by building machine learning pipelines, and blending Python, pandas, and PySpark code. What's Inside Organizing your PySpark code Managing your data, no matter the size Scaling up your data programs with full confidence Troubleshooting common data pipeline problems Creating reliable long-running jobs About the Reader Written for data scientists and data engineers comfortable with Python. About the Author As an ML director for a data-driven software company, Jonathan Rioux uses PySpark daily. He teaches the software to data scientists, engineers, and data-savvy business analysts. Quotes A clear and in-depth introduction for truly tackling big data with Python. - Gustavo Patino, Oakland University William Beaumont School of Medicine The perfect way to learn how to analyze and master huge datasets. - Gary Bake, Brambles Covers both basic and more advanced topics of PySpark, with a good balance between theory and hands-on. - Philippe Van Bergen, P² Consulting For beginner to pro, a well-written book to help understand PySpark. - Raushan Kumar Jha, Microsoft
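
To make that workflow concrete, here is a minimal PySpark sketch (not taken from the book) that reads a CSV, cleans it, aggregates, and writes Parquet; the sales.csv file and its column names are hypothetical.

```python
# A minimal sketch of a PySpark pipeline, assuming a local Spark installation
# (pip install pyspark) and a hypothetical sales.csv file.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("sales-summary").getOrCreate()

# Read messy CSV data, letting Spark infer the schema.
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Clean and summarize: drop rows missing an amount, then total sales per region.
summary = (
    sales
    .dropna(subset=["amount"])
    .groupBy("region")
    .agg(F.sum("amount").alias("total_amount"))
)

# Write the result as Parquet so downstream jobs can read it efficiently.
summary.write.mode("overwrite").parquet("sales_summary.parquet")
spark.stop()
```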

Data Mesh

We're at an inflection point in data, where our data management solutions no longer match the complexity of organizations, the proliferation of data sources, and the scope of our aspirations to get value from data with AI and analytics. In this practical book, author Zhamak Dehghani introduces data mesh, a decentralized sociotechnical paradigm drawn from modern distributed architecture that provides a new approach to sourcing, sharing, accessing, and managing analytical data at scale. Dehghani guides practitioners, architects, technical leaders, and decision makers on their journey from traditional big data architecture to a distributed and multidimensional approach to analytical data management. Data mesh treats data as a product, considers domains as a primary concern, applies platform thinking to create self-serve data infrastructure, and introduces a federated computational model of data governance. Get a complete introduction to data mesh principles and its constituents Design a data mesh architecture Guide a data mesh strategy and execution Navigate organizational design to a decentralized data ownership model Move beyond traditional data warehouses and lakes to a distributed data mesh

We talked about:

Rahul’s background
What do data engineering managers do and why do we need them?
Balancing engineering and management
Rahul’s transition into data engineering management
The importance of updating your skill set
Planning the transition to manager and other challenges
Setting expectations for the team and measuring success
Data reconciliation
GDPR compliance
Data modeling for Big Data
Advice for people transitioning into data engineering management
Staying on top of trends and enabling team members
The qualities of a good data engineering team
The qualities of a good data engineer candidate (interview advice)
The difference between having knowledge and stuffing a CV with buzzwords
Advice for students and fresh graduates
An overview of an end-to-end data engineering process

Links:

Rahul's LinkedIn: https://www.linkedin.com/in/16rahuljain/

Join DataTalks.Club: https://datatalks.club/slack.html

Our events: https://datatalks.club/events.html

The advent of big data, self-service analytics, and cloud applications has created a need for new ways to manage data access. New data access governance tools promise to simplify and standardize data access and authorization across an enterprise. Data management expert Sanjeev Mohan provides an industry perspective on this emerging technology and what it means for data analytics teams.

In the physical world, you can see a bridge rusting or a building facade crumbling and know you have to intervene to prevent the infrastructure from collapsing. But when all you have is bits and bytes - digital stuff, like software and data - how can you tell if your customer-facing digital interactions or data-driven analytics and models are about to go up in smoke?

Observability is a new term that describes what we used to call IT monitoring. The new moniker is fitting given all the technology changes that have happened in the past decade. The cloud, big data, microservices, containers, cloud applications, machine learning, and artificial intelligence have created a dramatically complex IT and data environment that is harder than ever to manage. And the stakes are higher as organizations move their operations online to compete with digital natives. Today, you can't run digital or data operations without observability tools.

Kevin Petrie is one of the industry's foremost experts on observability. He is vice president of research at Eckerson Group where he leads a team of distinguished analysts. He recently wrote an article titled "The Five Shades of Observability" that describes five types of observability tools. In this podcast, we discuss what observability is, why you need it, and the types of available tools. We also speculate on the future of this technology and recommend how to select an appropriate observability product.

Actionable Insights with Amazon QuickSight

Discover the power of Amazon QuickSight with this comprehensive guide. Learn to create stunning data visualizations, integrate machine learning insights, and automate operations to optimize your data analytics workflows. This book offers practical guidance on utilizing QuickSight to develop insightful and interactive business intelligence solutions. What this Book will help me do Understand the role of Amazon QuickSight within the AWS analytics ecosystem. Learn to configure data sources and develop visualizations effectively. Gain skills in adding interactivity to dashboards using custom controls and parameters. Incorporate machine learning capabilities into your dashboards, including forecasting and anomaly detection. Explore advanced features like QuickSight APIs and embedded multi-tenant analytics design. Author(s) Samatas is an AWS-certified big data solutions architect with years of experience in designing and implementing scalable analytics solutions. With a clear and practical approach, the author teaches how to effectively leverage Amazon QuickSight for efficient and insightful business intelligence applications. Their expertise ensures readers will gain actionable skills. Who is it for? This book is ideal for business intelligence (BI) developers and data analysts looking to deepen their expertise in creating interactive dashboards using Amazon QuickSight. It is a perfect guide for professionals aiming to explore machine learning integration in BI solutions. Familiarity with basic data visualization concepts is recommended, but no prior experience with Amazon QuickSight is needed.
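
As a rough illustration of the QuickSight APIs mentioned above, here is a minimal boto3 sketch that lists the dashboards in an account; the account ID and region are placeholders, and credentials are assumed to come from your normal AWS configuration.

```python
# A minimal sketch of calling the QuickSight API via boto3. The account ID and
# region are placeholders; credentials come from the usual AWS configuration
# (environment variables, a profile, or an instance role).
import boto3

ACCOUNT_ID = "123456789012"  # hypothetical AWS account ID
qs = boto3.client("quicksight", region_name="us-east-1")

# List the dashboards in the account and print their names and IDs.
resp = qs.list_dashboards(AwsAccountId=ACCOUNT_ID)
for dash in resp["DashboardSummaryList"]:
    print(dash["Name"], dash["DashboardId"])
```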

Building Big Data Pipelines with Apache Beam

Building Big Data Pipelines with Apache Beam is the essential guide for mastering data processing using Apache Beam. This book covers both the basics and advanced concepts, from implementing pipelines to extending functionalities with custom I/O connectors. By the end, you'll be equipped to build scalable and reusable big data solutions. What this Book will help me do Understand the core principles of Apache Beam and its architecture. Learn how to create efficient data processing pipelines for diverse scenarios. Master the use of stateful processing for real-time data handling. Gain skills in using Beam's portability features for various languages. Explore advanced functionalities like creating custom I/O connectors. Author(s) Lukavský is a seasoned data engineer with extensive experience in big data technologies and Apache Beam. Having worked on innovative data solutions across industries, the author brings hands-on insights and practical expertise to this book. Their approach to teaching ensures readers can directly apply concepts to real-world scenarios. Who is it for? This book is designed for professionals involved in big data, such as data engineers, analysts, and scientists. It is particularly suited for those with an intermediate level of understanding of Java, aiming to expand their skill set to include advanced data pipeline construction. Whether you're stepping into Apache Beam for the first time or looking to deepen your expertise, this book offers valuable, actionable insights.
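
The book itself works mainly in Java, but the same Beam model is available from Python; below is a minimal word-count pipeline sketch on the local DirectRunner, with input.txt and the output prefix as hypothetical placeholders.

```python
# A minimal Beam word-count sketch using the Python SDK (pip install apache-beam),
# run on the local DirectRunner. input.txt and the output prefix are placeholders.
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read lines" >> beam.io.ReadFromText("input.txt")
        | "Split words" >> beam.FlatMap(lambda line: line.split())
        | "Pair with 1" >> beam.Map(lambda word: (word, 1))
        | "Count per word" >> beam.CombinePerKey(sum)
        | "Format" >> beam.MapTuple(lambda word, count: f"{word}: {count}")
        | "Write" >> beam.io.WriteToText("word_counts")
    )
```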

Data Engineering with AWS

Discover how to effectively build and manage data engineering pipelines using AWS with "Data Engineering with AWS". In this hands-on book, you'll explore the foundational principles of data engineering, learn to architect data pipelines, and work with essential AWS services to process, transform, and analyze data. What this Book will help me do Understand and implement modern data engineering pipelines with AWS services. Gain proficiency in automating data ingestion and transformation using Amazon tools. Perform efficient data queries and analysis leveraging Amazon Athena and Redshift. Create insightful data visualizations using Amazon QuickSight. Apply machine learning techniques to enhance data engineering processes. Author(s) Eagar, a Senior Data Architect with over twenty-five years of experience, specializes in modern data architectures and cloud solutions. With a rich background in applying data engineering to real-world problems, Eagar shares expertise in a clear and approachable way for readers. Who is it for? This book is perfect for data engineers and data architects aiming to grow their expertise in AWS-based solutions. It's also geared towards beginners in data engineering wanting to adopt the best practices. Those with a basic understanding of big data and cloud platforms will find it particularly valuable, but prior AWS experience is not required.
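
As a rough sketch of the kind of Athena query step described above, the following boto3 snippet starts a query and reads back the results; the database, table, and S3 output location are hypothetical.

```python
# A minimal sketch of running an Athena query via boto3. The database, table,
# and S3 result location are hypothetical placeholders.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Start an asynchronous query; Athena writes results to the given S3 location.
response = athena.start_query_execution(
    QueryString="SELECT region, COUNT(*) AS orders FROM sales GROUP BY region",
    QueryExecutionContext={"Database": "analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query finishes, then print the first page of results.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```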

Optimizing Databricks Workloads

Unlock the full potential of Apache Spark on the Databricks platform with "Optimizing Databricks Workloads". This book equips you with must-know techniques to effectively configure, manage, and optimize big data processing pipelines. Dive into real-world scenarios and learn practical approaches to reduce costs and improve performance in your data engineering processes. What this Book will help me do Understand and apply optimization techniques for Databricks workloads. Choose the right cluster configurations to maximize efficiency and minimize costs. Leverage Delta Lake for performance-boosted data processing and optimization. Develop skills for managing Spark DataFrames and core functionalities in Databricks. Gain insights into real-world scenarios to effectively improve workload performance. Author(s) Anirudh Kala and the co-authors are experienced practitioners in the fields of data engineering and analytics. With years of professional expertise in leveraging Apache Spark and Databricks, they bring real-world insight into performance optimization. Their approach blends practical instruction with actionable strategies, making this book an essential guide for data engineers aiming to excel in this domain. Who is it for? This book is tailored for data engineers, data scientists, and cloud architects looking to elevate their skills in managing Databricks workloads. Ideal for readers with basic knowledge of Spark and Databricks, it helps them get hands-on with optimization techniques. If you are aiming to enhance your Spark-based data processing systems, this book offers the guidance you need.
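
To illustrate the flavor of such optimizations (not a recipe from the book), here is a small PySpark sketch showing two common levers, shuffle-partition sizing and partitioned output; the paths and column names are hypothetical, and in a Databricks notebook the Spark session already exists.

```python
# A minimal sketch of two common Spark tuning levers: right-sizing shuffle
# partitions and partitioning output data. Paths and columns are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tuning-demo").getOrCreate()

# Reduce shuffle parallelism for a modest dataset to avoid many tiny tasks.
spark.conf.set("spark.sql.shuffle.partitions", "64")

events = spark.read.parquet("/data/events")

# Cache a DataFrame that is reused by several downstream aggregations.
events.cache()

daily = events.groupBy("event_date", "country").count()

# Write the result partitioned by date so later queries can prune files.
daily.write.mode("overwrite").partitionBy("event_date").parquet("/data/daily_counts")
```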

Snowflake Essentials: Getting Started with Big Data in the Cloud

Understand the essentials of the Snowflake Database and the overall Snowflake Data Cloud. This book covers how Snowflake’s architecture is different from prior on-premises and cloud databases. The authors also discuss, from an insider perspective, how Snowflake grew so fast to become the largest software IPO of all time. Snowflake was the first database made specifically to be optimized with a cloud architecture. This book helps you get started using Snowflake by first understanding its architecture and what separates it from other database platforms you may have used. You will learn about setting up users and accounts, and then creating database objects. You will know how to load data into Snowflake and query and analyze that data, including unstructured data such as data in XML and JSON formats. You will also learn about Snowflake’s compute platform and the different data sharing options that are available. What You Will Learn Run analytics in the Snowflake Data Cloud Create users and roles in Snowflake Set up security in Snowflake Set up resource monitors in Snowflake Set up and optimize Snowflake Compute Load, unload, and query structured and unstructured data (JSON, XML) within Snowflake Use Snowflake Data Sharing to share data Set up a Snowflake Data Exchange Use the Snowflake Data Marketplace Who This Book Is For Database professionals or information technology professionals who want to move beyond traditional database technologies by learning Snowflake, a new and massively scalable cloud-based database solution
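
As a minimal illustration of querying Snowflake from Python, the sketch below uses the snowflake-connector-python package against a hypothetical raw_events table with a VARIANT (JSON) column; the account, credentials, and object names are placeholders.

```python
# A minimal sketch of connecting to Snowflake from Python and querying
# semi-structured JSON. All connection values and table/column names are
# placeholders (pip install snowflake-connector-python).
import snowflake.connector

conn = snowflake.connector.connect(
    account="myorg-myaccount",   # placeholder account identifier
    user="ANALYST",
    password="********",
    warehouse="COMPUTE_WH",
    database="DEMO_DB",
    schema="PUBLIC",
)

cur = conn.cursor()
try:
    # Colon notation pulls fields out of the VARIANT (JSON) column.
    cur.execute(
        "SELECT payload:country::string AS country, COUNT(*) "
        "FROM raw_events GROUP BY 1 ORDER BY 2 DESC LIMIT 10"
    )
    for country, n in cur.fetchall():
        print(country, n)
finally:
    cur.close()
    conn.close()
```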

Mastering Apache Pulsar

Every enterprise application creates data, including log messages, metrics, user activity, and outgoing messages. Learning how to move these items is almost as important as the data itself. If you're an application architect, developer, or production engineer new to Apache Pulsar, this practical guide shows you how to use this open source event streaming platform to handle real-time data feeds. Jowanza Joseph, staff software engineer at Finicity, explains how to deploy production Pulsar clusters, write reliable event streaming applications, and build scalable real-time data pipelines with this platform. Through detailed examples, you'll learn Pulsar's design principles, reliability guarantees, key APIs, and architecture details, including the replication protocol, the load manager, and the storage layer. This book helps you: Understand how event streaming fits in the big data ecosystem Explore Pulsar producers, consumers, and readers for writing and reading events Build scalable data pipelines by connecting Pulsar with external systems Simplify event-streaming application building with Pulsar Functions Manage Pulsar to perform monitoring, tuning, and maintenance tasks Use Pulsar's operational measurements to secure a production cluster Process event streams using Flink and query event streams using Presto
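
For a rough sense of the producer and consumer APIs, here is a minimal sketch using the pulsar-client Python package against a broker assumed to be running locally; the topic and subscription names are made up.

```python
# A minimal producer/consumer sketch with the pulsar-client Python package,
# assuming a Pulsar broker on the default local port. Topic and subscription
# names are hypothetical.
import pulsar

client = pulsar.Client("pulsar://localhost:6650")

# Produce a few events to a topic.
producer = client.create_producer("user-activity")
for i in range(3):
    producer.send(f"event-{i}".encode("utf-8"))

# Consume them back on a named subscription and acknowledge each message.
consumer = client.subscribe("user-activity", subscription_name="demo-subscription")
for _ in range(3):
    msg = consumer.receive(timeout_millis=10000)
    print(msg.data().decode("utf-8"))
    consumer.acknowledge(msg)

client.close()
```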

We talked about:

Eleni’s background
Spatial data analytics
Responsibilities of a postdoc
Publishing papers
Best places for data management papers
Differences between postdoc and PhD
Helping students become successful
Research at the DIMA group
Identifying important research directions
Reviewing papers
Underrated topics in data management
Research in data cleaning
Collaborating with others
Choosing the field for Master’s students
Choosing the topic for a Master thesis
Should I do a PhD?
Promoting computer science to female students

Links:

https://www.user.tu-berlin.de/tzirita/

Join DataTalks.Club: https://datatalks.club/slack.html

Our events: https://datatalks.club/events.html

Send us a text. Want to be featured as a guest on Making Data Simple? Reach out to us at [[email protected]] and tell us why you should be next.

Abstract Hosted by Al Martin, VP, IBM Expert Services Delivery, Making Data Simple provides the latest thinking on big data, A.I., and the implications for the enterprise from a range of experts. This week on Making Data Simple, we have Tim Freestone. Tim is the founder of Alooba, a skills assessment platform for analytics, data science, and data engineering that helps businesses identify the best candidates applying for roles within their companies. Show Notes 4:46 – How do you go from economics teacher to head of business intelligence? 7:53 – Do CVs matter anymore? 13:22 – What business problem is Alooba solving? 16:05 – Do you have any data that supports your theory? 19:01 – Why analytics, data science, data engineering? 20:26 – What do you do that others don’t? 23:50 – How does Alooba define success? 25:42 – Who’s your target client base? 32:40 – Is there a customer you can talk about? 36:24 – What does Alooba mean? Alooba Connect with the Team: Producer Kate Brown - LinkedIn. Producer Steve Templeton - LinkedIn. Host Al Martin - LinkedIn and Twitter. Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

Send us a text. Want to be featured as a guest on Making Data Simple? Reach out to us at [[email protected]] and tell us why you should be next.

Abstract Hosted by Al Martin, VP, IBM Expert Services Delivery, Making Data Simple provides the latest thinking on big data, A.I., and the implications for the enterprise from a range of experts. This week on Making Data Simple, we have Ariel Utnik. Ariel is COO and GM at Verbit, an AI-driven transcription and captioning solution. Ariel has over 20 years of experience in enterprise tech, focusing on driving business efficiencies and enabling scale through AI. He can speak to how AI is transforming industries – from our court systems to entertainment and education – and how it can be used to democratize information across the globe. Ariel has a strong customer-focused background (like Al): Chief Customer Officer at Feedvisor, VP of Customer Success at Panaya, and VP of Customer Success at Clarizen. Show Notes 2:51 – Tell us about customer success 4:35 – How do you make the transition from Chief Customer Officer to COO? 6:47 – What is Verbit’s main mission objective? 14:15 – How many languages do you support? 14:42 – What makes Verbit’s technology better? 19:10 – How do you train your AI? 22:47 – What solutions does Verbit offer? 27:20 – How far are we from a chat translator? 32:30 – What is the size of the market? Verbit Ariel Utnik - LinkedIn Connect with the Team: Producer Kate Brown - LinkedIn. Producer Steve Templeton - LinkedIn. Host Al Martin - LinkedIn and Twitter. Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

I'm tired of data elitism. I want to be inclusive. 

If you are in the data space, join me in being inclusive. 

Join me on the journey at DataCareerJumpstart.com

Mentioned in this episode: Join the last cohort of 2025! The LAST cohort of The Data Analytics Accelerator for 2025 kicks off on Monday, December 8th and enrollment is officially open!

To celebrate the end of the year, we’re running a special End-of-Year Sale, where you’ll get: ✅ A discount on your enrollment 🎁 6 bonus gifts, including job listings, interview prep, AI tools + more

If your goal is to land a data job in 2026, this is your chance to get ahead of the competition and start strong.

👉 Join the December Cohort & Claim Your Bonuses: https://DataCareerJumpstart.com/daa

Essential PySpark for Scalable Data Analytics

Dive into the world of scalable data processing with 'Essential PySpark for Scalable Data Analytics'. This book is a comprehensive guide that helps beginners understand and utilize PySpark to process, analyze, and draw insights from large datasets effectively. With hands-on tutorials and clear explanations, you will gain the confidence to tackle big data analytics challenges. What this Book will help me do Understand and apply the distributed computing paradigm for big data. Learn to perform scalable data ingestion, cleansing, and preparation using PySpark. Create and utilize data lakes and the Lakehouse paradigm for efficient data storage and access. Develop and deploy machine learning models with scalability in mind. Master real-time analytics pipelines and create impactful data visualizations. Author(s) Nudurupati is an experienced data engineer and educator, specializing in distributed systems and big data technologies. With years of practical experience in the field, the author brings a clear and approachable teaching style to technical topics. Passionate about empowering readers, the author has designed this book to be both practical and inspirational for aspiring data practitioners. Who is it for? This book is ideal for data professionals including data scientists, engineers, and analysts looking to scale their data analytics processes. It assumes familiarity with basic data science concepts and Python, as well as some experience with SQL-like data analysis. This is particularly suitable for individuals aiming to expand their knowledge in distributed computing and PySpark to handle big data challenges. Achieving scalable and efficient data solutions is at the core of this guide.
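
To give a flavor of the scalable machine-learning side, here is a minimal Spark MLlib pipeline sketch on a hypothetical churn dataset; the path, feature columns, and label column are placeholders.

```python
# A minimal Spark MLlib pipeline sketch on a hypothetical churn dataset with
# numeric feature columns and a binary "churned" label.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("churn-model").getOrCreate()

df = spark.read.parquet("/data/churn_features")  # placeholder path

# Assemble feature columns into a single vector, then fit a classifier.
assembler = VectorAssembler(
    inputCols=["tenure_months", "monthly_spend", "support_tickets"],
    outputCol="features",
)
lr = LogisticRegression(featuresCol="features", labelCol="churned")

model = Pipeline(stages=[assembler, lr]).fit(df)

# Score records with the fitted pipeline and inspect a few predictions.
predictions = model.transform(df)
predictions.select("churned", "prediction", "probability").show(5)
```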

Send us a text. Want to be featured as a guest on Making Data Simple? Reach out to us at [[email protected]] and tell us why you should be next.

Abstract Hosted by Al Martin, VP, IBM Expert Services Delivery, Making Data Simple provides the latest thinking on big data, A.I., and the implications for the enterprise from a range of experts.

This week on Making Data Simple, we have Paul Zikopoulos. Paul is the VP of IBM Technology Sales – Skills Vitality & Enablement Global Markets. Paul is an award-winning speaker and author and has been at IBM for 28 years.

Show Notes 3:40 – Is Skills Vitality not the perfect job? 5:08 – What’s been your journey at IBM? 8:29 – Hybrid Cloud Operation and Artificial Intelligence – is this the right strategy? 21:13 – What is the new maturity curve? 23:53 – Define Data Fabric 26:53 – Why is the Challenger Sale book so important? 29:14 – What makes a great leader? Books: Energy Bus, Grit, Challenger Sale, Challenger Customer, Effortless Experience. Connect with the Team: Producer Kate Brown - LinkedIn. Producer Steve Templeton - LinkedIn. Host Al Martin - LinkedIn and Twitter. Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

Beginning Apache Spark 3: With DataFrame, Spark SQL, Structured Streaming, and Spark Machine Learning Library

Take a journey toward discovering, learning, and using Apache Spark 3.0. In this book, you will gain expertise on the powerful and efficient distributed data processing engine inside of Apache Spark; its user-friendly, comprehensive, and flexible programming model for processing data in batch and streaming; and the scalable machine learning algorithms and practical utilities to build machine learning applications. Beginning Apache Spark 3 begins by explaining different ways of interacting with Apache Spark, such as Spark Concepts and Architecture, and Spark Unified Stack. Next, it offers an overview of Spark SQL before moving on to its advanced features. It covers tips and techniques for dealing with performance issues, followed by an overview of the structured streaming processing engine. It concludes with a demonstration of how to develop machine learning applications using Spark MLlib and how to manage the machine learning development lifecycle. This book is packed with practical examples and code snippets to help you master concepts and features immediately after they are covered in each section. After reading this book, you will have the knowledge required to build your own big data pipelines, applications, and machine learning applications. What You Will Learn Master the Spark unified data analytics engine and its various components, which work in tandem to provide a scalable, fault-tolerant, and performant data processing engine Leverage the user-friendly and flexible programming model to perform simple to complex data analytics using DataFrames and Spark SQL Develop machine learning applications using Spark MLlib Manage the machine learning development lifecycle using MLflow Who This Book Is For Data scientists, data engineers and software developers.
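
As a small taste of the structured streaming material, the sketch below uses Spark's built-in rate source so it needs no external input; it is illustrative only, not an excerpt from the book.

```python
# A minimal Structured Streaming sketch: the built-in "rate" source generates
# rows, which are counted per 10-second window and printed to the console.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# The rate source emits (timestamp, value) rows at a fixed rate.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

counts = stream.groupBy(F.window("timestamp", "10 seconds")).count()

# Complete output mode re-emits the full aggregate table on each trigger.
query = (
    counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()
```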

Data Engineering with Apache Spark, Delta Lake, and Lakehouse

Data Engineering with Apache Spark, Delta Lake, and Lakehouse is a comprehensive guide packed with practical knowledge for building robust and scalable data pipelines. Throughout this book, you will explore the core concepts and applications of Apache Spark and Delta Lake, and learn how to design and implement efficient data engineering workflows using real-world examples. What this Book will help me do Master the core concepts and components of Apache Spark and Delta Lake. Create scalable and secure data pipelines for efficient data processing. Learn best practices and patterns for building enterprise-grade data lakes. Discover how to operationalize data models into production-ready pipelines. Gain insights into deploying and monitoring data pipelines effectively. Author(s) Kukreja is a seasoned data engineer with over a decade of experience working with big data platforms. He specializes in implementing efficient and scalable data solutions to meet the demands of modern analytics and data science. Writing with clarity and a practical approach, he aims to provide actionable insights that professionals can apply to their projects. Who is it for? This book is tailored for aspiring data engineers and data analysts who wish to delve deeper into building scalable data platforms. It is suitable for those with basic knowledge of Python, Spark, and SQL who are seeking to learn Delta Lake and advanced data engineering concepts. Readers should be eager to develop practical skills for tackling real-world data engineering challenges.
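
To make the Delta Lake piece concrete, here is a minimal sketch assuming the delta-spark package is installed so the Delta format is available to the Spark session; the paths, table contents, and column names are placeholders.

```python
# A minimal Delta Lake sketch (pip install delta-spark). Paths and columns
# are placeholders; configure_spark_with_delta_pip wires the Delta jars in.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Write an initial batch as a Delta table, then append a second batch.
orders_v1 = spark.createDataFrame([(1, "new"), (2, "shipped")], ["order_id", "status"])
orders_v1.write.format("delta").mode("overwrite").save("/tmp/orders_delta")

orders_v2 = spark.createDataFrame([(3, "new")], ["order_id", "status"])
orders_v2.write.format("delta").mode("append").save("/tmp/orders_delta")

# Read the current table, then time-travel back to the first version.
spark.read.format("delta").load("/tmp/orders_delta").show()
spark.read.format("delta").option("versionAsOf", 0).load("/tmp/orders_delta").show()
```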