talk-data.com talk-data.com

Topic

Big Data

data_processing analytics large_datasets

299

tagged

Activity Trend

28 peak/qtr
2020-Q1 2026-Q1

Activities

Showing filtered results

Filtering by: O'Reilly Data Engineering Books ×
Engineering Lakehouses with Open Table Formats

Engineering Lakehouses with Open Table Formats introduces the architecture and capabilities of open table formats like Apache Iceberg, Apache Hudi, and Delta Lake. The book guides you through the design, implementation, and optimization of lakehouses that can handle modern data processing requirements effectively with real-world practical insights. What this Book will help me do Understand the fundamentals of open table formats and their benefits in lakehouse architecture. Learn how to implement performant data processing using tools like Apache Spark and Flink. Master advanced topics like indexing, partitioning, and interoperability between data formats. Explore data lifecycle management and integration with frameworks like Apache Airflow and dbt. Build secure lakehouses with regulatory compliance using best practices detailed in the book. Author(s) Dipankar Mazumdar and Vinoth Govindarajan are seasoned professionals with extensive experience in big data processing and software architecture. They bring their expertise from working with data lakehouses and are known for their ability to explain complex technical concepts clearly. Their collaborative approach brings valuable insights into the latest trends in data management. Who is it for? This book is ideal for data engineers, architects, and software professionals aiming to master modern lakehouse architectures. If you are familiar with data lakes or warehouses and wish to transition to an open data architectural design, this book is suited for you. Readers should have basic knowledge of databases, Python, and Apache Spark for the best experience.

Data Engineering for Beginners

A hands-on technical and industry roadmap for aspiring data engineers In Data Engineering for Beginners, big data expert Chisom Nwokwu delivers a beginner-friendly handbook for everyone interested in the fundamentals of data engineering. Whether you're interested in starting a rewarding, new career as a data analyst, data engineer, or data scientist, or seeking to expand your skillset in an existing engineering role, Nwokwu offers the technical and industry knowledge you need to succeed. The book explains: Database fundamentals, including relational and noSQL databases Data warehouses and data lakes Data pipelines, including info about batch and stream processing Data quality dimensions Data security principles, including data encryption Data governance principles and data framework Big data and distributed systems concepts Data engineering on the cloud Essential skills and tools for data engineering interviews and jobs Data Engineering for Beginners offers an easy-to-read roadmap on a seemingly complicated and intimidating subject. It addresses the topics most likely to cause a beginning data engineer to stumble, clearly explaining key concepts in an accessible way. You'll also find: A comprehensive glossary of data engineering terms Common and practical career paths in the data engineering industry An introduction to key cloud technologies and services you may encounter early in your data engineering career Perfect for practicing and aspiring data analysts, data scientists, and data engineers, Data Engineering for Beginners is an effective and reliable starting point for learning an in-demand skill. It's a powerful resource for everyone hoping to expand their data engineering Skillset and upskill in the big data era.

The Definitive Guide to OpenSearch

Learn how to harness the power of OpenSearch effectively with 'The Definitive Guide to OpenSearch'. This book explores installation, configuration, query building, and visualization, guiding readers through practical use cases and real-world implementations. Whether you're building search experiences or analyzing data patterns, this guide equips you thoroughly. What this Book will help me do Understand core OpenSearch principles, architecture, and the mechanics of its search and analytics capabilities. Learn how to perform data ingestion, execute advanced queries, and produce insightful visualizations on OpenSearch Dashboards. Implement scaling strategies and optimum configurations for high-performance OpenSearch clusters. Explore real-world case studies that demonstrate OpenSearch applications in diverse industries. Gain hands-on experience through practical exercises and tutorials for mastering OpenSearch functionality. Author(s) Jon Handler, Soujanya Konka, and Prashant Agrawal, celebrated experts in search technologies and big data analysis, bring their years of experience at AWS and other domains to this book. Their collective expertise ensures that readers receive both core theoretical knowledge and practical applications to implement directly. Who is it for? This book is aimed at developers, data professionals, engineers, and systems operators who work with search systems or analytics platforms. It is especially suitable for individuals in roles handling large-scale data, who want to improve their skills or deploy OpenSearch in production environments. Early learners and seasoned experts alike will find valuable insights.

Data Engineering with AWS Cookbook

Data Engineering with AWS Cookbook serves as a comprehensive practical guide for building scalable and efficient data engineering solutions using AWS. With this book, you will master implementing data lakes, orchestrating data pipelines, and creating serving layers using AWS's robust services, such as Glue, EMR, Redshift, and Athena. With hands-on exercises and practical recipes, you will enhance your AWS-based data engineering projects. What this Book will help me do Gain the skills to design centralized data lake solutions and manage them securely at scale. Develop expertise in crafting data pipelines with AWS's ETL technologies like Glue and EMR. Learn to implement and automate governance, orchestration, and monitoring for data platforms. Build high-performance data serving layers using AWS analytics tools like Redshift and QuickSight. Effectively plan and execute data migrations to AWS from on-premises infrastructure. Author(s) Trâm Ngọc Phạm, Gonzalo Herreros González, Viquar Khan, and Huda Nofal bring together years of collective experience in data engineering and AWS cloud solutions. Each author's deep knowledge and passion for cloud technology have shaped this book into a valuable resource, geared towards practical learning and real-world application. Their approach ensures readers are not just learning but building tangible, impactful solutions. Who is it for? This book is geared towards data engineers and big data professionals engaged in or transitioning to cloud-based environments, specifically on AWS. Ideal readers are those looking to optimize workflows and master AWS tools to create scalable, efficient solutions. The content assumes a basic familiarity with AWS concepts like IAM roles and a command-line interface, ensuring all examples are accessible yet meaningful for those seeking advancement in AWS data engineering.

Apache Spark for Machine Learning

Dive into the power of Apache Spark as a tool for handling and processing big data required for machine learning. With this book, you will explore how to configure, execute, and deploy machine learning algorithms using Spark's scalable architecture and learn best practices for implementing real-world big data solutions. What this Book will help me do Understand the integration of Apache Spark with large-scale infrastructures for machine learning applications. Employ data processing techniques for preprocessing and feature engineering efficiently with Spark. Master the implementation of advanced supervised and unsupervised learning algorithms using Spark. Learn to deploy machine learning models within Spark ecosystems for optimized performance. Discover methods for analyzing big data trends and machine learning model tuning for improved accuracy. Author(s) The author, Deepak Gowda, is an experienced data scientist with over ten years of expertise in machine learning and big data. His career spans industries such as supply chain, cybersecurity, and more where he has utilized Apache Spark extensively. Deepak's teaching style is marked by clarity and practicality, making complex concepts approachable. Who is it for? Apache Spark for Machine Learning is tailored for data engineers, machine learning practitioners, and computer science students looking to advance their ability to process, analyze, and model using large datasets. If you're already familiar with basic machine learning and want to scale your solutions using Spark, this book is ideal for your studies and professional growth.

Data Engineering Best Practices

Unlock the secrets to building scalable and efficient data architectures with 'Data Engineering Best Practices.' This book provides in-depth guidance on designing, implementing, and optimizing cloud-based data pipelines. You will gain valuable insights into best practices, agile workflows, and future-proof designs. What this Book will help me do Effectively plan and architect scalable data solutions leveraging cloud-first strategies. Master agile processes tailored to data engineering for improved project outcomes. Implement secure, efficient, and reliable data pipelines optimized for analytics and AI. Apply real-world design patterns and avoid common pitfalls in data flow and processing. Create future-ready data engineering solutions following industry-proven frameworks. Author(s) Richard J. Schiller and David Larochelle are seasoned data engineering experts with decades of experience crafting efficient and secure cloud-based infrastructures. Their collaborative writing distills years of real-world expertise into practical advice aimed at helping engineers succeed in a rapidly evolving field. Who is it for? This book is ideal for data engineers, ETL specialists, and big data professionals seeking to enhance their knowledge in cloud-based solutions. Some familiarity with data engineering, ETL pipelines, and big data technologies is helpful. It suits those keen on mastering advanced practices, improving agility, and developing efficient data pipelines. Perfect for anyone looking to future-proof their skills in data engineering.

Big Data on Kubernetes

Big Data on Kubernetes is your comprehensive guide to leveraging Kubernetes for scalable and efficient big data solutions. You will learn key concepts of Kubernetes architecture and explore tools like Apache Spark, Airflow, and Kafka. Gain hands-on experience building complete data pipelines to tackle real-world data challenges. What this Book will help me do Understand Kubernetes architecture and learn to deploy and manage clusters. Build and orchestrate big data pipelines using Spark, Airflow, and Kafka. Develop scalable and resilient data solutions with Docker and Kubernetes. Integrate and optimize data tools for real-time ingestion and processing. Apply concepts to hands-on projects addressing actual big data scenarios. Author(s) Neylson Crepalde is an experienced data specialist with extensive knowledge of Kubernetes and big data solutions. With deep practical experience, Neylson brings real-world insights to his writing. His approach emphasizes actionable guidance and relatable problem-solving with a strong foundation in scalable architecture. Who is it for? This book is ideal for data engineers, BI analysts, data team leaders, and tech managers familiar with Python, SQL, and YAML. Targeted at professionals seeking to develop or expand their expertise in scalable big data solutions, it provides practical insights into Docker, Kubernetes, and prominent big data tools.

Databricks Certified Associate Developer for Apache Spark Using Python

This book serves as the ultimate preparation for aspiring Databricks Certified Associate Developers specializing in Apache Spark. Deep dive into Spark's components, its applications, and exam techniques to achieve certification and expand your practical skills in big data processing and real-time analytics using Python. What this Book will help me do Deeply understand Apache Spark's core architecture for building big data applications. Write optimized SQL queries and leverage Spark DataFrame API for efficient data manipulation. Apply advanced Spark functions, including UDFs, to solve complex data engineering tasks. Use Spark Streaming capabilities to implement real-time and near-real-time processing solutions. Get hands-on preparation for the certification exam with mock tests and practice questions. Author(s) Saba Shah is a seasoned data engineer with extensive experience working at Databricks and leading data science teams. With her in-depth knowledge of big data applications and Spark, she delivers clear, actionable insights in this book. Her approach emphasizes practical learning and real-world applications. Who is it for? This book is ideal for data professionals such as engineers and analysts aiming to achieve Databricks certification. It is particularly helpful for individuals with moderate Python proficiency who are keen to understand Spark from scratch. If you're transitioning into big data roles, this guide prepares you comprehensively.

Data Engineering with Databricks Cookbook

In "Data Engineering with Databricks Cookbook," you'll learn how to efficiently build and manage data pipelines using Apache Spark, Delta Lake, and Databricks. This recipe-based guide offers techniques to transform, optimize, and orchestrate your data workflows. What this Book will help me do Master Apache Spark for data ingestion, transformation, and analysis. Learn to optimize data processing and improve query performance with Delta Lake. Manage streaming data processing with Spark Structured Streaming capabilities. Implement DataOps and DevOps workflows tailored for Databricks. Enforce data governance policies using Unity Catalog for scalable solutions. Author(s) Pulkit Chadha, the author of this book, is a Senior Solutions Architect at Databricks. With extensive experience in data engineering and big data applications, he brings practical insights into implementing modern data solutions. His educational writings focus on empowering data professionals with actionable knowledge. Who is it for? This book is ideal for data engineers, data scientists, and analysts who want to deepen their knowledge in managing and transforming large datasets. Readers should have an intermediate understanding of SQL, Python programming, and basic data architecture concepts. It is especially well-suited for professionals working with Databricks or similar cloud-based data platforms.

Databricks ML in Action

Dive into the Databricks Data Intelligence Platform and learn how to harness its full potential for creating, deploying, and maintaining machine learning solutions. This book covers everything from setting up your workspace to integrating state-of-the-art tools such as AutoML and VectorSearch, imparting practical skills through detailed examples and code. What this Book will help me do Set up and manage a Databricks workspace tailored for effective data science workflows. Implement monitoring to ensure data quality and detect drift efficiently. Build, fine-tune, and deploy machine learning models seamlessly using Databricks tools. Operationalize AI projects including feature engineering, data pipelines, and workflows on the Databricks Lakehouse architecture. Leverage integrations with popular tools like OpenAI's ChatGPT to expand your AI project capabilities. Author(s) This book is authored by Stephanie Rivera, Anastasia Prokaieva, Amanda Baker, and Hayley Horn, seasoned experts in data science and machine learning from Databricks. Their collective years of expertise in big data and AI technologies ensure a rich and insightful perspective. Through their work, they strive to make complex concepts accessible and actionable. Who is it for? This book serves as an ideal guide for machine learning engineers, data scientists, and technically inclined managers. It's well-suited for those transitioning to the Databricks environment or seeking to deepen their Databricks-based machine learning implementation skills. Whether you're an ambitious beginner or an experienced professional, this book provides clear pathways to success.

Practical MongoDB Aggregations

Dive into the capabilities of the MongoDB aggregation framework with this official guide, "Practical MongoDB Aggregations". You'll learn how to design and optimize efficient aggregation pipelines for MongoDB 7.0, empowering you to handle complex data analysis and processing tasks directly within the database. What this Book will help me do Gain expertise in crafting advanced MongoDB aggregation pipelines for custom data workflows. Learn to perform time series analysis for financial datasets and IoT applications. Discover optimization techniques for working with sharded clusters and large datasets. Master array manipulation and other specific operations essential for MongoDB data models. Build pipelines that ensure data security and distribution while maintaining performance. Author(s) Paul Done, a recognized expert in MongoDB, brings his extensive experience in database technologies to this book. With years of practice in helping companies leverage MongoDB for big data solutions, Paul shares his deep knowledge in an accessible and logical manner. His approach to writing is hands-on, focusing on practical insights and clear explanations. Who is it for? This book is tailored for intermediate-level developers, database architects, data analysts, engineers, and scientists who use MongoDB. If you are familiar with MongoDB and looking to expand your understanding specifically around its aggregation capabilities, this guide is for you. Whether you're analyzing time series data or need to optimize pipelines for performance, you'll find actionable tips and examples here to suit your needs.

Big Data Computing

This book primarily aims to provide an in-depth understanding of recent advances in big data computing technologies, methodologies, and applications along with introductory details of big data computing models such as Apache Hadoop, MapReduce, Hive, Pig, Mahout in-memory storage systems, NoSQL databases, and big data streaming services.

Deciphering Data Architectures

Data fabric, data lakehouse, and data mesh have recently appeared as viable alternatives to the modern data warehouse. These new architectures have solid benefits, but they're also surrounded by a lot of hyperbole and confusion. This practical book provides a guided tour of these architectures to help data professionals understand the pros and cons of each. James Serra, big data and data warehousing solution architect at Microsoft, examines common data architecture concepts, including how data warehouses have had to evolve to work with data lake features. You'll learn what data lakehouses can help you achieve, as well as how to distinguish data mesh hype from reality. Best of all, you'll be able to determine the most appropriate data architecture for your needs. With this book, you'll: Gain a working understanding of several data architectures Learn the strengths and weaknesses of each approach Distinguish data architecture theory from reality Pick the best architecture for your use case Understand the differences between data warehouses and data lakes Learn common data architecture concepts to help you build better solutions Explore the historical evolution and characteristics of data architectures Learn essentials of running an architecture design session, team organization, and project success factors Free from product discussions, this book will serve as a timeless resource for years to come.

Architecting a Modern Data Warehouse for Large Enterprises: Build Multi-cloud Modern Distributed Data Warehouses with Azure and AWS

Design and architect new generation cloud-based data warehouses using Azure and AWS. This book provides an in-depth understanding of how to build modern cloud-native data warehouses, as well as their history and evolution. The book starts by covering foundational data warehouse concepts, and introduces modern features such as distributed processing, big data storage, data streaming, and processing data on the cloud. You will gain an understanding of the synergy, relevance, and usage data warehousing standard practices in the modern world of distributed data processing. The authors walk you through the essential concepts of Data Mesh, Data Lake, Lakehouse, and Delta Lake. And they demonstrate the services and offerings available on Azure and AWS that deal with data orchestration, data democratization, data governance, data security, and business intelligence. After completing this book, you will be ready to design and architect enterprise-grade, cloud-based modern data warehouses using industry best practices and guidelines. What You Will Learn Understand the core concepts underlying modern data warehouses Design and build cloud-native data warehousesGain a practical approach to architecting and building data warehouses on Azure and AWS Implement modern data warehousing components such as Data Mesh, Data Lake, Delta Lake, and Lakehouse Process data through pandas and evaluate your model’s performance using metrics such as F1-score, precision, and recall Apply deep learning to supervised, semi-supervised, and unsupervised anomaly detection tasks for tabular datasets and time series applications Who This Book Is For Experienced developers, cloud architects, and technology enthusiasts looking to build cloud-based modern data warehouses using Azure and AWS

Data Exploration and Preparation with BigQuery

In "Data Exploration and Preparation with BigQuery," Michael Kahn provides a hands-on guide to understanding and utilizing Google's powerful data warehouse solution, BigQuery. This comprehensive book equips you with the skills needed to clean, transform, and analyze large datasets for actionable business insights. What this Book will help me do Master the process of exploring and assessing the quality of datasets. Learn SQL for performing efficient and advanced data transformations in BigQuery. Optimize the performance of BigQuery queries for speed and cost-effectiveness. Discover best practices for setting up and managing BigQuery resources. Apply real-world case studies to analyze data and derive meaningful insights. Author(s) Michael Kahn is an experienced data engineer and author specializing in big data solutions and technologies. With years of hands-on experience working with Google Cloud Platform and BigQuery, he has assisted organizations in optimizing their data pipelines for effective decision-making. His accessible writing style ensures complex topics become approachable, enabling readers of various skill levels to succeed. Who is it for? This book is tailored for data analysts, data engineers, and data scientists who want to learn how to effectively use BigQuery for data exploration and preparation. Whether you're new to BigQuery or looking to deepen your expertise in working with large datasets, this book provides clear guidance and practical examples to achieve your goals.

Delta Lake: Up and Running

With the surge in big data and AI, organizations can rapidly create data products. However, the effectiveness of their analytics and machine learning models depends on the data's quality. Delta Lake's open source format offers a robust lakehouse framework over platforms like Amazon S3, ADLS, and GCS. This practical book shows data engineers, data scientists, and data analysts how to get Delta Lake and its features up and running. The ultimate goal of building data pipelines and applications is to gain insights from data. You'll understand how your storage solution choice determines the robustness and performance of the data pipeline, from raw data to insights. You'll learn how to: Use modern data management and data engineering techniques Understand how ACID transactions bring reliability to data lakes at scale Run streaming and batch jobs against your data lake concurrently Execute update, delete, and merge commands against your data lake Use time travel to roll back and examine previous data versions Build a streaming data quality pipeline following the medallion architecture

AI for Big Data-Based Engineering Applications from Security Perspectives

This book emphasizes the idea of understanding the motivation of the advanced circuits’ design to establish the AI interface and to mitigate the security attacks in a better way for big data. It is for students, researchers, and professionals, faculty members and software developers who wish to carry out further research.

IBM FlashSystem 7300 Product Guide

This IBM® Redpaper Product Guide describes the IBM FlashSystem® 7300 solution, which is a next-generation IBM FlashSystem control enclosure. It combines the performance of flash and a Non-Volatile Memory Express (NVMe)-optimized architecture with the reliability and innovation of IBM FlashCore® technology and the rich feature set and high availability (HA) of IBM Spectrum® Virtualize. To take advantage of artificial intelligence (AI)-enhanced applications, real-time big data analytics, and cloud architectures that require higher levels of system performance and storage capacity, enterprises around the globe are rapidly moving to modernize established IT infrastructures. However, for many organizations, staff resources, and expertise are limited, and cost-efficiency is a top priority. These organizations have important investments in existing infrastructure that they want to maximize. They need enterprise-grade solutions that optimize cost-efficiency while simplifying the pathway to modernization. IBM FlashSystem 7300 is designed specifically for these requirements and use cases. It also delivers a cyber resilience without compromising application performance. IBM FlashSystem 7300 provides a rich set of software-defined storage (SDS) features that are delivered by IBM Spectrum Virtualize, including the following examples: Data reduction and deduplication Dynamic tiering Thin-provisioning Snapshots Cloning Replication and data copy services Cyber resilience Transparent Cloud Tiering (TCT) IBM HyperSwap® including 3-site replication for high availability Scale-out and scale-up configurations further enhance capacity and throughput for better availability With the release of IBM Spectrum Virtualize V8.5, extra functions and features are available, including support for new third-generation IBM FlashCore Modules Non-Volatile Memory Express (NVMe) type drives within the control enclosure, and 100 Gbps Ethernet adapters that provide NVMe Remote Direct Memory Access (RDMA) options. New software features include GUI enhancements, security enhancements including multifactor authentication and single sign-on, and Fibre Channel (FC) portsets.