O'Reilly Data Engineering Books

Fifty Years of Data Management and Beyond

2019-04-26 O'Reilly Amazon

book

Paco Nathan

data data-engineering Big Data Cloud Computing Data Management Data Science

Every decade since the 1960s, researchers at companies like IBM, Amazon, and many others have introduced major new frameworks and techniques to handle rising data management problems. This concise ebook explains how these new systems helped data science evolve quickly—from hierarchical and relational databases to big data and cloud computing to streaming and graph data. Computer scientist Paco Nathan shows members of your data science team how major companies created each of these data management systems not just to deal with new data types but also to take full advantage of the opportunities the data presented. Their efforts over the years have propelled an entire industry. This report covers the historical progression of data management topics including: Hierarchical databases—1960s mainframe batch systems are still used in finance, healthcare, manufacturing, energy, and other industries. Relational databases—these enabled faster transactions, mathematical optimization, and budgeting guarantees for many businesses. Big data—this includes relatively cheap horizontal scale-out systems for collecting huge amounts of customer data. Cloud computing—large companies began managing reliable, scalable, cost-effective data centers; Amazon turned the concept into a business. Cluster schedulers—managing horizontal clusters was difficult before schedulers such as Apache Mesos appeared. Streaming data—data continuously generated by different sources requires responses in "real time"—generally milliseconds.

Mastering MongoDB 4.x - Second Edition

2019-03-30 O'Reilly Amazon

book

Alex Giamas

data data-engineering nosql-databases MongoDB Big Data Cloud Computing

This book, Mastering MongoDB 4.x, provides an in-depth exploration of MongoDB's features and capabilities, empowering readers to create high-performance and fault-tolerant database solutions. Through practical examples and clear explanations, you will learn how to implement complex queries, optimize database performance, manage large-scale clusters, and ensure robust failover and backup strategies. What this Book will help me do Understand advanced querying techniques and best practices in data indexing and management. Effectively configure and monitor MongoDB instances for scalability and optimized performance. Master techniques for replication and sharding to support high-availability systems. Deploy MongoDB-based applications seamlessly across on-premise and cloud environments. Learn to integrate MongoDB with modern technologies like big data platforms, containers, and IoT applications. Author(s) Alex Giamas is a seasoned database administrator and developer with significant experience in working with both relational and non-relational databases. Having authored numerous articles and given lectures on MongoDB and other data management technologies, Alex brings practical insights to his writing. He emphasizes real-world applications with examples drawn from his extensive career. Who is it for? This book is designed for developers and database administrators already familiar with MongoDB and basic database concepts, who are looking to enhance their expertise for implementing advanced MongoDB solutions. It is also suitable for professionals aspiring to earn MongoDB certifications and expand their skills to manage large, high-performance database systems efficiently.

Hands-On Big Data Analytics with PySpark

2019-03-29 O'Reilly Amazon

book

Bartłomiej Potaczek , Rudy Lai

data data-engineering apache-spark PySpark Analytics Big Data

Dive into the exciting world of big data analytics with 'Hands-On Big Data Analytics with PySpark'. This practical guide offers you the tools and knowledge to tackle massive datasets using PySpark. By exploring real-world examples, you'll learn to unleash the power of distributed systems to analyze and manipulate data at scale. What this Book will help me do Master using PySpark to handle large and complex datasets efficiently and effectively. Develop skills to optimize Spark programs using best practices like reducing shuffle operations. Learn to set up a PySpark environment, process data from platforms like HDFS, Hive, and S3. Enhance your data analytics capabilities by implementing powerful SQL queries and data visualizations. Understand testing and debugging techniques to build reliable, production-quality data pipelines. Author(s) Authored by Rudy Lai and Bartłomiej Potaczek, both seasoned data engineers and authors in the big data field. Rudy and Bartłomiej bring their extensive experience working with distributed systems and scalable data architectures into this book. Their approach is hands-on, focusing on real-world applications and best practices. Who is it for? This book is tailored for data scientists, engineers, and developers eager to advance their big data analytics capabilities. Whether you're new to big data or experienced with other analytics frameworks, this book will equip you with practical knowledge to utilize PySpark for scalable data solutions.

Data Lake Maturity Model

2019-03-25 O'Reilly Amazon

book

Scott Gidley , Andy Oram

data data-engineering storage-repositories data-lake Analytics Big Data

Data is changing everything. Many industries today are being fundamentally transformed through the accumulation and analysis of large quantities of data, stored in diversified but flexible repositories known as data lakes. Whether your company has just begun to think about big data or has already initiated a strategy for handling it, this practical ebook shows you how to plan a successful data lake migration. You’ll learn the value of data lakes, their structure, and the problems they attempt to solve. Using Zaloni’s data lake maturity model, you’ll then explore your organization’s readiness for putting a data lake into action. Do you have the tools and data architectures to support big data analysis? Are your people and processes prepared? The data lake maturity model will help you rate your organization’s readiness. This report includes: The structure and purpose of a data lake Descriptive, predictive, and prescriptive analytics Data lake curation, self-service, and the use of data lake zones How to rate your organization using the data lake maturity model A complete checklist to help you determine your strategic path forward

AI and Big Data on IBM Power Systems Servers

2019-03-22 O'Reilly Amazon

book

Rafael Freitas de Lima , Ivaylo B. Bozhinov , Scott Vetter , Anto A John , Ahmed Mashhour , James Van Oosten , Fernando Vermelho , Allison White

data data-engineering IBM ibm-power-systems AI/ML Analytics

Abstract As big data becomes more ubiquitous, businesses are wondering how they can best leverage it to gain insight into their most important business questions. Using machine learning (ML) and deep learning (DL) in big data environments can identify historical patterns and build artificial intelligence (AI) models that can help businesses to improve customer experience, add services and offerings, identify new revenue streams or lines of business (LOBs), and optimize business or manufacturing operations. The power of AI for predictive analytics is being harnessed across all industries, so it is important that businesses familiarize themselves with all of the tools and techniques that are available for integration with their data lake environments. In this IBM® Redbooks® publication, we cover the best practices for deploying and integrating some of the best AI solutions on the market, including: IBM Watson Machine Learning Accelerator (see note for product naming) IBM Watson Studio Local IBM Power Systems™ IBM Spectrum™ Scale IBM Data Science Experience (IBM DSX) IBM Elastic Storage™ Server Hortonworks Data Platform (HDP) Hortonworks DataFlow (HDF) H2O Driverless AI We map out all the integrations that are possible with our different AI solutions and how they can integrate with your existing or new data lake. We also walk you through some of our client use cases and show you how some of the industry leaders are using Hortonworks, IBM PowerAI, and IBM Watson Studio Local to drive decision making. We also advise you on your deployment options, when to use a GPU, and why you should use the IBM Elastic Storage Server (IBM ESS) to improve storage management. Lastly, we describe how to integrate IBM Watson Machine Learning Accelerator and Hortonworks with or without IBM Watson Studio Local, how to access real-time data, and security. Note: IBM Watson Machine Learning Accelerator is the new product name for IBM PowerAI Enterprise. 
Note: Hortonworks merged with Cloudera in January 2019. The new company is called Cloudera. References to Hortonworks as a business entity in this publication are now referring to the merged company. Product names beginning with Hortonworks continue to be marketed and sold under their original names.

The Enterprise Big Data Lake

2019-03-11 O'Reilly Amazon

book

Alex Gorelik

data data-engineering storage-repositories data-lake Big Data Data Lake

The data lake is a daring new approach for harnessing the power of big data technology and providing convenient self-service capabilities. But is it right for your company? This book is based on discussions with practitioners and executives from more than a hundred organizations, ranging from data-driven companies such as Google, LinkedIn, and Facebook, to governments and traditional corporate enterprises. You’ll learn what a data lake is, why enterprises need one, and how to build one successfully with the best practices in this book. Alex Gorelik, CTO and founder of Waterline Data, explains why old systems and processes can no longer support data needs in the enterprise. Then, in a collection of essays about data lake implementation, you’ll examine data lake initiatives, analytic projects, experiences, and best practices from data experts working in various industries. Get a succinct introduction to data warehousing, big data, and data science Learn various paths enterprises take to build a data lake Explore how to build a self-service model and best practices for providing analysts access to the data Use different methods for architecting your data lake Discover ways to implement a data lake from experts in different industries

Mastering Hadoop 3

2019-02-28 O'Reilly Amazon

book

Timothy Wong , Chanchal Singh , Manish Kumar

data data-engineering Hadoop Flink Big Data Data Engineering

"Mastering Hadoop 3" is your in-depth guide to understanding and mastering the advanced features of the Hadoop ecosystem. With a focus on distributed computing and data processing, this book covers essential tools such as YARN, MapReduce, and Apache Spark to help you build scalable, efficient data pipelines. What this Book will help me do Gain a comprehensive understanding of Hadoop Distributed File System (HDFS) and YARN for effective resource management. Master data processing with MapReduce and learn to integrate with real-time processing engines like Spark and Flink. Develop and secure enterprise-grade Hadoop-based data pipelines by implementing robust security and governance measures. Explore techniques for batch data processing, data modeling, and designing applications tailored for Hadoop environments. Understand best practices for optimizing and troubleshooting Hadoop clusters for enhanced performance and reliability. Author(s) The authors, Timothy Wong, Chanchal Singh, and Manish Kumar, bring together years of experience in big data engineering, distributed systems, and enterprise application development. They aim to provide a clear pathway to mastering Hadoop ecosystem tools. Who is it for? This book is ideal for budding big data professionals who have some familiarity with Java and basic Hadoop concepts and wish to elevate their expertise. If you're a Hadoop career practitioner keen to expand your understanding of the ecosystem's advanced capabilities or a professional looking to implement Hadoop in organizational workflows, this book is well-suited for you.

IBM Elastic Storage Server Implementation Guide for Version 5.3

2019-02-05 O'Reilly Amazon

book

Kiran Ghag , Ravindra Sure , Vasfi Gucer , Nikhil Khandelwal , Poornima Gupte , Puneet Chaudhary , Luis Bolinches

data data-engineering IBM Big Data Cloud Computing ELK

This IBM® Redpaper™ publication introduces and describes the IBM Elastic Storage™ Server as a scalable, high-performance data and file management solution. The solution is built on proven IBM Spectrum™ Scale technology, formerly IBM General Parallel File System (GPFS™). IBM Elastic Storage Servers can be implemented for a range of diverse requirements, providing reliability, performance, and scalability. This publication helps you to understand the solution and its architecture and helps you to plan the installation and integration of the environment. The following combination of physical and logical components are required: Hardware Operating system Storage Network Applications This paper provides guidelines for several usage and integration scenarios. Typical scenarios include Cluster Export Services (CES) integration, disaster recovery, and multicluster integration. This paper addresses the needs of technical professionals (consultants, technical support staff, IT Architects, and IT Specialists) who must deliver cost-effective cloud services and big data solutions.

Apache Spark Quick Start Guide

2019-01-31 O'Reilly Amazon

book

Akash Grade , Shrey Mehrotra

data data-engineering apache-spark AI/ML API Big Data

Dive into the world of scalable data processing with the "Apache Spark Quick Start Guide." This book offers a foundational introduction to Spark, empowering readers to harness its capabilities for big data processing. With clear explanations and hands-on examples, you'll learn to implement Spark applications that handle complex data tasks efficiently. What this Book will help me do Understand and implement Spark's RDDs and DataFrame APIs to process large datasets effectively. Set up a local development environment for Spark-based projects. Develop skills to debug and optimize slow-performing Spark applications. Harness built-in modules of Spark for SQL, streaming, and machine learning applications. Adopt best practices and optimization techniques for high-performance Spark applications. Author(s) Shrey Mehrotra is a seasoned software developer with expertise in big data technologies, particularly Apache Spark. With years of hands-on industry experience, Shrey focuses on making complex technical concepts accessible to all. Through his writing, he aims to share clear, practical guidance for developers of all levels. Who is it for? This guide is perfect for big data enthusiasts and professionals looking to learn Apache Spark's capabilities from scratch. It's aimed at data engineers interested in optimizing application performance and data scientists wanting to integrate machine learning with Spark. A basic familiarity with either Scala, Python, or Java is recommended.

Machine Learning with Apache Spark Quick Start Guide

2018-12-26 O'Reilly Amazon

book

Jillur Quddus

data data-engineering apache-spark AI/ML Analytics Big Data

"Machine Learning with Apache Spark Quick Start Guide" introduces you to the fundamental concepts and tools needed to harness the power of Apache Spark for data processing and machine learning. This book combines practical examples and real-world scenarios to show you how to manage big data efficiently while uncovering actionable insights through advanced analytics. What this Book will help me do Understand the role of Apache Spark in the big data ecosystem. Set up and configure an Apache Spark development environment. Learn and implement supervised and unsupervised learning models using Spark MLlib. Apply advanced analytical algorithms to real-world big data problems. Develop and deploy real-time machine learning pipelines with Apache Spark. Author(s) Jillur Quddus is an experienced practitioner in the fields of big data, distributed technologies, and machine learning. With a career dedicated to using advanced analytics to solve real-world problems, Quddus brings practical expertise to each topic addressed. Their approachable writing style ensures readers can apply concepts effectively, even in complex scenarios. Who is it for? This book is ideal for business analysts, data analysts, and data scientists who are eager to gain hands-on experience with big data technologies. Whether you are new to Apache Spark or looking to expand your knowledge of its machine learning capabilities, this guide provides the tools and insights necessary to achieve those goals. Technical professionals wanting to develop their skills in processing and analyzing big data will find this resource invaluable.

Fast Data Architectures for Streaming Applications, 2nd Edition

2018-12-25 O'Reilly Amazon

book

Dean Wampler

data data-engineering streaming-messaging Kafka Flink Big Data

Why have stream-oriented data systems become so popular, when batch-oriented systems have served big data needs for many years? In the updated edition of this report, Dean Wampler examines the rise of streaming systems for handling time-sensitive problems—such as detecting fraudulent financial activity as it happens. You’ll explore the characteristics of fast data architectures, along with several open source tools for implementing them. Batch processing isn’t going away, but exclusive use of these systems is now a competitive disadvantage. You’ll learn that, while fast data architectures using tools such as Kafka, Akka, Spark, and Flink are much harder to build, they represent the state of the art for dealing with mountains of data that require immediate attention. Learn how a basic fast data architecture works, step-by-step Examine how Kafka’s data backplane combines the best abstractions of log-oriented and message queue systems for integrating components Evaluate four streaming engines, including Kafka Streams, Akka Streams, Spark, and Flink Learn which streaming engines work best for different use cases Get recommendations for making real-world streaming systems responsive, resilient, elastic, and message driven Explore an example IoT streaming application that includes telemetry ingestion and anomaly detection

Apache Spark 2: Data Processing and Real-Time Analytics

2018-12-21 O'Reilly Amazon

book

Romeo Kienzler , Sridhar Alla , Md. Rezaul Karim , Siamak Amirghodsi

data data-engineering apache-spark AI/ML Analytics Big Data

Build efficient data flow and machine learning programs with this flexible, multi-functional open-source cluster-computing framework Key Features Master the art of real-time big data processing and machine learning Explore a wide range of use-cases to analyze large data Discover ways to optimize your work by using many features of Spark 2.x and Scala Book Description Apache Spark is an in-memory, cluster-based data processing system that provides a wide range of functionalities such as big data processing, analytics, machine learning, and more. With this Learning Path, you can take your knowledge of Apache Spark to the next level by learning how to expand Spark's functionality and building your own data flow and machine learning programs on this platform. You will work with the different modules in Apache Spark, such as interactive querying with Spark SQL, using DataFrames and datasets, implementing streaming analytics with Spark Streaming, and applying machine learning and deep learning techniques on Spark using MLlib and various external tools. By the end of this elaborately designed Learning Path, you will have all the knowledge you need to master Apache Spark, and build your own big data processing and analytics pipeline quickly and without any hassle. This Learning Path includes content from the following Packt products: Mastering Apache Spark 2.x by Romeo Kienzler; Scala and Spark for Big Data Analytics by Md. Rezaul Karim, Sridhar Alla; Apache Spark 2.x Machine Learning Cookbook by Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, Shuen Mei. What you will learn Get to grips with all the features of Apache Spark 2.x Perform highly optimized real-time big data processing Use ML and DL techniques with Spark MLlib and third-party tools Analyze structured and unstructured data using SparkSQL and GraphX Understand tuning, debugging, and monitoring of big data applications Build scalable and fault-tolerant streaming applications Develop scalable recommendation engines Who this book is for If you are an intermediate-level Spark developer looking to master the advanced capabilities and use-cases of Apache Spark 2.x, this Learning Path is ideal for you. Big data professionals who want to learn how to integrate and use the features of Apache Spark and build a strong big data pipeline will also find this Learning Path useful. To grasp the concepts explained in this Learning Path, you must know the fundamentals of Apache Spark and Scala.

Dynamic Oracle Performance Analytics: Using Normalized Metrics to Improve Database Speed

2018-12-06 O'Reilly Amazon

book

Roger Cornejo

data data-engineering oracle-database-solutions Analytics Big Data Oracle

Use an innovative approach that relies on big data and advanced analytical techniques to analyze and improve Oracle Database performance. The approach used in this book represents a step-change paradigm shift away from traditional methods. Instead of relying on a few hand-picked, favorite metrics, or wading through multiple specialized tables of information such as those found in an automatic workload repository (AWR) report, you will draw on all available data, applying big data methods and analytical techniques to help the performance tuner draw impactful, focused performance improvement conclusions. This book briefly reviews past and present practices, along with available tools, to help you recognize areas where improvements can be made. The book then guides you through a step-by-step method that can be used to take advantage of all available metrics to identify problem areas and work toward improving them. The method presented simplifies the tuning process and solves the problem of metric overload. You will learn how to: collect and normalize data, generate deltas that are useful in performing statistical analysis, create and use a taxonomy to enhance your understanding of problem performance areas in your database and its applications, and create a root cause analysis report that enables understanding of a specific performance problem and its likely solutions. 
What You'll Learn Collect and prepare metrics for analysis from a wide array of sources Apply statistical techniques to select relevant metrics Create a taxonomy to provide additional insight into problem areas Provide a metrics-based root cause analysis regarding the performance issue Generate an actionable tuning plan prioritized according to problem areas Monitor performance using database-specific normal ranges Who This Book Is For Professional tuners: responsible for maintaining the efficient operation of large-scale databases who wish to focus on analysis, who want to expand their repertoire to include a big data methodology and use metrics without being overwhelmed, who desire to provide accurate root cause analysis and avoid the cyclical fix-test cycles that are inevitable when speculation is used

Hands-On Big Data Modeling

2018-11-30 O'Reilly Amazon

book

Tao Wei , Suresh Kumar Mukhiya , James Lee

data data-engineering data-models BI Big Data Data Management

This book, Hands-On Big Data Modeling, provides you with practical guidance on data modeling techniques, focusing particularly on the challenges of big data. You will learn the concepts behind various data models, explore tools and platforms for efficient data management, and gain hands-on experience with structured and unstructured data. What this Book will help me do Master the fundamental concepts of big data and its challenges. Explore advanced data modeling techniques using SQL, Python, and R. Design effective models for structured, semi-structured, and unstructured data types. Apply data modeling to real-world datasets like social media and sensor data. Optimize data models for performance and scalability in various big data platforms. Author(s) The authors of this book are experienced data architects and engineers with a strong background in developing scalable data solutions. They bring their collective expertise to simplify complex concepts in big data modeling, ensuring readers can effectively apply these techniques in their projects. Who is it for? This book is intended for data architects, business intelligence professionals, and any programmer interested in understanding and applying big data modeling concepts. If you are already familiar with basic data management principles and want to enhance your skills, this book is perfect for you. You will learn to tackle real-world datasets and create scalable models. Additionally, it is suitable for professionals transitioning to working with big data frameworks.

Hands-On Data Science with SQL Server 2017

2018-11-29 O'Reilly Amazon

book

Vladimír Mužný , Marek Chmel

data data-engineering SQL Analytics Azure BI

In "Hands-On Data Science with SQL Server 2017," you will discover how to implement end-to-end data analysis workflows, leveraging SQL Server's robust capabilities. This book guides you through collecting, cleaning, and transforming data, querying for insights, creating compelling visualizations, and even constructing predictive models for sophisticated analytics. What this Book will help me do Grasp the essential data science processes and how SQL Server supports them. Conduct data analysis and create interactive visualizations using Power BI. Build, train, and assess predictive models using SQL Server tools. Integrate SQL Server with R, Python, and Azure for enhanced functionality. Apply best practices for managing and transforming big data with SQL Server. Author(s) Marek Chmel and Vladimír Mužný bring their extensive experience in data science and database management to this book. Marek is a seasoned database specialist with a strong background in SQL, while Vladimír is known for his instructional expertise in analytics and data manipulation. Together, they focus on providing actionable insights and practical examples tailored for data professionals. Who is it for? This book is an ideal resource for aspiring and seasoned data scientists, data analysts, and database professionals aiming to deepen their expertise in SQL Server for data science workflows. Beginners with fundamental SQL knowledge will find it a guided entry into data science applications. It is especially suited for those who aim to implement data-driven solutions in their roles while leveraging SQL's capabilities.

Apache Hadoop 3 Quick Start Guide

2018-10-31 O'Reilly Amazon

book

Hrishikesh Vijay Karambelkar

data data-engineering Hadoop Analytics Big Data Data Analytics

Dive into the world of distributed data processing with the 'Apache Hadoop 3 Quick Start Guide.' This comprehensive resource equips you with the knowledge needed to handle large datasets effectively using Apache Hadoop. Learn how to set up and configure Hadoop, work with its core components, and explore its powerful ecosystem tools. What this Book will help me do Understand the fundamental concepts of Apache Hadoop, including HDFS, MapReduce, and YARN, and use them to store and process large datasets. Set up and configure Hadoop 3 in both developer and production environments to suit various deployment needs. Gain hands-on experience with Hadoop ecosystem tools like Hive, Kafka, and Spark to enhance your big data processing capabilities. Learn to manage, monitor, and troubleshoot Hadoop clusters efficiently to ensure smooth operations. Analyze real-time streaming data with tools like Apache Storm and perform advanced data analytics using Apache Spark. Author(s) The author of this guide, Vijay Karambelkar, brings years of experience working with big data technologies and Apache Hadoop in real-world applications. With a passion for teaching and simplifying complex topics, Vijay has compiled his expertise to help learners confidently approach Hadoop 3. His detailed, example-driven approach makes this book a practical resource for aspiring data professionals. Who is it for? This book is ideal for software developers, data engineers, and IT professionals who aspire to dive into the field of big data. If you're new to Apache Hadoop or looking to upgrade your skills to include version 3, this guide is for you. A basic understanding of Java programming is recommended to make the most of the topics covered. Embark on this journey to enhance your career in data-intensive industries.

Mastering Apache Cassandra 3.x - Third Edition

2018-10-31 O'Reilly Amazon

book

Tejaswi Malepati , Aaron Ploetz

data data-engineering nosql-databases Cassandra Analytics Big Data

This expert guide, "Mastering Apache Cassandra 3.x," is designed for individuals looking to achieve scalable and fault-tolerant database deployment using Apache Cassandra. From mastering the foundational components of Cassandra architecture to advanced topics like clustering and analytics integration with Apache Spark, this book equips readers with practical, actionable skills. What this Book will help me do Understand and deploy Apache Cassandra clusters for fault-tolerant and scalable databases. Use advanced features of CQL3 to streamline database queries and operations. Optimize and configure Cassandra nodes to improve performance for demanding applications. Monitor and manage Cassandra clusters effectively using best practices. Combine Cassandra with Apache Spark to build robust data analytics pipelines. Author(s) Aaron Ploetz and Tejaswi Malepati are experienced technologists and software professionals with extensive expertise in distributed database systems and big data algorithms. They've combined their industry knowledge and teaching backgrounds to create accessible and practical guides for learners worldwide. Their collaborative work is focused on demystifying complex systems for maximum learning impact. Who is it for? This book is ideal for database administrators, software developers, and big data specialists seeking to expand their skill set into scalable data storage using Cassandra. Readers should have a basic understanding of database concepts and some programming experience. If you're looking to design robust databases optimized for modern big data use-cases, this book will serve as a valuable resource.

Beginning Apache Spark 2: With Resilient Distributed Datasets, Spark SQL, Structured Streaming and Spark Machine Learning library

2018-08-16 O'Reilly Amazon

book

Hien Luu

data data-engineering apache-spark AI/ML Analytics Big Data

Develop applications for the big data landscape with Spark and Hadoop. This book also explains the role of Spark in developing scalable machine learning and analytics applications with Cloud technologies. Beginning Apache Spark 2 gives you an introduction to Apache Spark and shows you how to work with it. Along the way, you’ll discover resilient distributed datasets (RDDs); use Spark SQL for structured data; and learn stream processing and build real-time applications with Spark Structured Streaming. Furthermore, you’ll learn the fundamentals of Spark ML for machine learning and much more. After you read this book, you will have the fundamentals to become proficient in using Apache Spark and know when and how to apply it to your big data applications. What You Will Learn Understand the Spark unified data processing platform How to run Spark in Spark Shell or Databricks Use and manipulate RDDs Deal with structured data using Spark SQL through its operations and advanced functions Build real-time applications using Spark Structured Streaming Develop intelligent applications with the Spark Machine Learning library Who This Book Is For Programmers and developers active in big data, Hadoop, and Java but who are new to the Apache Spark platform.

Streaming Systems

2018-07-23 O'Reilly Amazon

book

Slava Chernyak, Reuven Lax, Tyler Akidau

data data-engineering streaming-messaging streaming-architecture Big Data SQL

Streaming data is a big deal in big data these days. As more and more businesses seek to tame the massive unbounded data sets that pervade our world, streaming systems have finally reached a level of maturity sufficient for mainstream adoption. With this practical guide, data engineers, data scientists, and developers will learn how to work with streaming data in a conceptual and platform-agnostic way. Expanded from Tyler Akidau’s popular blog posts "Streaming 101" and "Streaming 102", this book takes you from an introductory level to a nuanced understanding of the what, where, when, and how of processing real-time data streams. You’ll also dive deep into watermarks and exactly-once processing with co-authors Slava Chernyak and Reuven Lax. You’ll explore: How streaming and batch data processing patterns compare The core principles and concepts behind robust out-of-order data processing How watermarks track progress and completeness in infinite datasets How exactly-once data processing techniques ensure correctness How the concepts of streams and tables form the foundations of both batch and streaming data processing The practical motivations behind a powerful persistent state mechanism, driven by a real-world example How time-varying relations provide a link between stream processing and the world of SQL and relational algebra

Apache Spark Deep Learning Cookbook

2018-07-13 O'Reilly Amazon

book

Ahmed Sherif, Amrith Ravindra, Michal Malohlava, Adnan Masood

data data-engineering apache-spark AI/ML Big Data Keras

Embark on a journey to master distributed deep learning with the "Apache Spark Deep Learning Cookbook". Designed specifically for leveraging the capabilities of Apache Spark, TensorFlow, and Keras, this book offers over 80 problem-solving recipes to efficiently train and deploy state-of-the-art neural networks, addressing real-world AI challenges. What this Book will help me do Set up and configure a working Apache Spark environment optimized for deep learning tasks. Implement distributed training practices for deep learning models using TensorFlow and Keras. Develop and test neural networks such as CNNs and RNNs targeting specific big data problems. Apply Spark's built-in libraries and integrations for enhanced NLP and computer vision applications. Effectively manage and preprocess large datasets using Spark DataFrames for machine learning tasks. Author(s) Authors Ahmed Sherif and Amrith Ravindra bring years of experience in deep learning, Apache Spark use cases, and hands-on practical training. Their collective expertise has contributed to designing this cookbook approach, focusing on clarity and usability for readers tackling challenging machine learning scenarios. Who is it for? This book is ideal for IT professionals, data scientists, and software developers with foundational understanding of machine learning concepts and Apache Spark framework capabilities. If you aim to scale deep learning and integrate efficient computing with Spark's power, this guide is for you. Familiarity with Python will help maximize the book's potential.

Apache Hive Essentials - Second Edition

2018-06-30 O'Reilly Amazon

book

Dayong Du

data data-engineering Hadoop apache-hive Big Data Hive

"Apache Hive Essentials" provides a focused guide to mastering the essential techniques of processing and analyzing big data with Apache Hive. What this Book will help me do Set up and configure a Hive environment for big data analysis. Compose effective queries using Hive's SQL-like language to extract insights. Optimize Hive performance to handle complex datasets efficiently. Implement data security and user-defined functions to extend capabilities. Integrate Hive with Hadoop tools for comprehensive data solutions. Author(s) Dayong Du, the author of "Apache Hive Essentials," has years of experience working with big data technologies and tools. With hands-on expertise in Hadoop and the entire ecosystem, he brings a practical and informed perspective to this complex field. His approach is to make these technologies accessible to developers and analysts of all levels. Who is it for? This book is perfect for data analysts, developers, or professionals familiar with SQL who are looking to start with Apache Hive for big data processing. It is suitable for those acquainted with Hadoop and its environment and want to expand their skills into efficient data querying and management. Readers should have an interest in how to leverage big data tools for real-world solutions.

PySpark Cookbook

2018-06-29 O'Reilly Amazon

book

Denny Lee, Tomasz Drabas

data data-engineering apache-spark PySpark AI/ML Analytics

Dive into the world of big data processing and analytics with the "PySpark Cookbook". This book provides over 60 hands-on recipes for implementing efficient data-intensive solutions using Apache Spark and Python. By mastering these recipes, you'll be equipped to tackle challenges in large-scale data processing, machine learning, and stream analytics. What this Book will help me do Set up and configure PySpark environments effectively, including working with Jupyter for enhanced interactivity. Understand and utilize DataFrames for data manipulation, analysis, and transformation tasks. Develop end-to-end machine learning solutions using the ML and MLlib modules in PySpark. Implement structured streaming and graph-processing solutions to analyze and visualize data streams and relationships. Deploy PySpark applications to the cloud infrastructure efficiently using best practices. Author(s) This book is co-authored by Denny Lee and Tomasz Drabas, who are experienced professionals in data processing and analytics leveraging Python and Apache Spark. With their deep technical expertise and a passion for teaching through practical examples, they aim to make the complex concepts of PySpark accessible to developers of varied experience levels. Who is it for? This book is ideal for Python developers who are keen to delve into the Apache Spark ecosystem. Whether you're just starting with big data or have some experience with Spark, this book provides practical recipes to enhance your skills. Readers looking to solve real-world data-intensive challenges using PySpark will find this resource invaluable.

Practical Enterprise Data Lake Insights: Handle Data-Driven Challenges in an Enterprise Big Data Lake

2018-06-27 O'Reilly Amazon

book

Saurabh Gupta, Venkata Giri

data data-engineering storage-repositories data-lake Big Data Data Lake

Use this practical guide to successfully handle the challenges encountered when designing an enterprise data lake and learn industry best practices to resolve issues. When designing an enterprise data lake you often hit a roadblock when you must leave the comfort of the relational world and learn the nuances of handling non-relational data. Starting from sourcing data into the Hadoop ecosystem, you will go through stages that can bring up tough questions such as data processing, data querying, and security. Concepts such as change data capture and data streaming are covered. The book takes an end-to-end solution approach in a data lake environment that includes data security, high availability, data processing, data streaming, and more. Each chapter includes application of a concept, code snippets, and use case demonstrations to provide you with a practical approach. You will learn the concept, scope, application, and starting point. What You'll Learn Get to know data lake architecture and design principles Implement data capture and streaming strategies Implement data processing strategies in Hadoop Understand the data lake security framework and availability model Who This Book Is For Big data architects and solution architects

Big Data Architect's Handbook

2018-06-21 O'Reilly Amazon

book

Syed Muhammad Fahad Akhtar

data data-engineering Hadoop AI/ML Big Data Cloud Computing

Big Data Architect's Handbook is your comprehensive guide to mastering the art of building sophisticated big data solutions. As you delve into this book, you'll learn to design end-to-end big data pipelines and integrate data from various sources for insightful analysis. What this Book will help me do Understand the Hadoop ecosystem and familiarize yourself with major Apache projects. Make informed decisions when designing cloud infrastructures for big data needs. Gain expertise in analyzing structured and unstructured data using machine learning. Develop skills to implement scalable and efficient big data pipelines. Enhance your ability to visualize and monitor data insights effectively. Author(s) Syed Muhammad Fahad Akhtar has amassed a wealth of experience in big data architecture and related technologies. With years of hands-on involvement in development, analysis, and implementation of big data systems, Akhtar brings a pragmatic and insightful perspective. This passion for educating others about data-driven technologies shines through in a user-first approach to making complex topics accessible. Who is it for? This book caters to aspiring data professionals, software developers, and tech enthusiasts aiming to enhance their expertise in big data. Readers with basic programming and data analysis skills will find the content approachable yet challenging enough to deepen their understanding. If your career goal involves managing, analyzing, and making decisions based on large datasets, this book will help bridge the gap between skill and application.

Next-Generation Big Data: A Practical Guide to Apache Kudu, Impala, and Spark

2018-06-12 O'Reilly Amazon

book

Butch Quinto

data data-engineering Alteryx Analytics BI Big Data

Utilize this practical and easy-to-follow guide to modernize traditional enterprise data warehouse and business intelligence environments with next-generation big data technologies. Next-Generation Big Data takes a holistic approach, covering the most important aspects of modern enterprise big data. The book covers not only the main technology stack but also the next-generation tools and applications used for big data warehousing, data warehouse optimization, real-time and batch data ingestion and processing, real-time data visualization, big data governance, data wrangling, big data cloud deployments, and distributed in-memory big data computing. Finally, the book has extensive and detailed coverage of big data case studies from Navistar, Cerner, British Telecom, Shopzilla, Thomson Reuters, and Mastercard. What You’ll Learn Install Apache Kudu, Impala, and Spark to modernize enterprise data warehouse and business intelligence environments, complete with real-world, easy-to-follow examples, and practical advice Integrate HBase, Solr, Oracle, SQL Server, MySQL, Flume, Kafka, HDFS, and Amazon S3 with Apache Kudu, Impala, and Spark Use StreamSets, Talend, Pentaho, and CDAP for real-time and batch data ingestion and processing Utilize Trifacta, Alteryx, and Datameer for data wrangling and interactive data processing Turbocharge Spark with Alluxio, a distributed in-memory storage platform Deploy big data in the cloud using Cloudera Director Perform real-time data visualization and time series analysis using Zoomdata, Apache Kudu, Impala, and Spark Understand enterprise big data topics such as big data governance, metadata management, data lineage, impact analysis, and policy enforcement, and how to use Cloudera Navigator to perform common data governance tasks Implement big data use cases such as big data warehousing, data warehouse optimization, Internet of Things, real-time data ingestion and analytics, complex event processing, and scalable predictive modeling Study real-world big data case studies from innovative companies, including Navistar, Cerner, British Telecom, Shopzilla, Thomson Reuters, and Mastercard Who This Book Is For BI and big data warehouse professionals interested in gaining practical and real-world insight into next-generation big data processing and analytics using Apache Kudu, Impala, and Spark; and those who want to learn more about other advanced enterprise topics
