talk-data.com

Event

O'Reilly Data Engineering Books

2001-10-19 – 2027-05-25 · O'Reilly

Activities tracked

395

Collection of O'Reilly books on Data Engineering.

Filtering by: Analytics

Sessions & talks

Showing 176–200 of 395 · Newest first

Dynamic Oracle Performance Analytics: Using Normalized Metrics to Improve Database Speed

Use an innovative approach that relies on big data and advanced analytical techniques to analyze and improve Oracle Database performance. The approach used in this book represents a step-change paradigm shift away from traditional methods. Instead of relying on a few hand-picked, favorite metrics, or wading through multiple specialized tables of information such as those found in an automatic workload repository (AWR) report, you will draw on all available data, applying big data methods and analytical techniques to help the performance tuner draw impactful, focused performance improvement conclusions. This book briefly reviews past and present practices, along with available tools, to help you recognize areas where improvements can be made. The book then guides you through a step-by-step method that can be used to take advantage of all available metrics to identify problem areas and work toward improving them. The method presented simplifies the tuning process and solves the problem of metric overload. You will learn how to: collect and normalize data, generate deltas that are useful in performing statistical analysis, create and use a taxonomy to enhance your understanding of problem performance areas in your database and its applications, and create a root cause analysis report that enables understanding of a specific performance problem and its likely solutions. 
What You'll Learn Collect and prepare metrics for analysis from a wide array of sources Apply statistical techniques to select relevant metrics Create a taxonomy to provide additional insight into problem areas Provide a metrics-based root cause analysis regarding the performance issue Generate an actionable tuning plan prioritized according to problem areas Monitor performance using database-specific normal ranges Who This Book Is For Professional tuners responsible for maintaining the efficient operation of large-scale databases who wish to focus on analysis, who want to expand their repertoire to include a big data methodology and use metrics without being overwhelmed, and who want to provide accurate root cause analysis and avoid the speculative fix-and-test cycles that are otherwise inevitable
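The method outlined above, collecting cumulative metrics, generating deltas, and normalizing them so that metrics on very different scales become comparable, can be sketched in a few lines of Python. This is an illustrative sketch, not code from the book; the sample values, the z-score normalization, and the 1.5 threshold are all assumptions made for the example.

```python
from statistics import mean, stdev

# Cumulative counter samples for one hypothetical metric
# (e.g. physical reads), taken at successive snapshot intervals.
samples = [100, 200, 310, 400, 520, 600, 710, 800, 5000]

# Step 1: generate deltas. Cumulative counters are only meaningful
# as per-interval differences.
deltas = [b - a for a, b in zip(samples, samples[1:])]

# Step 2: normalize each delta as a z-score against the metric's own
# history, so metrics with wildly different scales become comparable.
mu, sigma = mean(deltas), stdev(deltas)
z_scores = [(d - mu) / sigma for d in deltas]

# Step 3: flag intervals that fall outside a "normal range".
outliers = [i for i, z in enumerate(z_scores) if abs(z) > 1.5]

print(deltas)    # [100, 110, 90, 120, 80, 110, 90, 4200]
print(outliers)  # [7] -- only the last interval is anomalous
```

Because every metric is reduced to the same normalized scale, one outlier test can be applied across all of them at once, which is how the approach sidesteps the "metric overload" problem the book describes.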

Hands-On Data Science with SQL Server 2017

In "Hands-On Data Science with SQL Server 2017," you will discover how to implement end-to-end data analysis workflows, leveraging SQL Server's robust capabilities. This book guides you through collecting, cleaning, and transforming data, querying for insights, creating compelling visualizations, and even constructing predictive models for sophisticated analytics. What this Book will help me do Grasp the essential data science processes and how SQL Server supports them. Conduct data analysis and create interactive visualizations using Power BI. Build, train, and assess predictive models using SQL Server tools. Integrate SQL Server with R, Python, and Azure for enhanced functionality. Apply best practices for managing and transforming big data with SQL Server. Author(s) Marek Chmel and Vladimír Mužný bring their extensive experience in data science and database management to this book. Marek is a seasoned database specialist with a strong background in SQL, while Vladimír is known for his instructional expertise in analytics and data manipulation. Together, they focus on providing actionable insights and practical examples tailored for data professionals. Who is it for? This book is an ideal resource for aspiring and seasoned data scientists, data analysts, and database professionals aiming to deepen their expertise in SQL Server for data science workflows. Beginners with fundamental SQL knowledge will find it a guided entry into data science applications. It is especially suited for those who aim to implement data-driven solutions in their roles while leveraging SQL's capabilities.

Apache Hadoop 3 Quick Start Guide

Dive into the world of distributed data processing with the 'Apache Hadoop 3 Quick Start Guide.' This comprehensive resource equips you with the knowledge needed to handle large datasets effectively using Apache Hadoop. Learn how to set up and configure Hadoop, work with its core components, and explore its powerful ecosystem tools. What this Book will help me do Understand the fundamental concepts of Apache Hadoop, including HDFS, MapReduce, and YARN, and use them to store and process large datasets. Set up and configure Hadoop 3 in both developer and production environments to suit various deployment needs. Gain hands-on experience with Hadoop ecosystem tools like Hive, Kafka, and Spark to enhance your big data processing capabilities. Learn to manage, monitor, and troubleshoot Hadoop clusters efficiently to ensure smooth operations. Analyze real-time streaming data with tools like Apache Storm and perform advanced data analytics using Apache Spark. Author(s) The author of this guide, Vijay Karambelkar, brings years of experience working with big data technologies and Apache Hadoop in real-world applications. With a passion for teaching and simplifying complex topics, Vijay has compiled his expertise to help learners confidently approach Hadoop 3. His detailed, example-driven approach makes this book a practical resource for aspiring data professionals. Who is it for? This book is ideal for software developers, data engineers, and IT professionals who aspire to dive into the field of big data. If you're new to Apache Hadoop or looking to upgrade your skills to include version 3, this guide is for you. A basic understanding of Java programming is recommended to make the most of the topics covered. Embark on this journey to enhance your career in data-intensive industries.

Mastering Apache Cassandra 3.x - Third Edition

This expert guide, "Mastering Apache Cassandra 3.x," is designed for individuals looking to achieve scalable and fault-tolerant database deployment using Apache Cassandra. From mastering the foundational components of Cassandra architecture to advanced topics like clustering and analytics integration with Apache Spark, this book equips readers with practical, actionable skills. What this Book will help me do Understand and deploy Apache Cassandra clusters for fault-tolerant and scalable databases. Use advanced features of CQL3 to streamline database queries and operations. Optimize and configure Cassandra nodes to improve performance for demanding applications. Monitor and manage Cassandra clusters effectively using best practices. Combine Cassandra with Apache Spark to build robust data analytics pipelines. Author(s) Aaron Ploetz and Tejaswini Malepati are experienced technologists and software professionals with extensive expertise in distributed database systems and big data algorithms. They've combined their industry knowledge and teaching backgrounds to create accessible and practical guides for learners worldwide. Their collaborative work is focused on demystifying complex systems for maximum learning impact. Who is it for? This book is ideal for database administrators, software developers, and big data specialists seeking to expand their skill set into scalable data storage using Cassandra. Readers should have a basic understanding of database concepts and some programming experience. If you're looking to design robust databases optimized for modern big data use-cases, this book will serve as a valuable resource.

IBM z14 Model ZR1 Technical Introduction

Abstract This IBM® Redbooks® publication introduces the latest member of the IBM Z platform, the IBM z14 Model ZR1 (Machine Type 3907). It includes information about the Z environment and how it helps integrate data and transactions more securely, and provides insight for faster and more accurate business decisions. The z14 ZR1 is a state-of-the-art data and transaction system that delivers advanced capabilities, which are vital to any digital transformation. The z14 ZR1 is designed for enhanced modularity in an industry-standard footprint. This system excels at the following tasks: Securing data with pervasive encryption Transforming a transactional platform into a data powerhouse Getting more out of the platform with IT Operational Analytics Providing resilience towards zero downtime Accelerating digital transformation with agile service delivery Revolutionizing business processes Mixing open source and Z technologies This book explains how this system uses new innovations and traditional Z strengths to satisfy growing demand for cloud, analytics, and open source technologies. With the z14 ZR1 as the base, applications can run in a trusted, reliable, and secure environment that improves operations and lessens business risk.

IBM z14 Technical Introduction

Abstract This IBM® Redbooks® publication introduces the latest IBM z platform, the IBM z14™. It includes information about the Z environment and how it helps integrate data and transactions more securely, and can infuse insight for faster and more accurate business decisions. The z14 is a state-of-the-art data and transaction system that delivers advanced capabilities, which are vital to the digital era and the trust economy. This system includes the following functionality: Securing data with pervasive encryption Transforming a transactional platform into a data powerhouse Getting more out of the platform with IT Operational Analytics Providing resilience towards zero downtime Accelerating digital transformation with agile service delivery Revolutionizing business processes Blending open source and Z technologies This book explains how this system uses both new innovations and traditional Z strengths to satisfy growing demand for cloud, analytics, and mobile applications. With the z14 as the base, applications can run in a trusted, reliable, and secure environment that both improves operations and lessens business risk.

Kafka Streams in Action

Kafka Streams in Action teaches you everything you need to know to implement stream processing on data flowing into your Kafka platform, allowing you to focus on getting more from your data without sacrificing time or effort. About the Technology Not all stream-based applications require a dedicated processing cluster. The lightweight Kafka Streams library provides exactly the power and simplicity you need for message handling in microservices and real-time event processing. With the Kafka Streams API, you filter and transform data streams with just Kafka and your application. About the Book Kafka Streams in Action teaches you to implement stream processing within the Kafka platform. In this easy-to-follow book, you’ll explore real-world examples to collect, transform, and aggregate data, work with multiple processors, and handle real-time events. You’ll even dive into streaming SQL with KSQL! Practical to the very end, it finishes with testing and operational aspects, such as monitoring and debugging. What's Inside Using the KStreams API Filtering, transforming, and splitting data Working with the Processor API Integrating with external systems About the Reader Assumes some experience with distributed systems. No knowledge of Kafka or streaming applications required. About the Author Bill Bejeck is a Kafka Streams contributor and Confluent engineer with over 15 years of software development experience. Quotes A great way to learn about Kafka Streams and how it is a key enabler of event-driven applications. - From the Foreword by Neha Narkhede, Cocreator of Apache Kafka A comprehensive guide to Kafka Streams—from introduction to production! - Bojan Djurkovic, Cvent Bridges the gap between message brokering and real-time streaming analytics. - Jim Mantheiy Jr., Next Century Valuable both as an introduction to streams as well as an ongoing reference. - Robin Coe, TD Bank
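Kafka Streams itself is a Java library, so the snippet below is only a conceptual stand-in: a Python generator pipeline showing the filter-and-transform shape of the KStream topology described above. The record shape, the discount logic, and the 100.0 threshold are invented for illustration.

```python
def source(records):
    # Stand-in for consuming a Kafka topic as a stream of (key, value) pairs.
    yield from records

def filter_stream(stream, predicate):
    # Analogous to KStream.filter: drop records that fail the predicate.
    return (r for r in stream if predicate(r))

def map_values(stream, fn):
    # Analogous to KStream.mapValues: transform the value, keep the key.
    return ((k, fn(v)) for k, v in stream)

records = [("alice", 120.0), ("bob", 15.0), ("carol", 310.0)]

# Topology: keep only large purchases, then apply a 10% discount.
topology = map_values(
    filter_stream(source(records), lambda kv: kv[1] >= 100.0),
    lambda amount: round(amount * 0.9, 2),
)

print(list(topology))  # [('alice', 108.0), ('carol', 279.0)]
```

Because generators are lazy, records flow through one at a time, which loosely mirrors how a Kafka Streams topology processes each record as it arrives rather than in batches.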

Data Science with SQL Server Quick Start Guide

"Data Science with SQL Server Quick Start Guide" introduces you to leveraging SQL Server's most recent features for data science projects. You will explore the integration of data science techniques using R, Python, and Transact-SQL within SQL Server's environment. What this Book will help me do Use SQL Server's capabilities for data science projects effectively. Understand and preprocess data using SQL queries and statistics. Design, train, and evaluate machine learning models in SQL Server. Visualize data insights through advanced graphing techniques. Deploy and utilize machine learning models within SQL Server environments. Author(s) Dejan Sarka is a data science and SQL Server expert with years of industry experience. He specializes in melding database systems with advanced analytics, offering practical guidance through real-world scenarios. His writing provides clear, step-by-step methods, making complex topics accessible. Who is it for? This book is tailored for professionals familiar with SQL Server who are looking to delve into data science. It is also ideal for data scientists aiming to incorporate SQL Server into their analytics workflows. The content assumes basic exposure to SQL Server, ensuring a straightforward learning curve for its audience.

Beginning Apache Spark 2: With Resilient Distributed Datasets, Spark SQL, Structured Streaming and Spark Machine Learning library

Develop applications for the big data landscape with Spark and Hadoop. This book also explains the role of Spark in developing scalable machine learning and analytics applications with Cloud technologies. Beginning Apache Spark 2 gives you an introduction to Apache Spark and shows you how to work with it. Along the way, you’ll discover resilient distributed datasets (RDDs); use Spark SQL for structured data; and learn stream processing and build real-time applications with Spark Structured Streaming. Furthermore, you’ll learn the fundamentals of Spark ML for machine learning and much more. After you read this book, you will have the fundamentals to become proficient in using Apache Spark and know when and how to apply it to your big data applications. What You Will Learn Understand the Spark unified data processing platform How to run Spark in the Spark shell or Databricks Use and manipulate RDDs Deal with structured data using Spark SQL through its operations and advanced functions Build real-time applications using Spark Structured Streaming Develop intelligent applications with the Spark Machine Learning library Who This Book Is For Programmers and developers active in big data, Hadoop, and Java but who are new to the Apache Spark platform.
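The core RDD operations the book introduces (flatMap, map, reduceByKey) can be mimicked with plain Python to show the shape of a classic word count. This is a conceptual stand-in using only the standard library, not the Spark API itself, and the sample lines are made up.

```python
from collections import defaultdict

lines = ["spark makes big data simple", "big data needs spark"]

# flatMap: split each line into individual words.
words = [w for line in lines for w in line.split()]

# map: pair each word with an initial count of 1.
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts for each distinct word.
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))  # {'spark': 2, 'makes': 1, 'big': 2, 'data': 2, ...}
```

In Spark the same three steps run partitioned across a cluster, with reduceByKey shuffling pairs so that all counts for a given word land on the same node.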

Introduction to IBM Common Data Provider for z Systems

IBM Common Data Provider for z Systems collects, filters, and formats IT operational data in near real-time and provides that data to target analytics solutions. IBM Common Data Provider for z Systems enables authorized IT operations teams using a single web-based interface to specify the IT operational data to be gathered and how it needs to be handled. This data is provided to both on- and off-platform analytic solutions, in a consistent, consumable format for analysis. This Redpaper discusses the value of IBM Common Data Provider for z Systems, provides a high-level reference architecture for IBM Common Data Provider for z Systems, and introduces key components of the architecture. It shows how IBM Common Data Provider for z Systems provides operational data to various analytic solutions. The publication provides high-level integration guidance, preferred practices, tips on planning for IBM Common Data Provider for z Systems, and example integration scenarios.

Getting Started with Kudu

Fast data ingestion, serving, and analytics in the Hadoop ecosystem have forced developers and architects to choose solutions using the least common denominator—either fast analytics at the cost of slow data ingestion or fast data ingestion at the cost of slow analytics. There is an answer to this problem. With the Apache Kudu column-oriented data store, you can easily perform fast analytics on fast data. This practical guide shows you how. Begun as an internal project at Cloudera, Kudu is an open source solution compatible with many data processing frameworks in the Hadoop environment. In this book, current and former solutions professionals from Cloudera provide use cases, examples, best practices, and sample code to help you get up to speed with Kudu. Explore Kudu’s high-level design, including how it spreads data across servers Fully administer a Kudu cluster, enable security, and add or remove nodes Learn Kudu’s client-side APIs, including how to integrate Apache Impala, Spark, and other frameworks for data manipulation Examine Kudu’s schema design, including basic concepts and primitives necessary to make your project successful Explore case studies for using Kudu for real-time IoT analytics, predictive modeling, and in combination with another storage engine
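Kudu tables are declared around an explicit primary key, and writes are effectively upserts against that key, which is what lets new data land continuously while scans stay fast. The toy in-memory table below illustrates only that upsert-then-scan pattern; it is not Kudu's actual client API, and the column names are made up.

```python
class KuduLikeTable:
    """Toy stand-in for a primary-keyed, upsert-capable store."""

    def __init__(self, key_column):
        self.key_column = key_column
        self.rows = {}  # primary key -> row

    def upsert(self, row):
        # Same key: overwrite in place. New key: insert.
        self.rows[row[self.key_column]] = row

    def scan(self, predicate=lambda row: True):
        # Analytics-style scan with a filter predicate.
        return [r for r in self.rows.values() if predicate(r)]

table = KuduLikeTable("device_id")
table.upsert({"device_id": "a1", "temp": 20.5})
table.upsert({"device_id": "a1", "temp": 21.0})  # update, not a duplicate
table.upsert({"device_id": "b2", "temp": 19.0})

print(table.scan(lambda r: r["temp"] > 20))
# [{'device_id': 'a1', 'temp': 21.0}]
```

The key point is that ingestion (upsert) and analytics (scan) run against the same table, rather than forcing the fast-ingest-or-fast-analytics trade-off described above.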

PySpark Cookbook

Dive into the world of big data processing and analytics with the "PySpark Cookbook". This book provides over 60 hands-on recipes for implementing efficient data-intensive solutions using Apache Spark and Python. By mastering these recipes, you'll be equipped to tackle challenges in large-scale data processing, machine learning, and stream analytics. What this Book will help me do Set up and configure PySpark environments effectively, including working with Jupyter for enhanced interactivity. Understand and utilize DataFrames for data manipulation, analysis, and transformation tasks. Develop end-to-end machine learning solutions using the ML and MLlib modules in PySpark. Implement structured streaming and graph-processing solutions to analyze and visualize data streams and relationships. Deploy PySpark applications to the cloud infrastructure efficiently using best practices. Author(s) This book is co-authored by Denny Lee and Tomasz Drabas, who are experienced professionals in data processing and analytics leveraging Python and Apache Spark. With their deep technical expertise and a passion for teaching through practical examples, they aim to make the complex concepts of PySpark accessible to developers of varied experience levels. Who is it for? This book is ideal for Python developers who are keen to delve into the Apache Spark ecosystem. Whether you're just starting with big data or have some experience with Spark, this book provides practical recipes to enhance your skills. Readers looking to solve real-world data-intensive challenges using PySpark will find this resource invaluable.

Streaming Change Data Capture

There are many benefits to becoming a data-driven organization, including the ability to accelerate and improve business decision accuracy through the real-time processing of transactions, social media streams, and IoT data. But those benefits require significant changes to your infrastructure. You need flexible architectures that can copy data to analytics platforms at near-zero latency while maintaining 100% production uptime. Fortunately, a solution already exists. This ebook demonstrates how change data capture (CDC) can meet the scalability, efficiency, real-time, and zero-impact requirements of modern data architectures. Kevin Petrie, Itamar Ankorion, and Dan Potter—technology marketing leaders at Attunity—explain how CDC enables faster and more accurate decisions based on current data and reduces or eliminates full reloads that disrupt production and efficiency. The book examines: How CDC evolved from a niche feature of database replication software to a critical data architecture building block Architectures where data workflow and analysis take place, and their integration points with CDC How CDC identifies and captures source data updates to assist high-speed replication to one or more targets Case studies on cloud-based streaming and streaming to a data lake and related architectures Guiding principles for effectively implementing CDC in cloud, data lake, and streaming environments The Attunity Replicate platform for efficiently loading data across all major database, data warehouse, cloud, streaming, and Hadoop platforms
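Production CDC tools read source updates from the database transaction log, but the shape of the change events they emit can be illustrated by diffing two keyed snapshots. The sketch below shows only that event shape; it is not how log-based CDC is actually implemented, and the table contents are invented.

```python
def capture_changes(before, after):
    """Emit CDC-style (operation, key, row) events for two snapshots."""
    events = []
    for key, row in after.items():
        if key not in before:
            events.append(("insert", key, row))
        elif before[key] != row:
            events.append(("update", key, row))
    for key in before:
        if key not in after:
            events.append(("delete", key, before[key]))
    return events

before = {1: {"name": "Ada"}, 2: {"name": "Bob"}}
after = {1: {"name": "Ada L."}, 3: {"name": "Cy"}}

for event in capture_changes(before, after):
    print(event)
# ('update', 1, {'name': 'Ada L.'})
# ('insert', 3, {'name': 'Cy'})
# ('delete', 2, {'name': 'Bob'})
```

A downstream replicator only has to apply this small event stream to keep a target in sync, instead of periodically reloading the full table, which is the zero-impact property the ebook emphasizes.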

Hortonworks Data Platform with IBM Spectrum Scale: Reference Guide for Building an Integrated Solution

This IBM® Redpaper™ publication provides guidance on building an enterprise-grade data lake by using IBM Spectrum™ Scale and Hortonworks Data Platform for performing in-place Hadoop or Spark-based analytics. It covers the benefits of the integrated solution, and gives guidance about the types of deployment models and considerations during the implementation of these models. Hortonworks Data Platform (HDP) is a leading Hadoop and Spark distribution. HDP addresses the complete needs of data-at-rest, powers real-time customer applications, and delivers robust analytics that accelerate decision making and innovation. IBM Spectrum Scale™ is flexible and scalable software-defined file storage for analytics workloads. Enterprises around the globe have deployed IBM Spectrum Scale to form large data lakes and content repositories to perform high-performance computing (HPC) and analytics workloads. It can scale both performance and capacity without bottlenecks.

Next-Generation Big Data: A Practical Guide to Apache Kudu, Impala, and Spark

Utilize this practical and easy-to-follow guide to modernize traditional enterprise data warehouse and business intelligence environments with next-generation big data technologies. Next-Generation Big Data takes a holistic approach, covering the most important aspects of modern enterprise big data. The book covers not only the main technology stack but also the next-generation tools and applications used for big data warehousing, data warehouse optimization, real-time and batch data ingestion and processing, real-time data visualization, big data governance, data wrangling, big data cloud deployments, and distributed in-memory big data computing. Finally, the book has an extensive and detailed coverage of big data case studies from Navistar, Cerner, British Telecom, Shopzilla, Thomson Reuters, and Mastercard. What You’ll Learn Install Apache Kudu, Impala, and Spark to modernize enterprise data warehouse and business intelligence environments, complete with real-world, easy-to-follow examples, and practical advice Integrate HBase, Solr, Oracle, SQL Server, MySQL, Flume, Kafka, HDFS, and Amazon S3 with Apache Kudu, Impala, and Spark Use StreamSets, Talend, Pentaho, and CDAP for real-time and batch data ingestion and processing Utilize Trifacta, Alteryx, and Datameer for data wrangling and interactive data processing Turbocharge Spark with Alluxio, a distributed in-memory storage platform Deploy big data in the cloud using Cloudera Director Perform real-time data visualization and time series analysis using Zoomdata, Apache Kudu, Impala, and Spark Understand enterprise big data topics such as big data governance, metadata management, data lineage, impact analysis, and policy enforcement, and how to use Cloudera Navigator to perform common data governance tasks Implement big data use cases such as big data warehousing, data warehouse optimization, Internet of Things, real-time data ingestion and analytics, complex event processing, and scalable predictive modeling 
Study real-world big data case studies from innovative companies, including Navistar, Cerner, British Telecom, Shopzilla, Thomson Reuters, and Mastercard Who This Book Is For BI and big data warehouse professionals interested in gaining practical and real-world insight into next-generation big data processing and analytics using Apache Kudu, Impala, and Spark; and those who want to learn more about other advanced enterprise topics

Implementing IBM FlashSystem 900 Model AE3

Abstract Today’s global organizations depend on being able to unlock business insights from massive volumes of data. Now, with IBM® FlashSystem 900 Model AE3, powered by IBM FlashCore® technology, they can make faster decisions based on real-time insights and unleash the power of the most demanding applications, including online transaction processing (OLTP) and analytics databases, virtual desktop infrastructures (VDIs), technical computing applications, and cloud environments. This IBM Redbooks® publication introduces clients to the IBM FlashSystem® 900 Model AE3. It provides in-depth knowledge of the product architecture, software and hardware, implementation, and hints and tips. Also illustrated are use cases that show real-world solutions for tiering, flash-only, and preferred-read, and also examples of the benefits gained by integrating the FlashSystem storage into business environments. This book is intended for pre-sales and post-sales technical support professionals and storage administrators, and for anyone who wants to understand how to implement this new and exciting technology.

IBM z14 Model ZR1 Technical Guide

Abstract This IBM® Redbooks® publication describes the new member of the IBM Z® family, IBM z14™ Model ZR1 (Machine Type 3907). It includes information about the Z environment and how it helps integrate data and transactions more securely, and can infuse insight for faster and more accurate business decisions. The z14 ZR1 is a state-of-the-art data and transaction system that delivers advanced capabilities, which are vital to any digital transformation. The z14 ZR1 is designed for enhanced modularity in an industry-standard footprint. A data-centric infrastructure must always be available with a 99.999% or better availability, have flawless data integrity, and be secured from misuse. It also must be an integrated infrastructure that can support new applications. Finally, it must have integrated capabilities that can provide new mobile capabilities with real-time analytics that are delivered by a secure cloud infrastructure. IBM z14 ZR1 servers are designed with improved scalability, performance, security, resiliency, availability, and virtualization. The superscalar design allows z14 ZR1 servers to deliver a record level of capacity over the previous IBM Z platforms. In its maximum configuration, z14 ZR1 is powered by up to 30 client characterizable microprocessors (cores) running at 4.5 GHz. This configuration can run more than 29,000 million instructions per second and supports up to 8 TB of client memory. The IBM z14 Model ZR1 is estimated to provide up to 54% more total system capacity than the IBM z13s® Model N20. This Redbooks publication provides information about IBM z14 ZR1 and its functions, features, and associated software support. More information is offered in areas that are relevant to technical planning. It is intended for systems engineers, consultants, planners, and anyone who wants to understand the IBM Z servers functions and plan for their usage. It is not intended as an introduction to mainframes. Readers are expected to be generally familiar with IBM Z technology and terminology.

Data Analytics with Spark Using Python, First edition

Data Analytics with Spark Using Python introduces and solidifies the concepts behind Spark 2.x, teaching working developers, architects, and data professionals exactly how to build practical Spark solutions. Jeffrey Aven covers all aspects of Spark development, from basic programming to Spark SQL, SparkR, Spark Streaming, messaging, NoSQL, and Hadoop integration. Each chapter presents practical exercises deploying Spark to your local or cloud environment, plus programming exercises for building real applications. Unlike other Spark guides, this book explains crucial concepts step-by-step, assuming no extensive background as an open source developer. It provides a complete foundation for quickly progressing to more advanced data science and machine learning topics. This guide will help you: Understand Spark basics that will make you a better programmer and cluster “citizen” Master Spark programming techniques that maximize your productivity Choose the right approach for each problem Make the most of built-in platform constructs, including broadcast variables, accumulators, effective partitioning, caching, and checkpointing Leverage powerful tools for managing streaming, structured, semi-structured, and unstructured data

Big Data Analytics with Hadoop 3

Big Data Analytics with Hadoop 3 is your comprehensive guide to understanding and leveraging the power of Apache Hadoop for large-scale data processing and analytics. Through practical examples, it introduces the tools and techniques necessary to integrate Hadoop with other popular frameworks, enabling efficient data handling, processing, and visualization. What this Book will help me do Understand the foundational components and features of Apache Hadoop 3 such as HDFS, YARN, and MapReduce. Gain the ability to integrate Hadoop with programming languages like Python and R for data analysis. Learn the skills to utilize tools such as Apache Spark and Apache Flink for real-time data analytics within the Hadoop ecosystem. Develop expertise in setting up a Hadoop cluster and performing analytics in cloud environments such as AWS. Master the process of building practical big data analytics pipelines for end-to-end data processing. Author(s) Sridhar Alla is a seasoned big data professional with extensive industry experience in building and deploying scalable big data analytics solutions. Known for his expertise in Hadoop and related ecosystems, Sridhar combines technical depth with clear communication in his writing, providing practical insights and hands-on knowledge. Who is it for? This book is tailored for data professionals, software engineers, and data scientists looking to expand their expertise in big data analytics using Hadoop 3. Whether you're an experienced developer or new to the big data ecosystem, this book provides the step-by-step guidance and practical examples needed to advance your skills and achieve your analytical goals.

Hands-On Data Warehousing with Azure Data Factory

Dive into the world of ETL (Extract, Transform, Load) with 'Hands-On Data Warehousing with Azure Data Factory'. This book guides readers through the essential techniques for working with Azure Data Factory and SQL Server Integration Services to design, implement, and optimize ETL solutions for both on-premises and cloud data environments. What this Book will help me do Understand and utilize Azure Data Factory and SQL Server Integration Services to build ETL solutions. Design scalable and high-performance ETL architectures tailored to modern data problems. Integrate various Azure services, such as Azure Data Lake Analytics, Machine Learning, and Databricks Spark, into your workflows. Troubleshoot and optimize ETL pipelines and address common challenges in data processing. Create insightful Power BI dashboards to visualize and interact with data from your ETL workflows. Author(s) Authors Christian Cote, Michelle Gutzait, and Giuseppe Ciaburro bring a wealth of experience in data engineering and cloud technologies to this practical guide. Combining expertise in the Azure ecosystem and hands-on data warehousing, they deliver actionable insights for working professionals. Who is it for? This book is crafted for software professionals working in data engineering, especially those specializing in ETL processes. Readers with a foundational knowledge of SQL Server and cloud infrastructures will benefit most. If you aspire to implement state-of-the-art ETL pipelines or enhance existing workflows with ADF and SSIS, this book is an ideal resource.

IBM Spectrum Scale Best Practices for Genomics Medicine Workloads

Advancing the science of medicine by targeting a disease more precisely with treatment specific to each patient relies on access to that patient's genomics information and the ability to process massive amounts of genomics data quickly. Although genomics data is becoming a critical source for precision medicine, it is expected to create an expanding data ecosystem. Therefore, hospitals, genome centers, medical research centers, and other clinical institutes need to explore new methods of storing, accessing, securing, managing, sharing, and analyzing significant amounts of data. Healthcare and life sciences organizations that are running data-intensive genomics workloads on an IT infrastructure that lacks scalability, flexibility, performance, management, and cognitive capabilities also need to modernize and transform their infrastructure to support current and future requirements. IBM® offers an integrated solution for genomics that is based on composable infrastructure. This solution enables administrators to build an IT environment in a way that disaggregates the underlying compute, storage, and network resources. Such a composable building block based solution for genomics addresses the most complex data management aspect and allows organizations to store, access, manage, and share huge volumes of genome sequencing data. IBM Spectrum™ Scale is software-defined storage that is used to manage storage and provide massive scale, a global namespace, and high-performance data access with many enterprise features. IBM Spectrum Scale™ is used in clustered environments, provides unified access to data via file protocols (POSIX, NFS, and SMB) and object protocols (Swift and S3), and supports analytic workloads via HDFS connectors. 
Deploying IBM Spectrum Scale and IBM Elastic Storage™ Server (IBM ESS) as a composable storage building block in a genomics next generation sequencing deployment offers the key benefits of performance, scalability, analytics, and collaboration via multiple protocols. This IBM Redpaper™ publication describes a composable solution, with detailed architecture definitions for storage, compute, and networking services for genomics next generation sequencing, that enables solution architects to benefit from tried-and-tested deployments and to quickly plan and design an end-to-end infrastructure deployment. The preferred practices and fully tested recommendations described in this paper are derived from running the GATK Best Practices workflow from the Broad Institute. The scenarios provide everything that is required, including ready-to-use configuration and tuning templates for the different building blocks (compute, network, and storage), which simplify deployment and increase assurance about performance for genomics workloads. The solution is designed to be elastic in nature, and the disaggregation of the building blocks allows IT administrators to easily and optimally configure the solution with maximum flexibility. The intended audience for this paper is technical decision makers, IT architects, deployment engineers, and administrators who work in the healthcare domain on genomics-based workloads.

A Deep Dive into NoSQL Databases: The Use Cases and Applications

A Deep Dive into NoSQL Databases: The Use Cases and Applications, Volume 109, the latest release in the Advances in Computers series (first published in 1960), presents detailed coverage of innovations in computer hardware, software, theory, design, and applications. In addition, it provides contributors with a medium in which they can explore their subjects in greater depth and breadth. This update includes chapters on NoSQL and NewSQL databases for big data analytics and distributed computing, NewSQL databases and scalable in-memory analytics, a NoSQL web crawler application, NoSQL security, a comparative study of different in-memory (No/New)SQL databases, hands-on coverage of four NoSQL databases, the Hadoop ecosystem, and more. The book provides comprehensive yet compact coverage of the popular domain of NoSQL databases for IT professionals, practitioners, and professors; articulates how big data analytics is simplified and streamlined by NoSQL database systems; and sets a stimulating foundation, with all the relevant details, for NoSQL database researchers, developers, and administrators.

Implementing IBM FlashSystem V9000 AE3

Abstract The success or failure of businesses often depends on how well organizations use their data assets for competitive advantage. Deeper insights from data require better information technology. As organizations modernize their IT infrastructure to boost innovation rather than limit it, they need a data storage system that can keep pace with several trends that affect their business: highly virtualized environments, cloud computing, mobile and social systems of engagement, and in-depth, real-time analytics. Making the correct decision on storage investment is critical. Organizations must have enough storage performance and agility to innovate when they need to implement cloud-based IT services, deploy virtual desktop infrastructure, enhance fraud detection, and use new analytics capabilities. At the same time, future storage investments must lower IT infrastructure costs while helping organizations derive the greatest possible value from their data assets. The IBM® FlashSystem V9000 is the premier, fully integrated, Tier 1, all-flash offering from IBM. It has changed the economics of today's data center by eliminating storage bottlenecks. Its software-defined storage features simplify data management, improve data security, and preserve your investments in storage. The IBM FlashSystem® V9000 SAS expansion enclosures provide new tiering options with read-intensive SSDs or nearline SAS HDDs. IBM FlashSystem V9000 includes IBM FlashCore® technology and advanced software-defined storage in one solution in a compact 6U form factor. IBM FlashSystem V9000 improves business application availability and delivers greater resource utilization, so you can get the most from your storage resources and achieve a simpler, more scalable, and cost-efficient IT infrastructure. This IBM Redbooks® publication provides information about IBM FlashSystem V9000 Software V8.1.
It describes the core product architecture, software, hardware, and implementation, and provides hints and tips. The underlying hardware and software architecture and features of the IBM FlashSystem V9000 AC3 control enclosure and IBM Spectrum Virtualize 8.1 software are described in these publications: Implementing IBM FlashSystem 900 Model AE3, SG24-8414, and Implementing the IBM System Storage SAN Volume Controller V7.4, SG24-7933. IBM FlashSystem V9000 software functions, management tools, and interoperability combine the performance of the IBM FlashSystem architecture with the advanced functions of software-defined storage to deliver the performance, efficiency, and functionality that enterprise workloads demanding IBM MicroLatency® response time require. This book offers IBM FlashSystem V9000 scalability concepts and guidelines for planning, installing, and configuring, which can help environments scale up and out to add more flash capacity and expand virtualized systems. Port utilization methodologies are provided to help you maximize the full potential of IBM FlashSystem V9000 performance and low latency in your scalable environment. This book is intended for pre-sales and post-sales technical support professionals, storage administrators, and anyone who wants to understand how to implement this exciting technology.

IBM z14 Model ZR1 Technical Introduction

Abstract This IBM® Redbooks® publication introduces the latest member of the IBM Z platform, the IBM z14 Model ZR1 (Machine Type 3907). It includes information about the Z environment and how it helps integrate data and transactions more securely, and it provides insight for faster and more accurate business decisions. The z14 ZR1 is a state-of-the-art data and transaction system that delivers advanced capabilities, which are vital to any digital transformation. The z14 ZR1 is designed for enhanced modularity in an industry-standard footprint. This system excels at the following tasks: securing data with pervasive encryption; transforming a transactional platform into a data powerhouse; getting more out of the platform with IT operational analytics; providing resilience toward zero downtime; accelerating digital transformation with agile service delivery; revolutionizing business processes; and mixing open source and Z technologies. This book explains how the system uses new innovations and traditional Z strengths to satisfy the growing demand for cloud, analytics, and open source technologies. With the z14 ZR1 as the base, applications can run in a trusted, reliable, and secure environment that improves operations and lessens business risk.

Modern Big Data Processing with Hadoop

Delve into the world of big data with 'Modern Big Data Processing with Hadoop.' This comprehensive guide introduces you to the powerful capabilities of Apache Hadoop and its ecosystem for solving data processing and analytics challenges. By the end, you will have mastered the techniques needed to architect innovative, scalable, and efficient big data solutions. What this book will help me do: Master the principles of building an enterprise-level big data strategy with Apache Hadoop. Learn to integrate Hadoop with tools such as Apache Spark, Elasticsearch, and more for comprehensive solutions. Set up and manage your big data architecture, including deployment on cloud platforms with Apache Ambari. Develop real-time data pipelines and enterprise search solutions. Leverage advanced visualization tools like Apache Superset to make sense of data insights. Authors: Patil, Kumar, and Shindgikar are experienced big data professionals and accomplished authors. With years of hands-on experience implementing and managing Apache Hadoop systems, they bring a depth of expertise to their writing. Their dedication lies in making complex technical concepts accessible while demonstrating real-world best practices. Who is it for? This book is designed for data professionals aiming to advance their expertise in big data solutions using Apache Hadoop. Ideal readers include engineers and project managers involved in data architecture and those aspiring to become big data architects. Some prior exposure to big data systems is helpful to get the most from this book's insights and tutorials.