O'Reilly Data Engineering Books

IBM Software Defined Environment

2015-08-12 O'Reilly Amazon

book

Dino Quintero , Ashish Nainwal , Fabio Martins , Marcin Tabinowski , William M Genovese , KiWaon Kim , Dusan Smolej , Ming Jun MJ Li

data data-engineering IBM Analytics Cloud Computing

This IBM® Redbooks® publication introduces the IBM Software Defined Environment (SDE) solution, which helps to optimize the entire computing infrastructure--compute, storage, and network resources--so that it can adapt to the type of work required. In today's environment, resources are assigned manually to workloads, but that happens automatically in an SDE. In an SDE, workloads are dynamically assigned to IT resources based on application characteristics, best-available resources, and service level policies so that they deliver continuous, dynamic optimization and reconfiguration to address infrastructure issues. Underlying all of this are policy-based compliance checks and updates in a centrally managed environment. Readers get a broad introduction to the new architecture. Think integration, automation, and optimization. Those are enablers of cloud delivery and analytics. SDE can accelerate business success by matching workloads and resources so that you have a responsive, adaptive environment. With the IBM Software Defined Environment, infrastructure is fully programmable to rapidly deploy workloads on optimal resources and to instantly respond to changing business demands. This information is intended for IBM sales representatives, IBM software architects, IBM Systems Technology Group brand specialists, distributors, resellers, and anyone who is developing or implementing SDE.

Spark Cookbook

2015-07-27 O'Reilly Amazon

book

Rishi Yadav

data data-engineering apache-spark AI/ML Analytics Big Data

Spark Cookbook is your practical guide to mastering Apache Spark, encompassing a comprehensive set of patterns and examples. Through its over 60 recipes, you will gain actionable insights into using Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX effectively for your big data needs. What this Book will help me do Understand how to install and configure Apache Spark in various environments. Build data pipelines and perform real-time analytics with Spark Streaming. Utilize Spark SQL for interactive data querying and reporting. Apply machine learning workflows using MLlib, including supervised and unsupervised models. Develop optimized big data solutions and integrate them into enterprise platforms. Author(s) Rishi Yadav, the author of Spark Cookbook, is an experienced data engineer and technical expert with deep insights into big data processing frameworks. Yadav has spent years working with Spark and its ecosystem, providing practical guidance to developers and data scientists alike. This book reflects their commitment to sharing actionable knowledge. Who is it for? This book is designed for data engineers, developers, and data scientists who work with big data systems and wish to utilize Apache Spark effectively. Whether you're looking to optimize existing Spark applications or explore its libraries for new use cases, this book will provide the guidance you need. A basic familiarity with big data concepts and programming in languages like Java or Python is recommended to make the most out of this book.

ElasticSearch Blueprints

2015-07-24 O'Reilly Amazon

book

Vineeth Mohan

data data-engineering search elasticsearch Analytics ELK

Dive into search technology with "ElasticSearch Blueprints"! This is the perfect project-based guide to help you master Elasticsearch. You will learn how to build and design scalable, effective search solutions, improve search relevancy, manage data efficiently, perform analytics, and visualize your data in comprehensive ways. What this Book will help me do Build and fine-tune scalable search engine features with Elasticsearch. Design and implement accurate ecommerce search solutions using filters. Analyze and visualize data with Elasticsearch's powerful data aggregation capabilities. Increase search relevancy and enhance user query assistance using analyzers. Incorporate enhanced data organization methods, including parent-child relationships. Author(s) Vineeth Mohan is an experienced professional specializing in search technologies. With a strong technical background, they have engaged deeply with Elasticsearch, creating solutions that address practical challenges. Their approach focuses on making technical topics accessible, guiding readers step-by-step through projects. Who is it for? This book is tailored for data professionals, application developers, and enthusiasts eager to delve into search technologies. Whether you're beginning with Elasticsearch or aiming to refine your skills, this guide will advance your expertise. By working through practical cases, you'll gain confidence in using Elasticsearch effectively to meet diverse requirements.

IBM Software Defined Infrastructure for Big Data Analytics Workloads

2015-06-29 O'Reilly Amazon

book

Marcelo Correia Lima , Dino Quintero , Maciej Olejniczak , Daniel de Souza Casali , Istvan Gabor Szabo , Nilton Carlos dos Santos , Tiago Rodrigues de Mello

data data-engineering IBM ibm-power-systems Analytics Big Data

This IBM® Redbooks® publication documents how IBM Platform Computing, with its IBM Platform Symphony® MapReduce framework, IBM Spectrum Scale (based on IBM GPFS™), IBM Platform LSF®, and the Advanced Service Controller for Platform Symphony work together as an infrastructure to manage not just Hadoop-related offerings, but many popular industry offerings such as Apache Spark, Storm, MongoDB, Cassandra, and so on. It describes the different ways to run Hadoop in a big data environment, and demonstrates how IBM Platform Computing solutions, such as Platform Symphony and Platform LSF with its MapReduce Accelerator, can improve performance and agility when running Hadoop on distributed workload managers offered by IBM. This information is for technical professionals (consultants, technical support staff, IT architects, and IT specialists) who are responsible for delivering cost-effective cloud services and big data solutions on IBM Power Systems™ to help uncover insights among clients' data so they can optimize product development and business results.

Implementing an IBM InfoSphere BigInsights Cluster using Linux on Power

2015-06-16 O'Reilly Amazon

book

Dino Quintero , Ichsan Mulia Permata , Peter McCullagh , Pablo Barquero Garro , Franz Friedrich Liebinger Portela , Joanna Wong , Peng Jiang , Luis Carlos Cruz Huertas , John Wright , Esteban Arias Navarro , Rodrigo Ceron Ferreira de Castro

data data-engineering IBM infosphere Analytics Big Data

This IBM® Redbooks® publication demonstrates and documents how to implement and manage an IBM PowerLinux™ cluster for big data focusing on hardware management, operating systems provisioning, application provisioning, cluster readiness check, hardware, operating system, IBM InfoSphere® BigInsights™, IBM Platform Symphony®, IBM Spectrum™ Scale (formerly IBM GPFS™), applications monitoring, and performance tuning. This publication shows that IBM PowerLinux clustering solutions (hardware and software) deliver significant value to clients that need cost-effective, highly scalable, and robust solutions for big data and analytics workloads. This book documents and addresses topics on how to use IBM Platform Cluster Manager to manage PowerLinux big data clusters through IBM InfoSphere BigInsights, Spectrum Scale, and Platform Symphony. This book documents how to set up and manage a big data cluster on PowerLinux servers to customize application and programming solutions, and to tune applications to use IBM hardware architectures. This document uses the architectural technologies and the software solutions that are available from IBM to help solve challenging technical and business problems. This book is targeted at technical professionals (consultants, technical support staff, IT Architects, and IT Specialists) who are responsible for delivering cost-effective Linux on IBM Power Systems™ solutions that help uncover insights among clients' data so they can act to optimize business results, product development, and scientific discoveries.

Implementing IBM FlashSystem 900

2015-05-27 O'Reilly Amazon

book

Karen Orlando , Jon Herd , Detlef Helmbrecht , Carsten Larsen , Matt Levan

data data-engineering IBM Analytics Cloud Computing

Today's global organizations depend on being able to unlock business insights from massive volumes of data. Now, with IBM® FlashSystem™ 900, powered by IBM FlashCore™ technology, they can make faster decisions based on real-time insights and unleash the power of the most demanding applications, including online transaction processing (OLTP) and analytics databases, virtual desktop infrastructures (VDIs), technical computing applications, and cloud environments. This IBM Redbooks® publication introduces clients to the IBM FlashSystem® 900. It provides in-depth knowledge of the product architecture, software and hardware, implementation, and hints and tips. Also illustrated are use cases that show real-world solutions for tiering, flash-only, and preferred-read, and also examples of the benefits gained by integrating the FlashSystem storage into business environments. This book is intended for pre-sales and post-sales technical support professionals and storage administrators, and for anyone who wants to understand how to implement this new and exciting technology. This book describes the following offerings of the IBM Spectrum™ Storage family: IBM Spectrum Storage™, IBM Spectrum Control, IBM Spectrum Virtualize, IBM Spectrum Scale, and IBM Spectrum Accelerate.

Designing and Operating a Data Reservoir

2015-05-26 O'Reilly Amazon

book

Jay Limburn , Mandy Chessell , Nigel L Jones , David Radley , Kevin Shank

data data-engineering Analytics Big Data HTML IBM

Together, big data and analytics have tremendous potential to improve the way we use precious resources, to provide more personalized services, and to protect ourselves from unexpected and ill-intentioned activities. To fully use big data and analytics, an organization needs a system of insight. This is an ecosystem where individuals can locate and access data, and build visualizations and new analytical models that can be deployed into the IT systems to improve the operations of the organization. The data that is most valuable for analytics is also valuable in its own right and typically contains personal and private information about key people in the organization such as customers, employees, and suppliers. Although universal access to data is desirable, safeguards are necessary to protect people's privacy, prevent data leakage, and detect suspicious activity. The data reservoir is a reference architecture that balances the desire for easy access to data with information governance and security. The data reservoir reference architecture describes the technical capabilities necessary for a system of insight, while being independent of specific technologies. Being technology independent is important, because most organizations already have investments in data platforms that they want to incorporate in their solution. In addition, technology is continually improving, and the choice of technology is often dictated by the volume, variety, and velocity of the data being managed. A system of insight needs more than technology to succeed. The data reservoir reference architecture includes description of governance and management processes and definitions to ensure the human and business systems around the technology support a collaborative, self-service, and safe environment for data use. 
The data reservoir reference architecture was first introduced in Governing and Managing Big Data for Analytics and Decision Makers, REDP-5120, which is available at: http://www.redbooks.ibm.com/redpieces/abstracts/redp5120.html. This IBM® Redbooks publication, Designing and Operating a Data Reservoir, builds on that material to provide more detail on the capabilities and internal workings of a data reservoir.

IBM Spectrum Scale (formerly GPFS)

2015-05-26 O'Reilly Amazon

book

Dino Quintero , Carlos Henrique Fachim , Andrei Socoliuc , Willard Davis , Olaf Weiser , Steve Duersch , Puneet Chaudhary , Luis Bolinches

data data-engineering IBM ibm-spectrum-control Analytics Big Data

This IBM® Redbooks® publication updates and complements the previous publication: Implementing the IBM General Parallel File System in a Cross Platform Environment, SG24-7844, with additional updates since the previous publication version was released with IBM General Parallel File System (GPFS™). Since then, two releases have been made available up to the latest version of IBM Spectrum™ Scale 4.1. Topics such as what is new in Spectrum Scale, Spectrum Scale licensing updates (Express/Standard/Advanced), Spectrum Scale infrastructure support/updates, storage support (IBM and OEM), operating system and platform support, Spectrum Scale global sharing - Active File Management (AFM), and considerations for the integration of Spectrum Scale in IBM Tivoli® Storage Manager (Spectrum Protect) backup solutions are discussed in this new IBM Redbooks publication. This publication provides additional topics such as planning, usability, best practices, monitoring, problem determination, and so on. The main concept for this publication is to bring you up to date with the latest features and capabilities of IBM Spectrum Scale as the solution has become a key component of the reference architecture for clouds, analytics, mobile, social media, and much more. This publication targets technical professionals (consultants, technical support staff, IT Architects, and IT Specialists) responsible for delivering cost effective cloud services and big data solutions on IBM Power Systems™ helping to uncover insights among clients' data so they can take actions to optimize business results, product development, and scientific discoveries.

Implementation Best Practices for IBM DB2 BLU Acceleration with SAP BW on IBM Power Systems

2015-05-11 O'Reilly Amazon

book

Dino Quintero , Adriana Melges Quintanilha Weingart , Speitim Velic , Yukiko Itaya

data data-engineering relational-databases ibm-db2 Analytics Data Analytics

BLU Acceleration is a new technology that has been developed by IBM® and integrated directly into the IBM DB2® engine. BLU Acceleration is a new storage engine along with integrated run time (directly into the core DB2 engine) to support the storage and analysis of column-organized tables. The BLU Acceleration processing is parallel to the regular, row-based table processing found in the DB2 engine. This is not a bolt-on technology nor is it a separate analytic engine that sits outside of DB2. Much like when IBM added XML data as a first class object within the database along with all the storage and processing enhancements that came with XML, now IBM has added column-organized tables directly into the storage and processing engine of DB2. This IBM Redbooks® publication shows examples on an IBM Power Systems™ entry server as a starter configuration for small organizations, and builds larger configurations with larger IBM Power Systems servers. This publication takes you through how to build a BLU Acceleration solution on IBM POWER® with an SAP landscape integrated into it. This publication implements SAP NetWeaver Business Warehouse systems as part of the scenario, using another DB2 feature called Near-Line Storage (NLS), on IBM POWER virtualization features, to develop and document recommended best-practice scenarios. This publication is targeted towards technical professionals (DBAs, data architects, consultants, technical support staff, and IT specialists) responsible for delivering cost-effective data management solutions to provide the best system configuration for their clients' data analytics on Power Systems.

Big Data

2015-04-30 O'Reilly Amazon

book

James Warren , Nathan Marz

data data-engineering AI/ML Analytics AWS Lambda Big Data

Big Data teaches you to build big data systems using an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data. It describes a scalable, easy-to-understand approach to big data systems that can be built and run by a small team. Following a realistic example, this book guides readers through the theory of big data systems, how to implement them in practice, and how to deploy and operate them once they're built. About the Technology About the Book Web-scale applications like social networks, real-time analytics, or e-commerce sites deal with a lot of data, whose volume and velocity exceed the limits of traditional database systems. These applications require architectures built around clusters of machines to store and process data of any size, or speed. Fortunately, scale and simplicity are not mutually exclusive. Big Data teaches you to build big data systems using an architecture designed specifically to capture and analyze web-scale data. This book presents the Lambda Architecture, a scalable, easy-to-understand approach that can be built and run by a small team. You'll explore the theory of big data systems and how to implement them in practice. In addition to discovering a general framework for processing big data, you'll learn specific technologies like Hadoop, Storm, and NoSQL databases. What's Inside Introduction to big data systems Real-time processing of web-scale data Tools like Hadoop, Cassandra, and Storm Extensions to traditional database skills About the Reader This book requires no previous exposure to large-scale data analysis or NoSQL tools. Familiarity with traditional databases is helpful. About the Authors Nathan Marz is the creator of Apache Storm and the originator of the Lambda Architecture for big data systems. James Warren is an analytics architect with a background in machine learning and scientific computing.
Quotes Transcends individual tools or platforms. Required reading for anyone working with big data systems. - Jonathan Esterhazy, Groupon A comprehensive, example-driven tour of the Lambda Architecture with its originator as your guide. - Mark Fisher, Pivotal Contains wisdom that can only be gathered after tackling many big data projects. A must-read. - Pere Ferrera Bertran, Datasalt The de facto guide to streamlining your data pipeline in batch and near-real time. - Alex Holmes, Author of "Hadoop in Practice"

Hadoop Essentials

2015-04-29 O'Reilly Amazon

book

Shiva Achari

data data-engineering Hadoop Analytics Big Data Data Analytics

In 'Hadoop Essentials,' you'll embark on an engaging journey to master the Hadoop ecosystem. This book covers fundamental to advanced topics, from HDFS and MapReduce to real-time analytics with Spark, empowering you to handle modern data challenges efficiently. What this Book will help me do Understand the core components of Hadoop, including HDFS, YARN, and MapReduce, for foundational knowledge. Learn to optimize Big Data architectures and improve application performance. Utilize tools like Hive and Pig for efficient data querying and processing. Master data ingestion technologies like Sqoop and Flume for seamless data management. Achieve fluency in real-time data analytics using modern tools like Apache Spark and Apache Storm. Author(s) Shiva Achari is a seasoned expert in Big Data and distributed systems with in-depth knowledge of the Hadoop ecosystem. With years of experience in both development and teaching, they craft content that bridges practical know-how with theoretical insights in a highly accessible style. Who is it for? This book is perfect for system and application developers aiming to learn practical applications of Hadoop. It suits professionals seeking solutions to real-world Big Data challenges as well as those familiar with distributed systems basics and looking to deepen their expertise in advanced data analysis.

Apache Solr Search Patterns

2015-04-24 O'Reilly Amazon

book

Jayant Kumar

data data-engineering search solr Analytics ELK

Master advanced Apache Solr techniques in this professional guide. This book dives deeply into deploying and optimizing Solr-powered search engines and explores high-performance techniques. Learn to leverage your data with accessible, comprehensive, and practical insights. What this Book will help me do Learn to customize Solr's query scorer to provide tailored search results. Understand the internals of Solr, including indexing and query facilities, for better optimization. Implement scalable and reliable search clusters using SolrCloud. Explore the use of Solr for spatial, e-commerce, and advertising searches. Combine Solr with front-end technologies like AJAX and advanced tagging with FSTs. Author(s) Jayant Kumar, an experienced developer and search solutions architect, specializes in leveraging Apache Solr. With years of practical experience, he brings unique insights into scaling search platforms. His commitment to imparting clear, actionable knowledge is reflected in this focused resource. Who is it for? This book is ideal for software developers and architects embedded in the Solr ecosystem looking to enhance their expertise. If you are seeking to develop advanced and scalable solutions, master Solr's core capabilities, or improve your analytics and graph-generating skills, this book will support your goals.

IBM z13 Technical Guide

2015-04-17 O'Reilly Amazon

book

Hans-Peter Eckam , Frank Packheiser , Ewerson Palacio , Parwez Hamid , Steven LaFalce , Rakesh Krishnakumar , Octavian Lascu , Maurício Andozia Nogueira , Erik Bakker , Lourenço Luitgards Moura Neto , Andre Spahni , Giancarlo Rodolfi

data data-engineering IBM Analytics Cloud Computing Cyber Security

Digital business has been driving the transformation of underlying IT infrastructure to be more efficient, secure, adaptive, and integrated. Information Technology (IT) must be able to handle the explosive growth of mobile clients and employees. IT also must be able to use enormous amounts of data to provide deep and real-time insights to help achieve the greatest business impact. This IBM® Redbooks® publication addresses the new IBM Mainframe, the IBM z13. The IBM z13 is the trusted enterprise platform for integrating data, transactions, and insight. A data-centric infrastructure must always be available with a 99.999% or better availability, have flawless data integrity, and be secured from misuse. It needs to be an integrated infrastructure that can support new applications. It needs to have integrated capabilities that can provide new mobile capabilities with real-time analytics delivered by a secure cloud infrastructure. IBM z13 is designed with improved scalability, performance, security, resiliency, availability, and virtualization. The superscalar design allows the z13 to deliver a record level of capacity over the prior z Systems. In its maximum configuration, z13 is powered by up to 141 client characterizable microprocessors (cores) running at 5 GHz. This configuration can run more than 110,000 millions of instructions per second (MIPS) and up to 10 TB of client memory. The IBM z13 Model NE1 is estimated to provide up to 40% more total system capacity than the IBM zEnterprise® EC12 (zEC12) Model HA1. This book provides information about the IBM z13 and its functions, features, and associated software support. Greater detail is offered in areas relevant to technical planning. It is intended for systems engineers, consultants, planners, and anyone who wants to understand the IBM z Systems functions and plan for their usage. It is not intended as an introduction to mainframes.
Readers are expected to be generally familiar with existing IBM z Systems technology and terminology.

The Security Data Lake

2015-04-15 O'Reilly Amazon

book

Raffael Marty

data data-engineering storage-repositories data-lake Analytics Data Analytics

Companies of all sizes are considering data lakes as a way to deal with terabytes of security data that can help them conduct forensic investigations and serve as an early indicator to identify bad or relevant behavior. Many think about replacing their existing SIEM (security information and event management) systems with Hadoop running on commodity hardware. Before your company jumps into the deep end, you first need to weigh several critical factors. This O'Reilly report takes you through technological and design options for implementing a data lake. Each option not only supports your data analytics use cases, but is also accessible by processes, workflows, third-party tools, and teams across your organization. Within this report, you'll explore: five questions to ask before choosing architecture for your backend data store; how data lakes can overcome scalability and data duplication issues; different options for storing context and unstructured log data; data access use cases covering both search and analytical queries via SQL; processes necessary for ingesting data into a data lake, including parsing, enrichment, and aggregation; and four methods for embedding your SIEM into a data lake.

Advanced Analytics with Spark

2015-04-10 O'Reilly Amazon

book

Josh Wills , Sandy Ryza , Sean Owen , Uri Laserson

data data-engineering apache-spark AI/ML Analytics Java

In this practical book, four Cloudera data scientists present a set of self-contained patterns for performing large-scale data analysis with Spark. The authors bring Spark, statistical methods, and real-world data sets together to teach you how to approach analytics problems by example. You’ll start with an introduction to Spark and its ecosystem, and then dive into patterns that apply common techniques—classification, collaborative filtering, and anomaly detection among others—to fields such as genomics, security, and finance. If you have an entry-level understanding of machine learning and statistics, and you program in Java, Python, or Scala, you’ll find these patterns useful for working on your own data applications.

IBM z13 Technical Introduction

2015-03-21 O'Reilly Amazon

book

Hans-Peter Eckam , Frank Packheiser , Ewerson Palacio , Mauricio Andozia Nogueira , André Spahni , Parwez Hamid , Steven LaFalce , Rakesh Krishnakumar , Octavian Lascu , Erik Bakker , Lourenço Luitgards Moura Neto , Giancarlo Rodolfi

data data-engineering IBM Analytics Cloud Computing

This IBM® Redbooks® publication introduces the IBM z13™. IBM z13 delivers a data and transaction system reinvented as a system of insight for digital business. IBM z Systems™ leadership is extended with these features: improved ability to meet service level agreements with new processor chip technology that includes simultaneous multithreading, analytical vector processing, redesigned and larger cache, and enhanced accelerators for hardware compression and cryptography; better availability and more efficient use of critical data with up to 10 TB available redundant array of independent memory (RAIM); validation of transactions, management, and assignment of business priority for SAN devices through updates to the I/O subsystem; and continued management of heterogeneous workloads with IBM z BladeCenter Extension (zBX) Model 004 and IBM z Unified Resource Manager. This Redbooks publication can help you become familiar with the z Systems platform, and understand how the platform can help integrate data, transactions, and insight for faster and more accurate business decisions. This book explains how, with innovations and traditional strengths, IBM z13 can play an essential role in today's IT environments, and satisfy the demands for cloud deployments, analytics, mobile, and social applications in a trustful, reliable, and secure environment with operations that lessen business risk.

Big Data

2015-03-09 O'Reilly Amazon

book

Bernard Marr

data data-engineering Analytics Big Data Data Analytics

Convert the promise of big data into real-world results. There is so much buzz around big data. We all need to know what it is and how it works - that much is obvious. But is a basic understanding of the theory enough to hold your own in strategy meetings? Probably. But what will set you apart from the rest is actually knowing how to USE big data to get solid, real-world business results - and putting that in place to improve performance. Big Data will give you a clear understanding, blueprint, and step-by-step approach to building your own big data strategy. This is a well-needed practical introduction to actually putting the topic into practice. Illustrated with numerous real-world examples from a cross section of companies and organisations, Big Data will take you through the five steps of the SMART model: Start with Strategy, Measure Metrics and Data, Apply Analytics, Report Results, Transform. Discusses how companies need to clearly define what it is they need to know. Outlines how companies can collect relevant data and measure the metrics that will help them answer their most important business questions. Addresses how the results of big data analytics can be visualised and communicated to ensure key decision-makers understand them. Includes many high-profile case studies from the author's work with some of the world's best known brands.

Apache Hive Essentials

2015-02-26 O'Reilly Amazon

book

Dayong Du

data data-engineering Hadoop apache-hive Analytics Big Data

Apache Hive Essentials is the perfect guide for understanding and mastering Hive, the SQL-like big data query language built on top of Hadoop. With this book, you will gain the skills to effectively use Hive to analyze and manage large data sets. Whether you're a developer, data analyst, or just curious about big data, this hands-on guide will enhance your capabilities. What this Book will help me do Understand the core concepts of Hive and its relation to big data and Hadoop. Learn how to set up a Hive environment and integrate it with Hadoop. Master the SQL-like query functionalities of Hive to select, manipulate, and analyze data. Develop custom functions in Hive to extend its functionality for your own specific use cases. Discover best practices for optimizing Hive performance and ensuring data security. Author(s) Dayong Du is an expert in big data analytics with extensive experience in implementing and using tools like Hive in professional settings. Having worked on practical big data solutions, Dayong brings a wealth of knowledge and insights to his writing. His clear, approachable style makes complex topics accessible to readers. Who is it for? This book is ideal for developers, data analysts, and data engineers looking to leverage Hive for big data analysis. If you are familiar with SQL and Hadoop basics and aim to enhance your understanding of Hive, this book is for you. Beginners with some programming background eager to dive into big data technologies will also benefit. It's tailored for learners wanting actionable knowledge to advance their data processing skills.

Apache Flume: Distributed Log Collection for Hadoop - Second Edition

2015-02-25 O'Reilly Amazon

book

Steven Hoffman

data data-engineering log-data Analytics Big Data ELK

"Apache Flume: Distributed Log Collection for Hadoop - Second Edition" is your hands-on guide to learning how to use Apache Flume to reliably collect and move logs and data streams into your Hadoop ecosystem. Through practical examples and real-world scenarios, this book will help you master the setup, configuration, and optimization of Flume for various data ingestion use cases. What this Book will help me do Understand the key concepts and architecture behind Apache Flume to build reliable and scalable data ingestion systems. Set up Flume agents to collect and transfer data into the Hadoop File System (HDFS) or other storage solutions effectively. Learn stream data processing techniques, such as filtering, transforming, and enriching data during transit to improve data usability. Integrate Flume with other tools like Elasticsearch and Solr to enhance analytics and search capabilities. Implement monitoring and troubleshooting workflows to maintain healthy and optimized Flume data pipelines. Author(s) Steven Hoffman, a seasoned software developer and data engineer, brings years of practical experience working with big data technologies to this book. He has a strong background in distributed systems and big data solutions, having implemented enterprise-scale analytics projects. Through clear and approachable writing, he aims to empower readers to successfully deploy reliable data pipelines using Apache Flume. Who is it for? This book is written for Hadoop developers, data engineers, and IT professionals who seek to build robust pipelines for streaming data into Hadoop environments. It is ideal for readers who have a basic understanding of Hadoop and HDFS but are new to Apache Flume. If you are looking to enhance your analytics capabilities by efficiently ingesting, routing, and processing streaming data, this book is for you. Beginners as well as experienced engineers looking to dive deeper into Flume will find it insightful.

Hadoop MapReduce v2 Cookbook - Second Edition

2015-02-25 O'Reilly Amazon

book

Thilina Gunarathne

data data-engineering Hadoop mapreduce Analytics Big Data

Explore insights from vast datasets with "Hadoop MapReduce v2 Cookbook - Second Edition." This book serves as a practical guide for developers and system administrators who aim to master big data processing using Hadoop v2. By engaging with its step-by-step recipes, you will learn to harness the Hadoop MapReduce ecosystem for scalable and efficient data solutions. What this book will help me do: Master the configuration and management of Hadoop YARN, MapReduce v2, and HDFS clusters. Integrate big data tools such as Hive, HBase, Pig, Mahout, and Nutch with Hadoop v2. Develop analytics solutions for large-scale datasets using MapReduce-based applications. Address specific challenges like data classification, recommendations, and text analytics by leveraging Hadoop MapReduce. Deploy and manage big data clusters effectively, including options for cloud environments. Author(s): The authors behind "Hadoop MapReduce v2 Cookbook - Second Edition" combine their deep expertise in big data technology and years of experience working directly with Hadoop. They have helped numerous organizations implement scalable data processing solutions and are passionate about teaching others. Their approach ensures readers gain both foundational knowledge and practical skills. Who is it for? This book is perfect for developers and system administrators who want to learn Hadoop MapReduce v2, including configuring and managing big data clusters. Beginners with basic Java knowledge can follow along to advance their skills in big data processing. It is ideal for those transitioning to Hadoop v2 or requiring practical recipes for immediate application, and for professionals aiming to deepen their expertise in scalable data technologies.

NoSQL For Dummies

2015-02-24 O'Reilly Amazon

book

Adam Fowler

data data-engineering nosql-databases Analytics Big Data Cassandra

Get up to speed on the nuances of NoSQL databases and what they mean for your organization. This easy-to-read guide to NoSQL databases provides the type of no-nonsense overview and analysis that you need to learn, including what NoSQL is and which database is right for you. Featuring specific evaluation criteria for NoSQL databases, along with a look into the pros and cons of the most popular options, NoSQL For Dummies provides the fastest and easiest way to dive into the details of this incredible technology. You'll gain an understanding of how to use NoSQL databases for mission-critical enterprise architectures and projects, and real-world examples reinforce the primary points to create an action-oriented resource for IT pros. If you're planning a big data project or platform, you probably already know you need to select a NoSQL database to complete your architecture. But with options flooding the market and updates and add-ons coming at a rapid pace, determining what you require now, and in the future, can be a tall task. This is where NoSQL For Dummies comes in! Learn the basic tenets of NoSQL databases and why they have come to the forefront as data has outpaced the capabilities of relational databases. Discover major players among NoSQL databases, including Cassandra, MongoDB, MarkLogic, Neo4j, and others. Get an in-depth look at the benefits and disadvantages of the wide variety of NoSQL database options. Explore the needs of your organization as they relate to the capabilities of specific NoSQL databases. Big data and Hadoop get all the attention, but when it comes down to it, NoSQL databases are the engines that power many big data analytics initiatives. With NoSQL For Dummies, you'll go beyond relational databases to ramp up your enterprise's data architecture in no time.

Learning Spark

2015-02-17 O'Reilly Amazon

book

Patrick Wendell , Andy Konwinski , Matei Zaharia , Holden Karau

data data-engineering apache-spark Analytics API Data Analytics

Data in all domains is getting bigger. How can you work with it efficiently? Recently updated for Spark 1.3, this book introduces Apache Spark, the open source cluster computing system that makes data analytics fast to write and fast to run. With Spark, you can tackle big datasets quickly through simple APIs in Python, Java, and Scala. This edition includes new information on Spark SQL, Spark Streaming, setup, and Maven coordinates.

Data: Emerging Trends and Technologies

2015-02-15 O'Reilly Amazon

book

Alistair Croll

data data-engineering AI/ML Analytics Big Data Cloud Computing

What are the emerging trends and technologies that will transform the data landscape in coming months? In this report from Strata + Hadoop World co-chair Alistair Croll, you'll learn how the ubiquity of cheap sensors, fast networks, and distributed computing has given rise to several developments that will soon have a profound effect on individuals and society as a whole. Machine learning, for example, has quickly moved from lab tool to hosted, pay-as-you-go services in the cloud. Those services, in turn, are leading to predictive apps that will provide individuals with the right functionality and content at the right time by continuously learning about them and predicting what they'll need. Computational power can produce cognitive augmentation. Report topics include: the swing between centralized and distributed computing; machine learning as a service; personal digital assistants and cognitive augmentation; graph databases and analytics; regulating complex algorithms; the pace of real-time data and automation; solving dire problems with big data; and the implications of having sensors everywhere. This report contains many more examples of how big data is starting to reshape business and change behavior, and it's just a small sample of the in-depth information Strata + Hadoop World provides. Pick up this report and make plans to attend one of several Strata + Hadoop World conferences in the San Francisco Bay Area, London, and New York.

Big Data Analytics

2015-02-05 O'Reilly Amazon

book

Kim H. Pries , Robert Dunnigan

data data-engineering AI/ML Analytics Big Data Data Analytics

With this book, managers and decision makers are given the tools to make more informed decisions about big data purchasing initiatives. Big Data Analytics: A Practical Guide for Managers not only supplies descriptions of common tools, but also surveys the various products and vendors that supply the big data market. Comparing and contrasting the different types of analysis commonly conducted with big data, this accessible reference presents clear-cut explanations of the general workings of big data tools. Instead of spending time on HOW to install specific packages, it focuses on the reasons WHY readers would install a given package. The book provides authoritative guidance on a range of tools, including open source and proprietary systems. It details the strengths and weaknesses of incorporating big data analysis into decision-making and explains how to leverage the strengths while mitigating the weaknesses. It describes the benefits of distributed computing in simple terms; includes substantial vendor/tool material, especially for open source decisions; covers prominent software packages, including Hadoop and Oracle Endeca; examines GIS and machine learning applications; and considers privacy and surveillance issues. The book further explores basic statistical concepts that, when misapplied, can be the source of errors. Time and again, big data is treated as an oracle that discovers results nobody would have imagined. While big data can serve this valuable function, all too often these results are incorrect, yet are still reported unquestioningly. The probability of having erroneous results increases as a larger number of variables are compared unless preventative measures are taken. The approach taken by the authors is to explain these concepts so managers can ask better questions of their analysts and vendors as to the appropriateness of the methods used to arrive at a conclusion. Because the world of science and medicine has been grappling with similar issues in the publication of studies, the authors draw on those fields' efforts and apply them to big data.

ElasticSearch Cookbook - Second Edition

2015-01-28 O'Reilly Amazon

book

Alberto Paro

data data-engineering search elasticsearch Analytics Big Data

The "ElasticSearch Cookbook - Second Edition" is a hands-on guide featuring over 130 advanced recipes to help you harness the power of ElasticSearch, a leading search and analytics engine. Through insightful examples and practical guidance, you'll learn to implement efficient search solutions, optimize queries, and manage ElasticSearch clusters effectively. What this book will help me do: Design and configure ElasticSearch topologies optimized for your specific deployment needs. Develop and utilize custom mappings to optimize your data indexes. Execute advanced queries and filters to refine and retrieve search results effectively. Set up and monitor ElasticSearch clusters for optimal performance. Extend ElasticSearch capabilities through plugin development and integrations using Java and Python. Author(s): Alberto Paro is a technology expert with years of experience working with ElasticSearch, Big Data solutions, and scalable cloud architecture. He has authored multiple books and technical articles on ElasticSearch, leveraging his extensive knowledge to provide practical insights. His approachable and detail-oriented style makes complex concepts accessible to technical professionals. Who is it for? This book is best suited for software developers and IT professionals looking to use ElasticSearch in their projects. Readers should be familiar with JSON and have basic programming skills in Java. It is ideal for those who have an understanding of search applications and want to deepen their expertise. Whether you're integrating ElasticSearch into a web application or optimizing your system's search capabilities, this book will provide the skills and knowledge you need.
