O'Reilly Data Engineering Books

Apache Spark Graph Processing

2015-09-10 O'Reilly Amazon

book

Rindra Ramamonjison

data data-engineering apache-spark Analytics API Big Data

Dive into the world of large-scale graph data processing with Apache Spark's GraphX API. This book introduces you to the core concepts of graph analytics and teaches you how to leverage Spark for handling and analyzing massive graphs. From building to analyzing, you'll acquire a comprehensive skillset to work with graph data efficiently. What this Book will help me do Learn to utilize Apache Spark GraphX API to process and analyze graph data. Master transforming raw datasets into sophisticated graph structures. Explore visualization and analysis techniques for understanding graphs. Understand and build custom graph operations tailored to your needs. Implement advanced graph algorithms like clustering and iterative processing. Author(s) Rindra Ramamonjison is a seasoned data engineer with vast experience in big data technologies and graph processing. With a passion for explaining complex concepts in simple terms, Rindra builds on his professional expertise to guide readers in mastering cutting-edge Spark tools. Who is it for? This book is tailored for data scientists and software developers looking to delve into graph data processing at scale. Ideal for those with basic knowledge of Scala and Apache Spark, it equips readers with the tools and techniques to derive insights from complex network datasets. Whether you're diving deeper into big data or exploring graph-specific analytics, this book is your guide.

Learning YARN

2015-08-28 O'Reilly Amazon

book

Akhil Arora , Shrey Mehrotra

data data-engineering Hadoop yarn Big Data Spark

"Learning YARN" is your comprehensive guide to master YARN, the resource management layer in the Hadoop ecosystem. Through the book, you'll leverage YARN's capabilities for big data processing, learning to deploy, manage, and scale Hadoop-YARN clusters. What this Book will help me do Understand the main features and benefits of the YARN framework. Gain experience managing Hadoop clusters of varying sizes. Learn to integrate YARN with domain-specific big data tools like Spark. Become skilled at administration and configuration of YARN. Develop and run your own YARN-based applications for distributed computing. Author(s) Akhil Arora and Shrey Mehrotra bring with them years of experience working in big data frameworks and technologies. With expertise in YARN specifically, they aim to bridge the gap for developers and administrators to learn and implement scalable big data solutions. Their extensive knowledge in cluster management and distributed data processing shines through in how this book is structured and detailed. Who is it for? This book is ideal for software developers, big data engineers, and system administrators interested in advancing their knowledge in resource management in Hadoop systems. If you have basic familiarity with Hadoop and need a deeper understanding or feature knowledge of YARN for professional growth, this book is tailored for you. It is also suitable for learners seeking to integrate big data platforms like Spark into YARN clusters.

Structured Search for Big Data

2015-08-26 O'Reilly Amazon

book

Mikhail Gilula

data data-engineering search Big Data Data Modelling DWH

The WWW era made billions of people dramatically dependent on the progress of data technologies, of which Internet search and Big Data are arguably the most notable. The Structured Search paradigm connects them via the fundamental concept of key-objects, which evolve out of keywords as the units of search. The key-object data model and KeySQL revamp the data independence principle, making it applicable to Big Data, and complement NoSQL with full-blown structured querying functionality. The ultimate goal is extracting Big Information from the Big Data. As a Big Data Consultant, Mikhail Gilula combines an academic background with 20 years of industry experience in database and data warehousing technologies, working as a Sr. Data Architect for Teradata, Alcatel-Lucent, and PayPal, among others. He has authored three books, including The Set Model for Database and Information Systems, and holds four US Patents in Structured Search and Data Integration. Conceptualizes structured search as a technology for querying multiple data sources in an independent and scalable manner. Explains how NoSQL and KeySQL complement each other and serve different needs with respect to big data. Shows the place of structured search in the internet evolution and describes its implementations, including the real-time structured internet search.

Pro Couchbase Development: A NoSQL Platform for the Enterprise

2015-08-05 O'Reilly Amazon

book

Deepak Vohra

data data-engineering nosql-databases couchbase Big Data Cassandra

Pro Couchbase Development: A NoSQL Platform for the Enterprise discusses programming for Couchbase using Java and scripting languages, querying and searching, handling migration, and integrating Couchbase with Hadoop, HDFS, and JSON. It also discusses migration from other NoSQL databases like MongoDB. This book is for big data developers who use Couchbase NoSQL database or want to use Couchbase for their web applications as well as for those migrating from other NoSQL databases like MongoDB and Cassandra. For example, a reason to migrate from Cassandra is that it is not based on the JSON document model with support for a flexible schema without having to define columns and supercolumns. The target audience is largely Java developers but the book also supports PHP and Ruby developers who want to learn about Couchbase. The author supplies examples in Java, PHP, Ruby, and JavaScript. After reading and using this hands-on guide for developing with Couchbase, you'll be able to build complex enterprise, database and cloud applications that leverage this powerful platform.

Spark Cookbook

2015-07-27 O'Reilly Amazon

book

Rishi Yadav

data data-engineering apache-spark AI/ML Analytics Big Data

Spark Cookbook is your practical guide to mastering Apache Spark, encompassing a comprehensive set of patterns and examples. Through more than 60 recipes, you will gain actionable insights into using Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX effectively for your big data needs. What this Book will help me do Understand how to install and configure Apache Spark in various environments. Build data pipelines and perform real-time analytics with Spark Streaming. Utilize Spark SQL for interactive data querying and reporting. Apply machine learning workflows using MLlib, including supervised and unsupervised models. Develop optimized big data solutions and integrate them into enterprise platforms. Author(s) Rishi Yadav, the author of Spark Cookbook, is an experienced data engineer and technical expert with deep insights into big data processing frameworks. Yadav has spent years working with Spark and its ecosystem, providing practical guidance to developers and data scientists alike. This book reflects their commitment to sharing actionable knowledge. Who is it for? This book is designed for data engineers, developers, and data scientists who work with big data systems and wish to utilize Apache Spark effectively. Whether you're looking to optimize existing Spark applications or explore its libraries for new use cases, this book will provide the guidance you need. A basic familiarity with big data concepts and programming in languages like Java or Python is recommended to make the most out of this book.

Virtualizing Hadoop: How to Install, Deploy, and Optimize Hadoop in a Virtualized Architecture

2015-07-20 O'Reilly Amazon

book

George J. Trujillo Jr. , Charles Kim , Rommel Garcia , Justin Murray , Steven Jones

data data-engineering Hadoop Big Data Cloud Computing Data Management

Plan and Implement Hadoop Virtualization for Maximum Performance, Scalability, and Business Agility. Enterprises running Hadoop must absorb rapid changes in big data ecosystems, frameworks, products, and workloads. Virtualized approaches can offer important advantages in speed, flexibility, and elasticity. Now, a world-class team of enterprise virtualization and big data experts guides you through the choices, considerations, and tradeoffs surrounding Hadoop virtualization. The authors help you decide whether to virtualize Hadoop, deploy Hadoop in the cloud, or integrate conventional and virtualized approaches in a blended solution. First, Virtualizing Hadoop reviews big data and Hadoop from the standpoint of the virtualization specialist. The authors demystify MapReduce, YARN, and HDFS and guide you through each stage of Hadoop data management. Next, they turn the tables, introducing big data experts to modern virtualization concepts and best practices. Finally, they bring Hadoop and virtualization together, guiding you through the decisions you’ll face in planning, deploying, provisioning, and managing virtualized Hadoop. From security to multitenancy to day-to-day management, you’ll find reliable answers for choosing your best Hadoop strategy and executing it.
Coverage includes the following: • Reviewing the frameworks, products, distributions, use cases, and roles associated with Hadoop • Understanding YARN resource management, HDFS storage, and I/O • Designing data ingestion, movement, and organization for modern enterprise data platforms • Defining SQL engine strategies to meet strict SLAs • Considering security, data isolation, and scheduling for multitenant environments • Deploying Hadoop as a service in the cloud • Reviewing the essential concepts, capabilities, and terminology of virtualization • Applying current best practices, guidelines, and key metrics for Hadoop virtualization • Managing multiple Hadoop frameworks and products as one unified system • Virtualizing master and worker nodes to maximize availability and performance • Installing and configuring Linux for a Hadoop environment

IBM Software Defined Infrastructure for Big Data Analytics Workloads

2015-06-29 O'Reilly Amazon

book

Marcelo Correia Lima , Dino Quintero , Maciej Olejniczak , Daniel de Souza Casali , Istvan Gabor Szabo , Nilton Carlos dos Santos , Tiago Rodrigues de Mello

data data-engineering IBM ibm-power-systems Analytics Big Data

This IBM® Redbooks® publication documents how IBM Platform Computing, with its IBM Platform Symphony® MapReduce framework, IBM Spectrum Scale (based upon IBM GPFS™), IBM Platform LSF®, and the Advanced Service Controller for Platform Symphony work together as an infrastructure to manage not just Hadoop-related offerings, but many popular industry offerings such as Apache Spark, Storm, MongoDB, Cassandra, and so on. It describes the different ways to run Hadoop in a big data environment, and demonstrates how IBM Platform Computing solutions, such as Platform Symphony and Platform LSF with its MapReduce Accelerator, can help improve performance and agility when running Hadoop on distributed workload managers offered by IBM. This information is for technical professionals (consultants, technical support staff, IT architects, and IT specialists) who are responsible for delivering cost-effective cloud services and big data solutions on IBM Power Systems™ to help uncover insights in clients' data so they can optimize product development and business results.

Implementing an IBM InfoSphere BigInsights Cluster using Linux on Power

2015-06-16 O'Reilly Amazon

book

Dino Quintero , Ichsan Mulia Permata , Peter McCullagh , Pablo Barquero Garro , Franz Friedrich Liebinger Portela , Joanna Wong , Peng Jiang , Luis Carlos Cruz Huertas , John Wright , Esteban Arias Navarro , Rodrigo Ceron Ferreira de Castro

data data-engineering IBM infosphere Analytics Big Data

This IBM® Redbooks® publication demonstrates and documents how to implement and manage an IBM PowerLinux™ cluster for big data focusing on hardware management, operating systems provisioning, application provisioning, cluster readiness check, hardware, operating system, IBM InfoSphere® BigInsights™, IBM Platform Symphony®, IBM Spectrum™ Scale (formerly IBM GPFS™), applications monitoring, and performance tuning. This publication shows that IBM PowerLinux clustering solutions (hardware and software) deliver significant value to clients that need cost-effective, highly scalable, and robust solutions for big data and analytics workloads. This book documents how to use IBM Platform Cluster Manager to manage PowerLinux big data clusters through IBM InfoSphere BigInsights, Spectrum Scale, and Platform Symphony. It also documents how to set up and manage a big data cluster on PowerLinux servers to customize application and programming solutions, and to tune applications to use IBM hardware architectures. This document uses the architectural technologies and the software solutions that are available from IBM to help solve challenging technical and business problems. This book is targeted at technical professionals (consultants, technical support staff, IT Architects, and IT Specialists) who are responsible for delivering cost-effective Linux on IBM Power Systems™ solutions that help uncover insights in clients' data so they can act to optimize business results, product development, and scientific discoveries.

Designing and Operating a Data Reservoir

2015-05-26 O'Reilly Amazon

book

Jay Limburn , Mandy Chessell , Nigel L Jones , David Radley , Kevin Shank

data data-engineering Analytics Big Data HTML IBM

Together, big data and analytics have tremendous potential to improve the way we use precious resources, to provide more personalized services, and to protect ourselves from unexpected and ill-intentioned activities. To fully use big data and analytics, an organization needs a system of insight. This is an ecosystem where individuals can locate and access data, and build visualizations and new analytical models that can be deployed into the IT systems to improve the operations of the organization. The data that is most valuable for analytics is also valuable in its own right and typically contains personal and private information about key people in the organization such as customers, employees, and suppliers. Although universal access to data is desirable, safeguards are necessary to protect people's privacy, prevent data leakage, and detect suspicious activity. The data reservoir is a reference architecture that balances the desire for easy access to data with information governance and security. The data reservoir reference architecture describes the technical capabilities necessary for a system of insight, while being independent of specific technologies. Being technology independent is important, because most organizations already have investments in data platforms that they want to incorporate in their solution. In addition, technology is continually improving, and the choice of technology is often dictated by the volume, variety, and velocity of the data being managed. A system of insight needs more than technology to succeed. The data reservoir reference architecture includes descriptions of governance and management processes and definitions to ensure that the human and business systems around the technology support a collaborative, self-service, and safe environment for data use.
The data reservoir reference architecture was first introduced in Governing and Managing Big Data for Analytics and Decision Makers, REDP-5120, which is available at: http://www.redbooks.ibm.com/redpieces/abstracts/redp5120.html. This IBM® Redbooks publication, Designing and Operating a Data Reservoir, builds on that material to provide more detail on the capabilities and internal workings of a data reservoir.

IBM Spectrum Scale (formerly GPFS)

2015-05-26 O'Reilly Amazon

book

Dino Quintero , Carlos Henrique Fachim , Andrei Socoliuc , Willard Davis , Olaf Weiser , Steve Duersch , Puneet Chaudhary , Luis Bolinches

data data-engineering IBM ibm-spectrum-control Analytics Big Data

This IBM® Redbooks® publication updates and complements the previous publication, Implementing the IBM General Parallel File System in a Cross Platform Environment, SG24-7844, which was released with IBM General Parallel File System (GPFS™). Since then, two releases have been made available, up to the latest version, IBM Spectrum™ Scale 4.1. This new IBM Redbooks publication discusses topics such as what is new in Spectrum Scale, Spectrum Scale licensing updates (Express/Standard/Advanced), Spectrum Scale infrastructure support/updates, storage support (IBM and OEM), operating system and platform support, Spectrum Scale global sharing - Active File Management (AFM), and considerations for the integration of Spectrum Scale in IBM Tivoli® Storage Manager (Spectrum Protect) backup solutions. It also covers additional topics such as planning, usability, best practices, monitoring, problem determination, and so on. The main goal of this publication is to bring you up to date with the latest features and capabilities of IBM Spectrum Scale, as the solution has become a key component of the reference architecture for clouds, analytics, mobile, social media, and much more. This publication targets technical professionals (consultants, technical support staff, IT Architects, and IT Specialists) responsible for delivering cost-effective cloud services and big data solutions on IBM Power Systems™, helping to uncover insights in clients' data so they can take actions to optimize business results, product development, and scientific discoveries.

Current State of Big Data Use in Retail Supply Chains

2015-05-06 O'Reilly Amazon

book

CSCMP

data data-engineering Big Data

Innovation, consisting of invention, adoption, and deployment of new technology and associated process improvements, is a key source of competitive advantage. Big Data is an innovation that has been gaining prominence in retailing and other industries. In fact, managers working in retail supply chain member firms (that is, retailers, manufacturers, distributors, wholesalers, logistics providers, and other service providers) have increasingly been trying to understand what Big Data entails, what it may be used for, and how to make it an integral part of their businesses. This report covers Big Data use, with a focus on applications for retail supply chains. The authors’ findings suggest that Big Data use in retail supply chains is still generally elusive. Although most managers have reported initial, and in some cases significant, efforts to analyze large sets of data for decision making, various challenges confine these data to a range of use spanning traditional, transactional data.

Big Data

2015-04-30 O'Reilly Amazon

book

James Warren , Nathan Marz

data data-engineering AI/ML Analytics AWS Lambda Big Data

Big Data teaches you to build big data systems using an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data. It describes a scalable, easy-to-understand approach to big data systems that can be built and run by a small team. Following a realistic example, this book guides readers through the theory of big data systems, how to implement them in practice, and how to deploy and operate them once they're built. About the Technology Web-scale applications like social networks, real-time analytics, or e-commerce sites deal with a lot of data, whose volume and velocity exceed the limits of traditional database systems. These applications require architectures built around clusters of machines to store and process data of any size or speed. Fortunately, scale and simplicity are not mutually exclusive. About the Book Big Data teaches you to build big data systems using an architecture designed specifically to capture and analyze web-scale data. This book presents the Lambda Architecture, a scalable, easy-to-understand approach that can be built and run by a small team. You'll explore the theory of big data systems and how to implement them in practice. In addition to discovering a general framework for processing big data, you'll learn specific technologies like Hadoop, Storm, and NoSQL databases. What's Inside Introduction to big data systems. Real-time processing of web-scale data. Tools like Hadoop, Cassandra, and Storm. Extensions to traditional database skills. About the Reader This book requires no previous exposure to large-scale data analysis or NoSQL tools. Familiarity with traditional databases is helpful. About the Authors Nathan Marz is the creator of Apache Storm and the originator of the Lambda Architecture for big data systems. James Warren is an analytics architect with a background in machine learning and scientific computing.
Quotes "Transcends individual tools or platforms. Required reading for anyone working with big data systems." - Jonathan Esterhazy, Groupon. "A comprehensive, example-driven tour of the Lambda Architecture with its originator as your guide." - Mark Fisher, Pivotal. "Contains wisdom that can only be gathered after tackling many big data projects. A must-read." - Pere Ferrera Bertran, Datasalt. "The de facto guide to streamlining your data pipeline in batch and near-real time." - Alex Holmes, Author of "Hadoop in Practice"

Hadoop Essentials

2015-04-29 O'Reilly Amazon

book

Shiva Achari

data data-engineering Hadoop Analytics Big Data Data Analytics

In 'Hadoop Essentials,' you'll embark on an engaging journey to master the Hadoop ecosystem. This book covers fundamental to advanced topics, from HDFS and MapReduce to real-time analytics with Spark, empowering you to handle modern data challenges efficiently. What this Book will help me do Understand the core components of Hadoop, including HDFS, YARN, and MapReduce, for foundational knowledge. Learn to optimize Big Data architectures and improve application performance. Utilize tools like Hive and Pig for efficient data querying and processing. Master data ingestion technologies like Sqoop and Flume for seamless data management. Achieve fluency in real-time data analytics using modern tools like Apache Spark and Apache Storm. Author(s) Shiva Achari is a seasoned expert in Big Data and distributed systems with in-depth knowledge of the Hadoop ecosystem. With years of experience in both development and teaching, they craft content that bridges practical know-how with theoretical insights in a highly accessible style. Who is it for? This book is perfect for system and application developers aiming to learn practical applications of Hadoop. It suits professionals seeking solutions to real-world Big Data challenges as well as those familiar with distributed systems basics and looking to deepen their expertise in advanced data analysis.

Real-World Hadoop

2015-04-03 O'Reilly Amazon

book

Ellen Friedman , Ted Dunning

data data-engineering Hadoop Big Data Apache HBase NoSQL

If you’re a business team leader, CIO, business analyst, or developer interested in how Apache Hadoop and Apache HBase-related technologies can address problems involving large-scale data in cost-effective ways, this book is for you. Using real-world stories and situations, authors Ted Dunning and Ellen Friedman show Hadoop newcomers and seasoned users alike how NoSQL databases and Hadoop can solve a variety of business and research issues. You’ll learn about early decisions and pre-planning that can make the process easier and more productive. If you’re already using these technologies, you’ll discover ways to gain the full range of benefits possible with Hadoop. While you don’t need a deep technical background to get started, this book does provide expert guidance to help managers, architects, and practitioners succeed with their Hadoop projects. Examine a day in the life of big data: India’s ambitious Aadhaar project. Review tools in the Hadoop ecosystem such as Apache Spark, Storm, and Drill to learn how they can help you. Pick up a collection of technical and strategic tips that have helped others succeed with Hadoop. Learn from several prototypical Hadoop use cases, based on how organizations have actually applied the technology. Explore real-world stories that reveal how MapR customers combine use cases when putting Hadoop and NoSQL to work, including in production. Ted Dunning is Chief Applications Architect at MapR Technologies, and a committer and PMC member of the Apache Drill, Storm, Mahout, and ZooKeeper projects. He is also a mentor for the Apache Datafu, Kylin, Zeppelin, Calcite, and Samoa projects. Ellen Friedman is a solutions consultant, speaker, and author, writing mainly about big data topics. She is a committer for the Apache Mahout project and a contributor to the Apache Drill project.

Storm Applied

2015-03-31 O'Reilly Amazon

book

Matthew Jankowski , Peter Pathirana , Sean Allen

data data-engineering streaming-messaging storm Big Data ELK

Storm Applied is a practical guide to using Apache Storm for the real-world tasks associated with processing and analyzing real-time data streams. This immediately useful book starts by building a solid foundation of Storm essentials so that you learn how to think about designing Storm solutions the right way from day one. But it quickly dives into real-world case studies that will bring the novice up to speed with productionizing Storm. About the Technology It's hard to make sense out of data when it's coming at you fast. Like Hadoop, Storm processes large amounts of data, but it does so reliably and in real time, guaranteeing that every message will be processed. Storm allows you to scale with your data as it grows, making it an excellent platform to solve your big data problems. About the Book Storm Applied is an example-driven guide to processing and analyzing real-time data streams. This immediately useful book starts by teaching you how to design Storm solutions the right way. Then, it quickly dives into real-world case studies that show you how to scale a high-throughput stream processor, ensure smooth operation within a production cluster, and more. Along the way, you'll learn to use Trident for stateful stream processing, along with other tools from the Storm ecosystem. What's Inside Mapping real problems to Storm components. Performance tuning and scaling. Practical troubleshooting and debugging. Exactly-once processing with Trident. About the Reader This book moves through the basics quickly. While prior experience with Storm is not assumed, some experience with big data and real-time systems is helpful. About the Authors Sean Allen, Matthew Jankowski, and Peter Pathirana lead the development team for a high-volume, search-intensive commercial web application at TheLadders.
Quotes "Will no doubt become the definitive practitioner’s guide for Storm users." - From the Foreword by Andrew Montalenti. "The book’s practical approach to Storm will save you a lot of hassle and a lot of time." - Tanguy Leroux, Elasticsearch. "Great introduction to distributed computing with lots of real-world examples." - Shay Elkin, Tangent Logic. "Go beyond the MapReduce way of thinking to solve big data problems." - Muthusamy Manigandan, OzoneMedia

Mastering Apache Cassandra - Second Edition

2015-03-26 O'Reilly Amazon

book

Nishant Neeraj

data data-engineering nosql-databases Cassandra Big Data NoSQL

Mastering Apache Cassandra - Second Edition is your comprehensive guide to understanding and utilizing the power of Cassandra, an efficient and scalable NoSQL database. Throughout this book, you will learn how to design, deploy, and manage Cassandra databases effectively, tailored to your application's needs. What this Book will help me do Understand the architecture of Apache Cassandra and how it ensures scalability and reliability. Learn to build, configure, and deploy a Cassandra database cluster for high performance. Develop skills in monitoring and tuning Cassandra clusters for optimal operation. Gain expertise in managing clusters through scaling, node repair, and backup strategies. Integrate Apache Cassandra with other tools and your application seamlessly. Author(s) Nishant Neeraj is an experienced software developer and database engineer with a focus on delivering high-performance solutions. They have extensive hands-on experience with NoSQL databases, especially Apache Cassandra, and bring their practical insights and in-depth technical knowledge to this book to help readers tackle real-world challenges. Who is it for? This book is ideal for intermediate developers aiming to enhance their expertise in NoSQL databases. If you have a foundational understanding of database concepts and want to bring your skills to a professional level by mastering Apache Cassandra for modern applications, this book is perfect for you. It provides actionable insights and guidance suitable for professionals tackling high concurrency and big data challenges. Whether you are a developer, database administrator, or architect, this book provides a targeted deep dive into Cassandra.

Big Data

2015-03-09 O'Reilly Amazon

book

Bernard Marr

data data-engineering Analytics Big Data Data Analytics

Convert the promise of big data into real-world results. There is so much buzz around big data. We all need to know what it is and how it works - that much is obvious. But is a basic understanding of the theory enough to hold your own in strategy meetings? Probably. But what will set you apart from the rest is actually knowing how to USE big data to get solid, real-world business results - and putting that in place to improve performance. Big Data will give you a clear understanding, blueprint, and step-by-step approach to building your own big data strategy. This is a much-needed practical introduction to actually putting the topic into practice. Illustrated with numerous real-world examples from a cross section of companies and organisations, Big Data will take you through the five steps of the SMART model: Start with Strategy, Measure Metrics and Data, Apply Analytics, Report Results, Transform. Discusses how companies need to clearly define what it is they need to know. Outlines how companies can collect relevant data and measure the metrics that will help them answer their most important business questions. Addresses how the results of big data analytics can be visualised and communicated to ensure key decision-makers understand them. Includes many high-profile case studies from the author's work with some of the world's best known brands.

Big Data Revolution

2015-03-02 O'Reilly Amazon

book

Patrick McSharry , Rob Thomas

data data-engineering Big Data IBM

Exploit the power and potential of Big Data to revolutionize business outcomes. Big Data Revolution is a guide to improving performance, making better decisions, and transforming business through the effective use of Big Data. This collaborative work by an IBM Vice President of Big Data Products and an Oxford Research Fellow presents inside stories that demonstrate the power and potential of Big Data within the business realm. Readers are guided through tried-and-true methodologies for getting more out of data, and using it to the utmost advantage. This book describes the major trends emerging in the field, the pitfalls and triumphs being experienced, and the many considerations surrounding Big Data, all while guiding readers toward better decision making from the perspective of a data scientist. Companies are generating data faster than ever before, and managing that data has become a major challenge. With the right strategy, Big Data can be a powerful tool for creating effective business solutions, but deep understanding is key when applying it to individual business needs. Big Data Revolution provides the insight executives need to incorporate Big Data into a better business strategy, improving outcomes with innovation and efficient use of technology. Examine the major emerging patterns in Big Data. Consider the debate surrounding the ethical use of data. Recognize patterns and improve personal and organizational performance. Make more informed decisions with quantifiable results. In an information society, it is becoming increasingly important to make sense of data in an economically viable way. It can drive new revenue streams and give companies a competitive advantage, providing a way forward for businesses navigating an increasingly complex marketplace. Big Data Revolution provides expert insight on the tool that can revolutionize industries.

Field Guide to Hadoop

2015-03-02 O'Reilly Amazon

book

Marshall Presser , Kevin Sitto

data data-engineering Hadoop Avro Big Data Cassandra

If your organization is about to enter the world of big data, you not only need to decide whether Apache Hadoop is the right platform to use, but also which of its many components are best suited to your task. This field guide makes the exercise manageable by breaking down the Hadoop ecosystem into short, digestible sections. You’ll quickly understand how Hadoop’s projects, subprojects, and related technologies work together. Each chapter introduces a different topic, such as core technologies or data transfer, and explains why certain components may or may not be useful for particular needs. When it comes to data, Hadoop is a whole new ballgame, but with this handy reference, you’ll have a good grasp of the playing field. Topics include: core technologies (Hadoop Distributed File System [HDFS], MapReduce, YARN, and Spark); database and data management (Cassandra, HBase, MongoDB, and Hive); serialization (Avro, JSON, and Parquet); management and monitoring (Puppet, Chef, ZooKeeper, and Oozie); analytic helpers (Pig, Mahout, and MLlib); data transfer (Sqoop, Flume, distcp, and Storm); security, access control, and auditing (Sentry, Kerberos, and Knox); and cloud computing and virtualization (Serengeti, Docker, and Whirr).

Apache Hive Essentials

2015-02-26 O'Reilly Amazon

book

Dayong Du

data data-engineering Hadoop apache-hive Analytics Big Data

Apache Hive Essentials is the perfect guide for understanding and mastering Hive, the SQL-like big data query language built on top of Hadoop. With this book, you will gain the skills to effectively use Hive to analyze and manage large data sets. Whether you're a developer, data analyst, or just curious about big data, this hands-on guide will enhance your capabilities. What this Book will help me do Understand the core concepts of Hive and its relation to big data and Hadoop. Learn how to set up a Hive environment and integrate it with Hadoop. Master the SQL-like query functionalities of Hive to select, manipulate, and analyze data. Develop custom functions in Hive to extend its functionality for your own specific use cases. Discover best practices for optimizing Hive performance and ensuring data security. Author(s) Dayong Du is an expert in big data analytics with extensive experience in implementing and using tools like Hive in professional settings. Having worked on practical big data solutions, Dayong brings a wealth of knowledge and insights to his writing. His clear, approachable style makes complex topics accessible to readers. Who is it for? This book is ideal for developers, data analysts, and data engineers looking to leverage Hive for big data analysis. If you are familiar with SQL and Hadoop basics and aim to enhance your understanding of Hive, this book is for you. Beginners with some programming background eager to dive into big data technologies will also benefit. It's tailored for learners wanting actionable knowledge to advance their data processing skills.

Apache Flume: Distributed Log Collection for Hadoop - Second Edition

2015-02-25 O'Reilly Amazon

book

Steven Hoffman

data data-engineering log-data Analytics Big Data ELK

"Apache Flume: Distributed Log Collection for Hadoop - Second Edition" is your hands-on guide to learning how to use Apache Flume to reliably collect and move logs and data streams into your Hadoop ecosystem. Through practical examples and real-world scenarios, this book will help you master the setup, configuration, and optimization of Flume for various data ingestion use cases. What this Book will help me do Understand the key concepts and architecture behind Apache Flume to build reliable and scalable data ingestion systems. Set up Flume agents to collect and transfer data into the Hadoop File System (HDFS) or other storage solutions effectively. Learn stream data processing techniques, such as filtering, transforming, and enriching data during transit to improve data usability. Integrate Flume with other tools like Elasticsearch and Solr to enhance analytics and search capabilities. Implement monitoring and troubleshooting workflows to maintain healthy and optimized Flume data pipelines. Author(s) Steven Hoffman, a seasoned software developer and data engineer, brings years of practical experience working with big data technologies to this book. He has a strong background in distributed systems and big data solutions, having implemented enterprise-scale analytics projects. Through clear and approachable writing, he aims to empower readers to successfully deploy reliable data pipelines using Apache Flume. Who is it for? This book is written for Hadoop developers, data engineers, and IT professionals who seek to build robust pipelines for streaming data into Hadoop environments. It is ideal for readers who have a basic understanding of Hadoop and HDFS but are new to Apache Flume. If you are looking to enhance your analytics capabilities by efficiently ingesting, routing, and processing streaming data, this book is for you. Beginners as well as experienced engineers looking to dive deeper into Flume will find it insightful.

Hadoop MapReduce v2 Cookbook - Second Edition

2015-02-25 O'Reilly Amazon

book

Thilina Gunarathne

data data-engineering Hadoop mapreduce Analytics Big Data

Explore insights from vast datasets with "Hadoop MapReduce v2 Cookbook - Second Edition." This book serves as a practical guide for developers and system administrators who aim to master big data processing using Hadoop v2. By engaging with its step-by-step recipes, you will learn to harness the Hadoop MapReduce ecosystem for scalable and efficient data solutions. What this Book will help me do Master the configuration and management of Hadoop YARN, MapReduce v2, and HDFS clusters. Integrate big data tools such as Hive, HBase, Pig, Mahout, and Nutch with Hadoop v2. Develop analytics solutions for large-scale datasets using MapReduce-based applications. Address specific challenges like data classification, recommendations, and text analytics leveraging Hadoop MapReduce. Deploy and manage big data clusters effectively, including options for cloud environments. Author(s) The authors behind "Hadoop MapReduce v2 Cookbook - Second Edition" combine their deep expertise in big data technology and years of experience working directly with Hadoop. They have helped numerous organizations implement scalable data processing solutions and are passionate about teaching others. Their approach ensures readers gain both foundational knowledge and practical skills. Who is it for? This book is perfect for developers and system administrators who want to learn Hadoop MapReduce v2, including configuring and managing big data clusters. Beginners with basic Java knowledge can follow along to advance their skills in big data processing. Ideal for those transitioning to Hadoop v2 or requiring practical recipes for immediate application. Great for professionals aiming to deepen their expertise in scalable data technologies.

NoSQL For Dummies

2015-02-24 O'Reilly Amazon

book

Adam Fowler

data data-engineering nosql-databases Analytics Big Data Cassandra

Get up to speed on the nuances of NoSQL databases and what they mean for your organization. This easy-to-read guide to NoSQL databases provides the type of no-nonsense overview and analysis that you need to learn, including what NoSQL is and which database is right for you. Featuring specific evaluation criteria for NoSQL databases, along with a look into the pros and cons of the most popular options, NoSQL For Dummies provides the fastest and easiest way to dive into the details of this technology. You'll gain an understanding of how to use NoSQL databases for mission-critical enterprise architectures and projects, and real-world examples reinforce the primary points to create an action-oriented resource for IT pros. If you're planning a big data project or platform, you probably already know you need to select a NoSQL database to complete your architecture. But with options flooding the market and updates and add-ons coming at a rapid pace, determining what you require now, and in the future, can be a tall task. This is where NoSQL For Dummies comes in! Learn the basic tenets of NoSQL databases and why they have come to the forefront as data has outpaced the capabilities of relational databases. Discover major players among NoSQL databases, including Cassandra, MongoDB, MarkLogic, Neo4j, and others. Get an in-depth look at the benefits and disadvantages of the wide variety of NoSQL database options. Explore the needs of your organization as they relate to the capabilities of specific NoSQL databases. Big data and Hadoop get all the attention, but when it comes down to it, NoSQL databases are the engines that power many big data analytics initiatives. With NoSQL For Dummies, you'll go beyond relational databases to ramp up your enterprise's data architecture in no time.

Data: Emerging Trends and Technologies

2015-02-15 O'Reilly Amazon

book

Alistair Croll

data data-engineering AI/ML Analytics Big Data Cloud Computing

What are the emerging trends and technologies that will transform the data landscape in coming months? In this report from Strata + Hadoop World co-chair Alistair Croll, you'll learn how the ubiquity of cheap sensors, fast networks, and distributed computing has given rise to several developments that will soon have a profound effect on individuals and society as a whole. Machine learning, for example, has quickly moved from lab tool to hosted, pay-as-you-go services in the cloud. Those services, in turn, are leading to predictive apps that will provide individuals with the right functionality and content at the right time by continuously learning about them and predicting what they'll need. Computational power can produce cognitive augmentation. Report topics include: the swing between centralized and distributed computing; machine learning as a service; personal digital assistants and cognitive augmentation; graph databases and analytics; regulating complex algorithms; the pace of real-time data and automation; solving dire problems with big data; and the implications of having sensors everywhere. This report contains many more examples of how big data is starting to reshape business and change behavior, and it's just a small sample of the in-depth information Strata + Hadoop World provides. Pick up this report and make plans to attend one of several Strata + Hadoop World conferences in the San Francisco Bay Area, London, and New York.

Learning Hadoop 2

2015-02-13 O'Reilly Amazon

book

Gabriele Modena

data data-engineering Hadoop Big Data Cloud Computing Java

Delve into the world of big data with 'Learning Hadoop 2', a comprehensive guide to leveraging the capabilities of Hadoop 2 for data processing and analysis. In this book, you will explore the tools and frameworks that integrate with Hadoop, discovering the best ways to design and deploy effective workflows for managing and analyzing large datasets. What this Book will help me do Understand the fundamentals of the MapReduce framework and its applications. Utilize advanced tools such as Samza and Spark for real-time and iterative data processing. Manage large datasets with data mining techniques tailored for Hadoop environments. Deploy Hadoop applications across various infrastructures, including local clusters and cloud services. Create and orchestrate sophisticated data workflows and pipelines with Apache Pig and Oozie. Author(s) Gabriele Modena is an experienced developer and trained data specialist with a keen focus on distributed data processing frameworks. Having worked extensively with big data platforms, Gabriele brings practical insights and a hands-on perspective to technical subjects. His writing is concise and engaging, aiming to render complex concepts accessible. Who is it for? This book is ideal for system and application developers eager to learn practical implementations of the Hadoop framework. Readers should be familiar with the Unix/Linux command-line interface and Java programming. Prior experience with Hadoop will be advantageous, but not necessary.
