talk-data.com

Event

O'Reilly Data Engineering Books

2001-10-19 – 2027-05-25 · O'Reilly

Activities tracked

240

Collection of O'Reilly books on Data Engineering.

Filtering by: AI/ML

Sessions & talks

Showing 226–240 of 240 · Newest first

Data: Emerging Trends and Technologies

What are the emerging trends and technologies that will transform the data landscape in the coming months? In this report from Strata + Hadoop World co-chair Alistair Croll, you'll learn how the ubiquity of cheap sensors, fast networks, and distributed computing has given rise to several developments that will soon have a profound effect on individuals and society as a whole. Machine learning, for example, has quickly moved from lab tool to hosted, pay-as-you-go services in the cloud. Those services, in turn, are leading to predictive apps that will provide individuals with the right functionality and content at the right time by continuously learning about them and predicting what they'll need. Computational power can produce cognitive augmentation. Report topics include:

- The swing between centralized and distributed computing
- Machine learning as a service
- Personal digital assistants and cognitive augmentation
- Graph databases and analytics
- Regulating complex algorithms
- The pace of real-time data and automation
- Solving dire problems with big data
- Implications of having sensors everywhere

This report contains many more examples of how big data is starting to reshape business and change behavior, and it's just a small sample of the in-depth information Strata + Hadoop World provides. Pick up this report and make plans to attend one of several Strata + Hadoop World conferences in the San Francisco Bay Area, London, and New York.

Big Data Analytics

With this book, managers and decision makers are given the tools to make more informed decisions about big data purchasing initiatives. Big Data Analytics: A Practical Guide for Managers not only supplies descriptions of common tools, but also surveys the various products and vendors that supply the big data market. Comparing and contrasting the different types of analysis commonly conducted with big data, this accessible reference presents clear-cut explanations of the general workings of big data tools. Instead of spending time on HOW to install specific packages, it focuses on the reasons WHY readers would install a given package. The book provides authoritative guidance on a range of tools, including open source and proprietary systems. It details the strengths and weaknesses of incorporating big data analysis into decision-making and explains how to leverage the strengths while mitigating the weaknesses. The book:

- Describes the benefits of distributed computing in simple terms
- Includes substantial vendor/tool material, especially for open source decisions
- Covers prominent software packages, including Hadoop and Oracle Endeca
- Examines GIS and machine learning applications
- Considers privacy and surveillance issues

The book further explores basic statistical concepts that, when misapplied, can be the source of errors. Time and again, big data is treated as an oracle that discovers results nobody would have imagined. While big data can serve this valuable function, all too often these results are incorrect, yet are still reported unquestioningly. The probability of erroneous results increases as a larger number of variables is compared, unless preventative measures are taken. The approach taken by the authors is to explain these concepts so managers can ask better questions of their analysts and vendors about the appropriateness of the methods used to arrive at a conclusion. Because the world of science and medicine has been grappling with similar issues in the publication of studies, the authors draw on those efforts and apply them to big data.
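
A short worked example makes the multiple-comparisons concern above concrete (this sketch is illustrative and not drawn from the book): at a fixed per-test significance level, the chance of at least one spurious "significant" result grows rapidly with the number of variables compared.

    # Illustrative sketch (not from the book): probability of at least one
    # false positive when m independent variables are each tested at level alpha.
    alpha = 0.05

    for m in (1, 10, 20, 100):
        p_any_false_positive = 1 - (1 - alpha) ** m
        print(f"{m:3d} comparisons -> P(>=1 false positive) = {p_any_false_positive:.2f}")

    # Prints roughly 0.05, 0.40, 0.64, 0.99. A common preventative measure is the
    # Bonferroni correction, which tests each variable at alpha / m instead of alpha.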

Big Data Now: 2014 Edition

In the four years that O'Reilly Media, Inc. has produced its annual Big Data Now report, the data field has grown from infancy into young adulthood. Data is now a leader in some fields and a driver of innovation in others, and companies that use data and analytics to drive decision-making are outperforming their peers. And while access to big data tools and techniques once required significant expertise, today many tools have improved and communities have formed to share best practices. Companies have also started to emphasize the importance of processes, culture, and people. The topics in Big Data Now: 2014 Edition represent the major forces currently shaping the data world:

- Cognitive augmentation: predictive APIs, graph analytics, and Network Science dashboards
- Intelligence matters: defining AI, modeling intelligence, deep learning, and "summoning the demon"
- Cheap sensors, fast networks, and distributed computing: stream processing, hardware data flows, and computing at the edge
- Data (science) pipelines: broadening the coverage of analytic pipelines with specialized tools
- Evolving marketplace of big data components: SSDs, Hadoop 2, Spark; and why datacenters need operating systems
- Design and social science: human-centered design, wearables and real-time communications, and wearable etiquette
- Building a data culture: moving from prediction to real-time adaptation; and why you need to become a data skeptic
- Perils of big data: data redlining, intrusive data analysis, and the state of big data ethics

Hadoop in Practice, Second Edition

Hadoop in Practice, Second Edition provides over 100 tested, instantly useful techniques that will help you conquer big data using Hadoop. This revised new edition covers changes and new features in the Hadoop core architecture, including MapReduce 2. Brand new chapters cover YARN and integrating Kafka, Impala, and Spark SQL with Hadoop. You'll also get new and updated techniques for Flume, Sqoop, and Mahout, all of which have seen major new versions recently. In short, this is the most practical, up-to-date coverage of Hadoop available anywhere.

About the Book: It's always a good time to upgrade your Hadoop skills! Hadoop in Practice, Second Edition provides a collection of 104 tested, instantly useful techniques for analyzing real-time streams, moving data securely, machine learning, managing large-scale clusters, and taming big data using Hadoop. This completely revised edition covers changes and new features in Hadoop core, including MapReduce 2 and YARN. You'll pick up hands-on best practices for integrating Spark, Kafka, and Impala with Hadoop, and get new and updated techniques for the latest versions of Flume, Sqoop, and Mahout. In short, this is the most practical, up-to-date coverage of Hadoop available.

About the Reader: Readers need to know a programming language like Java and have basic familiarity with Hadoop.

What's Inside:
- Thoroughly updated for Hadoop 2
- How to write YARN applications
- Integrate real-time technologies like Storm, Impala, and Spark
- Predictive analytics using Mahout and R

About the Author: Alex Holmes works on tough big-data problems. He is a software engineer, author, speaker, and blogger specializing in large-scale Hadoop projects.

Quotes:
- "Very insightful. A deep dive into the Hadoop world." - Andrea Tarocchi, Red Hat, Inc.
- "The most complete material on Hadoop and its ecosystem known to mankind!" - Arthur Zubarev, Vital Insights
- "Clear and concise, full of insights and highly applicable information." - Edward de Oliveira Ribeiro, DataStax, Inc.
- "Comprehensive up-to-date coverage of Hadoop 2." - Muthusamy Manigandan, OzoneMedia

Data Classification

Research on the problem of classification tends to be fragmented across such areas as pattern recognition, databases, data mining, and machine learning. Addressing the work of these different communities in a unified way, this book explores the underlying algorithms of classification as well as applications of classification in a variety of problem domains, including text, multimedia, social network, and biological data. It presents core methods in data classification, covers recent problem domains, and discusses advanced methods for enhancing the quality of the underlying classification results.
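
To ground the idea of a core classification method in something concrete, here is a minimal nearest-neighbor classifier, written as an independent illustration rather than code from the book: a new point simply takes the label of its closest training example.

    # Minimal 1-nearest-neighbor classifier (illustrative; not from the book).
    import math

    def nn_classify(train, point):
        """train: list of (feature_tuple, label) pairs; returns the label of the closest example."""
        def dist(a, b):
            return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
        return min(train, key=lambda item: dist(item[0], point))[1]

    # Toy two-dimensional training data.
    train = [((1.0, 1.0), "spam"), ((0.9, 1.2), "spam"), ((5.0, 5.1), "ham")]
    print(nn_classify(train, (1.1, 0.9)))  # -> spam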

Large Scale and Big Data

Large Scale and Big Data: Processing and Management provides readers with a central source of reference on the data management techniques currently available for large-scale data processing. Presenting chapters written by leading researchers, academics, and practitioners, it addresses the fundamental challenges associated with Big Data processing tools and techniques across a range of computing environments. The book begins by discussing the basic concepts and tools of large-scale Big Data processing and cloud computing. It also provides an overview of different programming models and cloud-based deployment models. The book’s second section examines the usage of advanced Big Data processing techniques in different domains, including semantic web, graph processing, and stream processing. The third section discusses advanced topics of Big Data processing such as consistency management, privacy, and security. Supplying a comprehensive summary from both the research and applied perspectives, the book covers recent research discoveries and applications, making it an ideal reference for a wide range of audiences, including researchers and academics working on databases, data mining, and web scale data processing. After reading this book, you will gain a fundamental understanding of how to use Big Data-processing tools and techniques effectively across application domains. Coverage includes cloud data management architectures, big data analytics visualization, data management, analytics for vast amounts of unstructured data, clustering, classification, link analysis of big data, scalable data mining, and machine learning techniques.

Big Data Analytics Beyond Hadoop: Real-Time Applications with Storm, Spark, and More Hadoop Alternatives

Master alternative Big Data technologies that can do what Hadoop can't: real-time analytics and iterative machine learning. When most technical professionals think of Big Data analytics today, they think of Hadoop. But there are many cutting-edge applications that Hadoop isn't well suited for, especially real-time analytics and contexts requiring the use of iterative machine learning algorithms. Fortunately, several powerful new technologies have been developed specifically for use cases such as these. Big Data Analytics Beyond Hadoop is the first guide specifically designed to help you take the next steps beyond Hadoop. Dr. Vijay Srinivas Agneeswaran introduces the breakthrough Berkeley Data Analysis Stack (BDAS) in detail, including its motivation, design, architecture, Mesos cluster management, performance, and more. He presents realistic use cases and up-to-date example code for:

- Spark, the next-generation in-memory computing technology from UC Berkeley
- Storm, the parallel real-time Big Data analytics technology from Twitter
- GraphLab, the next-generation graph processing paradigm from CMU and the University of Washington (with comparisons to alternatives such as Pregel and Piccolo)

He also offers architectural and design guidance and code sketches for scaling machine learning algorithms to Big Data, and then realizing them in real time. He concludes by previewing emerging trends, including real-time video analytics, SDNs, and even Big Data governance, security, and privacy issues. He identifies intriguing startups and new research possibilities, including BDAS extensions and cutting-edge model-driven analytics. Big Data Analytics Beyond Hadoop is an indispensable resource for everyone who wants to reach the cutting edge of Big Data analytics, and stay there: practitioners, architects, programmers, data scientists, researchers, startup entrepreneurs, and advanced students.
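
To make the contrast with disk-bound, batch-only MapReduce concrete, the following is a minimal sketch of the iterative, in-memory style of computation Spark enables. It is a toy example under the classic RDD API, not code from the book: a cached RDD is reused across gradient-descent iterations instead of being re-read from storage on every pass.

    # Illustrative sketch (not from the book): iterative gradient descent over a
    # cached Spark RDD, so each pass reuses in-memory partitions instead of
    # re-reading the input. Assumes a local Spark installation.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "iterative-sketch")

    # Toy data: (x, y) pairs with y = 3 * x; we recover the slope w.
    points = sc.parallelize([(float(x), 3.0 * float(x)) for x in range(1, 1001)]).cache()
    n = points.count()

    w = 0.0
    for _ in range(20):
        grad = points.map(lambda p: (w * p[0] - p[1]) * p[0]).sum() / n
        w -= 0.000002 * grad       # small, fixed step size keeps this toy example stable

    print("fitted slope:", w)      # approaches 3.0
    sc.stop()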

IBM InfoSphere Streams: Accelerating Deployments with Analytic Accelerators

This IBM® Redbooks® publication describes visual development, visualization, adapters, analytics, and accelerators for IBM InfoSphere® Streams (V3), a key component of the IBM Big Data platform. Streams was designed to analyze data in motion, and can perform analysis on incredibly high volumes of data arriving at high velocity, using a wide variety of analytic functions and data types. The Visual Development environment extends Streams Studio with drag-and-drop development, provides round-tripping with existing text editors, and is ideal for rapid prototyping. Adapters facilitate getting data in and out of Streams, and V3 supports WebSphere MQ, the Apache Hadoop Distributed File System, and IBM InfoSphere DataStage. Significant analytics include the native Streams Processing Language, SPSS Modeler analytics, Complex Event Processing, the TimeSeries Toolkit for machine learning and predictive analytics, the Geospatial Toolkit for location-based applications, and the Annotation Query Language for natural language processing applications. The Accelerators for Social Media Analysis and Telecommunications Event Data Analysis are sample programs that can be modified to build production-level applications. Want to learn how to analyze high volumes of streaming data or implement systems requiring high performance across nodes in a cluster? Then this book is for you. Please note that the additional material referenced in the text is not available from IBM.

Data Just Right: Introduction to Large-Scale Data & Analytics

Making Big Data Work: Real-World Use Cases and Examples, Practical Code, Detailed Solutions. Large-scale data analysis is now vitally important to virtually every business. Mobile and social technologies are generating massive datasets; distributed cloud computing offers the resources to store and analyze them; and professionals have radically new technologies at their command, including NoSQL databases. Until now, however, most books on "Big Data" have been little more than business polemics or product catalogs. Data Just Right is different: it's a completely practical and indispensable guide for every Big Data decision-maker, implementer, and strategist. Michael Manoochehri, a former Google engineer and data hacker, writes for professionals who need practical solutions that can be implemented with limited resources and time. Drawing on his extensive experience, he helps you focus on building applications, rather than infrastructure, because that's where you can derive the most value. Manoochehri shows how to address each of today's key Big Data use cases in a cost-effective way by combining technologies in hybrid solutions. You'll find expert approaches to managing massive datasets, visualizing data, building data pipelines and dashboards, choosing tools for statistical analysis, and more. Throughout, the author demonstrates techniques using many of today's leading data analysis tools, including Hadoop, Hive, Shark, R, Apache Pig, Mahout, and Google BigQuery.

Coverage includes:
- Mastering the four guiding principles of Big Data success, and avoiding common pitfalls
- Emphasizing collaboration and avoiding problems with siloed data
- Hosting and sharing multi-terabyte datasets efficiently and economically
- "Building for infinity" to support rapid growth
- Developing a NoSQL Web app with Redis to collect crowd-sourced data
- Running distributed queries over massive datasets with Hadoop, Hive, and Shark
- Building a data dashboard with Google BigQuery
- Exploring large datasets with advanced visualization
- Implementing efficient pipelines for transforming immense amounts of data
- Automating complex processing with Apache Pig and the Cascading Java library
- Applying machine learning to classify, recommend, and predict incoming information
- Using R to perform statistical analysis on massive datasets
- Building highly efficient analytics workflows with Python and Pandas
- Establishing sensible purchasing strategies: when to build, buy, or outsource
- Previewing emerging trends and convergences in scalable data technologies and the evolving role of the Data Scientist
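
As one small illustration of the "analytics workflows with Python and Pandas" bullet above (the file name and columns are hypothetical, and this is not code from the book), a typical workflow loads a file, groups, and aggregates:

    # Minimal pandas workflow sketch (illustrative; not from the book).
    # Assumes a hypothetical events.csv with columns: user_id, country, revenue.
    import pandas as pd

    df = pd.read_csv("events.csv")

    summary = (
        df.groupby("country")
          .agg(total_revenue=("revenue", "sum"), users=("user_id", "nunique"))
          .sort_values("total_revenue", ascending=False)
    )
    print(summary.head(10))   # top ten countries by revenue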

Programming Elastic MapReduce

Although you don't need a large computing infrastructure to process massive amounts of data with Apache Hadoop, it can still be difficult to get started. This practical guide shows you how to quickly launch data analysis projects in the cloud by using Amazon Elastic MapReduce (EMR), the hosted Hadoop framework in Amazon Web Services (AWS). Authors Kevin Schmidt and Christopher Phillips demonstrate best practices for using EMR and various AWS and Apache technologies by walking you through the construction of a sample MapReduce log analysis application. Using code samples and example configurations, you'll learn how to assemble the building blocks necessary to solve your biggest data analysis problems.

- Get an overview of the AWS and Apache software tools used in large-scale data analysis
- Go through the process of executing a Job Flow with a simple log analyzer
- Discover useful MapReduce patterns for filtering and analyzing data sets
- Use Apache Hive and Pig instead of Java to build a MapReduce Job Flow
- Learn the basics for using Amazon EMR to run machine learning algorithms
- Develop a project cost model for using Amazon EMR and other AWS tools
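
The book walks through building its own log analysis application on EMR; as a rough, independent sketch of the shape such a job takes (not the authors' code), a Hadoop Streaming step can count requests per HTTP status code with a Python mapper and reducer:

    # mapper.py - illustrative Hadoop Streaming mapper (not the book's code).
    # Reads web-server log lines on stdin and emits "status_code\t1" pairs,
    # assuming Common Log Format, where the status code is the ninth field.
    import sys

    for line in sys.stdin:
        fields = line.split()
        if len(fields) > 8:
            print(f"{fields[8]}\t1")

    # reducer.py - illustrative Hadoop Streaming reducer (not the book's code).
    # Input arrives sorted by key, so counts accumulate per status code.
    import sys

    current_key, count = None, 0
    for line in sys.stdin:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current_key and current_key is not None:
            print(f"{current_key}\t{count}")
            count = 0
        current_key = key
        count += int(value)
    if current_key is not None:
        print(f"{current_key}\t{count}")

On EMR, two scripts like these would be supplied to a streaming step; the file names and the Common Log Format assumption here are illustrative only.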

Big Data Glossary

To help you navigate the large number of new data tools available, this guide describes 60 of the most recent innovations, from NoSQL databases and MapReduce approaches to machine learning and visualization tools. Descriptions are based on first-hand experience with these tools in a production environment. This handy glossary also includes a chapter of key terms that help define many of these tool categories:

- NoSQL Databases: document-oriented databases using a key/value interface rather than SQL
- MapReduce: tools that support distributed computing on large datasets
- Storage: technologies for storing data in a distributed way
- Servers: ways to rent computing power on remote machines
- Processing: tools for extracting valuable information from large datasets
- Natural Language Processing: methods for extracting information from human-created text
- Machine Learning: tools that automatically perform data analyses, based on results of a one-off analysis
- Visualization: applications that present meaningful data graphically
- Acquisition: techniques for cleaning up messy public data sources
- Serialization: methods to convert data structure or object state into a storable format

Heuristic Search

Search has been vital to artificial intelligence from the very beginning as a core technique in problem solving. The authors present a thorough overview of heuristic search with a balance of discussion between theoretical analysis and efficient implementation and application to real-world problems. Current developments in search, such as pattern databases and search with efficient use of external memory and parallel processing units on main boards and graphics cards, are detailed. Heuristic search as a problem-solving tool is demonstrated in applications for puzzle solving, game playing, constraint satisfaction, and machine learning. While no previous familiarity with heuristic search is necessary, the reader should have a basic knowledge of algorithms, data structures, and calculus. Real-world case studies and chapter-ending exercises help to create a full and realized picture of how search fits into the world of artificial intelligence and the one around us.

- Provides real-world success stories and case studies for heuristic search algorithms
- Includes many AI developments not yet covered in textbooks, such as pattern databases, symbolic search, and parallel processing units
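
As a minimal, self-contained illustration of the kind of algorithm the book analyzes (a sketch of the standard technique, not the authors' code), A* expands nodes in order of path cost so far plus an admissible heuristic estimate of the remaining cost:

    # Minimal A* search sketch on a 4-connected grid (illustrative; not from the book).
    import heapq

    def a_star(start, goal, walls, width, height):
        """Return the shortest path length from start to goal, or None if unreachable."""
        def h(p):                                   # Manhattan distance heuristic (admissible)
            return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

        frontier = [(h(start), 0, start)]           # entries are (f = g + h, g, node)
        best_g = {start: 0}
        while frontier:
            f, g, node = heapq.heappop(frontier)
            if node == goal:
                return g
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nxt = (node[0] + dx, node[1] + dy)
                if not (0 <= nxt[0] < width and 0 <= nxt[1] < height) or nxt in walls:
                    continue
                ng = g + 1
                if ng < best_g.get(nxt, float("inf")):
                    best_g[nxt] = ng
                    heapq.heappush(frontier, (ng + h(nxt), ng, nxt))
        return None

    print(a_star((0, 0), (3, 3), walls={(1, 1), (2, 1)}, width=4, height=4))  # -> 6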

Information Assurance

In today's fast-paced, infocentric environment, professionals increasingly rely on networked information technology to do business. Unfortunately, with the advent of such technology came new and complex problems that continue to threaten the availability, integrity, and confidentiality of our electronic information. It is therefore absolutely imperative to take measures to protect and defend information systems by ensuring their security and non-repudiation. Information Assurance skillfully addresses this issue by detailing the sufficient capacity networked systems need to operate while under attack, and itemizing failsafe design features such as alarms, restoration protocols, and management configurations to detect problems and automatically diagnose and respond. Moreover, this volume is unique in providing comprehensive coverage of both state-of-the-art survivability and security techniques, and the manner in which these two components interact to build robust Information Assurance (IA).

- The first and (so far) only book to combine coverage of both security AND survivability in a networked information technology setting
- Leading industry and academic researchers provide state-of-the-art survivability and security techniques and explain how these components interact in providing information assurance
- Additional focus on security and survivability issues in wireless networks

Web Dragons

Web Dragons offers a perspective on the world of Web search and the effects of search engines and information availability on the present and future world. In the blink of an eye since the turn of the millennium, the lives of people who work with information have been utterly transformed. Everything we need to know is on the web. It's where we learn and play, shop and do business, keep up with old friends and meet new ones. Search engines make it possible for us to find the stuff we need to know. Search engines — web dragons — are the portals through which we access society's treasure trove of information. How do they stack up against librarians, the gatekeepers over centuries past? What role will libraries play in a world whose information is ruled by the web? How is the web organized? Who controls its contents, and how do they do it? How do search engines work? How can web visibility be exploited by those who want to sell us their wares? What's coming tomorrow, and can we influence it? As we witness the dawn of a new era, this book shows readers what it will look like and how it will change their world. Whoever you are: if you care about information, this book will open your eyes and make you blink.

- Presents a critical view of the idea of funneling information access through a small handful of gateways and the notion of a centralized index, and the problems that this may cause
- Provides promising approaches for addressing the problems, such as the personalization of web services
- Presented by authorities in the field of digital libraries, web history, machine learning, and web and data mining

Find more information at the author's site: webdragons.net

IBM eServer xSeries 450 Planning and Installation Guide

The IBM eServer xSeries 450 is IBM's new 64-bit Itanium Processor Family (IPF) Architecture server and is the first implementation of the 64-bit IBM XA-64 chipset, as part of the Enterprise X-Architecture strategy. This IBM Redbooks publication is a comprehensive resource on the technical aspects of the server, and is divided into five key subject areas:

- Chapter 1, Technical description, introduces the server and its subsystems and describes the key features and how they work. This includes the new Extensible Firmware Interface, which provides a powerful replacement to the BIOS facility found on the IA-32 platform.
- Chapter 2, Positioning, examines the types of applications that would be used on a server such as the x450.
- Chapter 3, Planning, describes the considerations when planning to purchase and planning to install the x450. It covers such topics as configuration, operating system specifics, scalability, and physical site planning.
- Chapter 4, Installation, covers the process of installing Windows Server 2003, Enterprise Edition and SuSE Linux Enterprise Server on the x450.
- Chapter 5, Management, describes how to use the Remote Supervisor Adapter to send alerts to an IBM Director management environment.