
Event

O'Reilly Data Engineering Books

2001-10-19 – 2027-05-25 · O'Reilly

Activities tracked

299

Collection of O'Reilly books on Data Engineering.

Filtering by: Big Data

Sessions & talks

Showing 251–275 of 299 · Newest first

Solr in Action

Solr in Action is a comprehensive guide to implementing scalable search using Apache Solr. This clearly written book walks you through well-documented examples ranging from basic keyword searching to scaling a system for billions of documents and queries. It will give you a deep understanding of how to implement core Solr capabilities.

About the Technology

Whether you're handling big (or small) data, managing documents, or building a website, it is important to be able to quickly search through your content and discover meaning in it. Apache Solr is your tool: a ready-to-deploy, Lucene-based, open source, full-text search engine. Solr can scale across many servers to enable real-time queries and data analytics across billions of documents.

About the Book

Solr in Action teaches you to implement scalable search using Apache Solr. This easy-to-read guide balances conceptual discussions with practical examples to show you how to implement all of Solr's core capabilities. You'll master topics like text analysis, faceted search, hit highlighting, result grouping, query suggestions, multilingual search, advanced geospatial and data operations, and relevancy tuning.

What's Inside
- How to scale Solr for big data
- Rich real-world examples
- Solr as a NoSQL data store
- Advanced multilingual, data, and relevancy tricks
- Coverage of versions through Solr 4.7

About the Reader

This book assumes basic knowledge of Java and standard database technology. No prior knowledge of Solr or Lucene is required.

About the Authors

Trey Grainger is a director of engineering at CareerBuilder. Timothy Potter is a senior member of the engineering team at LucidWorks. The authors work on the scalability and reliability of Solr, as well as on recommendation engine and big data analytics technologies.

Quotes
- "The knowledge and techniques you need." - From the Foreword by Yonik Seeley, Creator of Solr
- "Readable and immediately applicable ... an excellent book." - John Viviano, InterCorp, Inc.
- "The go-to guide for Solr ... a definitive resource for both beginners and experts." - Scott Anthony, Business Instruments
- "A well-dosed combination of deep technical knowledge and real-world experience." - Alexandre Madurell, Piksel, Inc.
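
Faceted search and keyword queries, two of the capabilities listed above, are exposed through Solr's standard HTTP /select endpoint. Below is a minimal sketch in Python using the requests library; the books collection and its title/category fields are hypothetical stand-ins, not examples from the book.

```python
import requests

# Minimal faceted keyword query against Solr's standard /select endpoint.
# Assumes a local Solr instance with a hypothetical "books" collection
# containing "title" and "category" fields.
SOLR_URL = "http://localhost:8983/solr/books/select"

params = {
    "q": "title:hadoop",       # basic keyword search on the title field
    "facet": "true",           # enable faceting
    "facet.field": "category", # count matching docs per category value
    "rows": 10,
    "wt": "json",              # ask for a JSON response
}

resp = requests.get(SOLR_URL, params=params)
resp.raise_for_status()
data = resp.json()

print("hits:", data["response"]["numFound"])
# Facet counts come back as a flat [value, count, value, count, ...] list.
counts = data["facet_counts"]["facet_fields"]["category"]
for value, count in zip(counts[::2], counts[1::2]):
    print(f"{value}: {count}")
```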

Apache Hadoop™ YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop™ 2

“This book is a critically needed resource for the newly released Apache Hadoop 2.0, highlighting YARN as the significant breakthrough that broadens Hadoop beyond the MapReduce paradigm.” —From the Foreword by Raymie Stata, CEO of Altiscale

The Insider’s Guide to Building Distributed, Big Data Applications with Apache Hadoop™ YARN

Apache Hadoop is helping drive the Big Data revolution. Now, its data processing has been completely overhauled: Apache Hadoop YARN provides resource management at data center scale and easier ways to create distributed applications that process petabytes of data. Now, in Apache Hadoop™ YARN, two Hadoop technical leaders show you how to develop new applications and adapt existing code to fully leverage these revolutionary advances. YARN project founder Arun Murthy and project lead Vinod Kumar Vavilapalli demonstrate how YARN increases scalability and cluster utilization, enables new programming models and services, and opens new options beyond Java and batch processing. They walk you through the entire YARN project lifecycle, from installation through deployment. You’ll find many examples drawn from the authors’ cutting-edge experience—first as Hadoop’s earliest developers and implementers at Yahoo! and now as Hortonworks developers moving the platform forward and helping customers succeed with it.

Coverage includes
- YARN’s goals, design, architecture, and components—how it expands the Apache Hadoop ecosystem
- Exploring YARN on a single node
- Administering YARN clusters and Capacity Scheduler
- Running existing MapReduce applications
- Developing a large-scale clustered YARN application
- Discovering new open source frameworks that run under YARN
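
The cluster utilization that YARN manages can also be observed from outside: the ResourceManager exposes a REST API for cluster metrics and running applications. Here is a minimal Python sketch; the field names follow the public YARN REST API, but the host is a placeholder and assumes a ResourceManager on the default web port 8088.

```python
import requests

# Query the YARN ResourceManager REST API for cluster-wide metrics.
# Assumes a ResourceManager on localhost at the default web port 8088.
RM = "http://localhost:8088"

metrics = requests.get(f"{RM}/ws/v1/cluster/metrics").json()["clusterMetrics"]
print("active nodes:", metrics["activeNodes"])
print("allocated MB:", metrics["allocatedMB"], "of", metrics["totalMB"])

# List currently running applications (MapReduce or otherwise).
apps = requests.get(f"{RM}/ws/v1/cluster/apps",
                    params={"states": "RUNNING"}).json()
for app in (apps.get("apps") or {}).get("app", []):
    print(app["id"], app["applicationType"], app["name"])
```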

Microsoft Big Data Solutions

Tap the power of Big Data with Microsoft technologies

Big Data is here, and Microsoft's new Big Data platform is a valuable tool to help your company get the very most out of it. This timely book shows you how to use HDInsight along with HortonWorks Data Platform for Windows to store, manage, analyze, and share Big Data throughout the enterprise. Focusing primarily on Microsoft and HortonWorks technologies but also covering open source tools, Microsoft Big Data Solutions explains best practices, covers on-premises and cloud-based solutions, and features valuable case studies. Best of all, it helps you integrate these new solutions with technologies you already know, such as SQL Server and Hadoop.

- Walks you through how to integrate Big Data solutions in your company using Microsoft's HDInsight Server, HortonWorks Data Platform for Windows, and open source tools
- Explores both on-premises and cloud-based solutions
- Shows how to store, manage, analyze, and share Big Data through the enterprise
- Covers topics such as Microsoft's approach to Big Data, installing and configuring HortonWorks Data Platform for Windows, integrating Big Data with SQL Server, visualizing data with Microsoft and HortonWorks BI tools, and more
- Helps you build and execute a Big Data plan
- Includes contributions from the Microsoft and HortonWorks Big Data product teams

If you need a detailed roadmap for designing and implementing a fully deployed Big Data solution, you'll want Microsoft Big Data Solutions.

Optimizing Hadoop for MapReduce

"Optimizing Hadoop for MapReduce" is your comprehensive guide to getting the best performance out of your Hadoop-based big data processing jobs. With a focus on practical application rather than theory, this book delves into the nuances of MapReduce job design, execution, and optimization to help you harness the full power of this technology. What this Book will help me do Understand the internal workings of Hadoop MapReduce and how it executes jobs. Master key optimization techniques to improve Hadoop job efficiency and resource use. Learn advanced MapReduce programming concepts to handle complex data processing tasks. Analyze and monitor Hadoop job performance using practical tools and methods. Integrate best practices for scaling production workloads in a Hadoop cluster. Author(s) Khaled Tannir is a seasoned software engineer and an expert in distributed systems, big data, and cloud technologies. He has decades of experience designing and optimizing systems for high-performance data processing. Khaled's hands-on approach to explaining technical concepts ensures readers gain practical, applied knowledge that can be immediately implemented in real-world projects. Who is it for? This book is intended for developers, data engineers, and system architects who work with or are planning to work with Apache Hadoop. Ideal readers should have basic familiarity with Hadoop concepts and a foundational understanding of distributed systems. This book will benefit professionals looking to optimize their Hadoop-based applications or understand advanced usage of MapReduce. Whether you're aiming to improve your existing knowledge or implement high-performance data solutions, this book is tailored for you.

Big Data

Big Data is defined as "a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools. The challenges include capture, storage, search, sharing, analysis, and visualization." Big Data has always been a major challenge in geoinformatics, as geospatial databases are inherently very large. This book integrates, in one single volume, techniques and technologies for storing and managing very large geospatial databases, and will help in developing new geoinformatics software and systems that involve very large databases.

IBM InfoSphere Streams: Accelerating Deployments with Analytic Accelerators

This IBM® Redbooks® publication describes visual development, visualization, adapters, analytics, and accelerators for IBM InfoSphere® Streams (V3), a key component of the IBM Big Data platform. Streams was designed to analyze data in motion, and can perform analysis on incredibly high volumes with high velocity, using a wide variety of analytic functions and data types. The Visual Development environment extends Streams Studio with drag-and-drop development, provides round tripping with existing text editors, and is ideal for rapid prototyping. Adapters facilitate getting data in and out of Streams, and V3 supports WebSphere MQ, Apache Hadoop Distributed File System, and IBM InfoSphere DataStage. Significant analytics include the native Streams Processing Language, SPSS Modeler analytics, Complex Event Processing, TimeSeries Toolkit for machine learning and predictive analytics, Geospatial Toolkit for location-based applications, and Annotation Query Language for natural language processing applications. Accelerators for Social Media Analysis and Telecommunications Event Data Analysis sample programs can be modified to build production level applications. Want to learn how to analyze high volumes of streaming data or implement systems requiring high performance across nodes in a cluster? Then this book is for you. Please note that the additional material referenced in the text is not available from IBM.

Data Just Right: Introduction to Large-Scale Data & Analytics

Making Big Data Work: Real-World Use Cases and Examples, Practical Code, Detailed Solutions

Large-scale data analysis is now vitally important to virtually every business. Mobile and social technologies are generating massive datasets; distributed cloud computing offers the resources to store and analyze them; and professionals have radically new technologies at their command, including NoSQL databases. Until now, however, most books on “Big Data” have been little more than business polemics or product catalogs. Data Just Right is different: it’s a completely practical and indispensable guide for every Big Data decision-maker, implementer, and strategist.

Michael Manoochehri, a former Google engineer and data hacker, writes for professionals who need practical solutions that can be implemented with limited resources and time. Drawing on his extensive experience, he helps you focus on building applications, rather than infrastructure, because that’s where you can derive the most value. Manoochehri shows how to address each of today’s key Big Data use cases in a cost-effective way by combining technologies in hybrid solutions. You’ll find expert approaches to managing massive datasets, visualizing data, building data pipelines and dashboards, choosing tools for statistical analysis, and more. Throughout, the author demonstrates techniques using many of today’s leading data analysis tools, including Hadoop, Hive, Shark, R, Apache Pig, Mahout, and Google BigQuery.

Coverage includes
- Mastering the four guiding principles of Big Data success—and avoiding common pitfalls
- Emphasizing collaboration and avoiding problems with siloed data
- Hosting and sharing multi-terabyte datasets efficiently and economically
- “Building for infinity” to support rapid growth
- Developing a NoSQL Web app with Redis to collect crowd-sourced data
- Running distributed queries over massive datasets with Hadoop, Hive, and Shark
- Building a data dashboard with Google BigQuery
- Exploring large datasets with advanced visualization
- Implementing efficient pipelines for transforming immense amounts of data
- Automating complex processing with Apache Pig and the Cascading Java library
- Applying machine learning to classify, recommend, and predict incoming information
- Using R to perform statistical analysis on massive datasets
- Building highly efficient analytics workflows with Python and Pandas
- Establishing sensible purchasing strategies: when to build, buy, or outsource
- Previewing emerging trends and convergences in scalable data technologies and the evolving role of the Data Scientist
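
To make one of those bullet points concrete: collecting crowd-sourced data with Redis usually means cheap, atomic writes to in-memory structures. Here is a minimal sketch with the redis-py client; the voting scenario and key names are invented for illustration, not taken from the book.

```python
import redis

# Collect crowd-sourced "votes" with cheap, atomic Redis operations.
# Assumes a local Redis server; key names are invented for illustration.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def record_vote(user_id: str, choice: str) -> None:
    # HINCRBY atomically bumps a per-choice counter in a hash.
    r.hincrby("votes:counts", choice, 1)
    # Track distinct participants in a set (duplicates are ignored).
    r.sadd("votes:users", user_id)

record_vote("u42", "hadoop")
record_vote("u7", "mongodb")
record_vote("u42", "hadoop")

print(r.hgetall("votes:counts"))   # {'hadoop': '2', 'mongodb': '1'}
print(r.scard("votes:users"))      # 2 distinct voters
```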

Big Data Application Architecture Q&A: A Problem-Solution Approach

Big Data Application Architecture Pattern Recipes provides an insight into heterogeneous infrastructures, databases, and visualization and analytics tools used for realizing the architectures of big data solutions. Its problem-solution approach helps in selecting the right architecture to solve the problem at hand. In the process of reading through these problems, you will learn to harness the power of new big data opportunities which various enterprises use to attain real-time profits. Big Data Application Architecture Pattern Recipes answers one of the most critical questions of this time: 'how do you select the best end-to-end architecture to solve your big data problem?'

The book deals with various mission-critical problems encountered by solution architects, consultants, and software architects while dealing with the myriad options available for implementing a typical solution, trying to extract insight from huge volumes of data in real time and across multiple relational and non-relational data types for clients from industries like retail, telecommunication, banking, and insurance. The patterns in this book provide the strong architectural foundation required to launch your next big data application. The architectures for realizing these opportunities are based on relatively less expensive and heterogeneous infrastructures compared to the traditional monolithic and hugely expensive options that exist currently. This book describes and evaluates the benefits of heterogeneity, which brings with it multiple options for solving the same problem, evaluation of trade-offs, and validation of 'fitness-for-purpose' of the solution.

What you'll learn
- Major considerations in building a big data solution
- Big data application architecture problems for specific industries
- What are the components one needs to build an end-to-end big data solution?
- Does one really need a real-time big data solution or an off-line analytics batch solution?
- What are the operations and support architectures for a big data solution?
- What are the scalability considerations and options for a Hadoop installation?

Who this book is for

CIOs, CTOs, enterprise architects, and software architects; consultants, solution architects, and information management (IM) analysts who want to architect a big data solution for their enterprise

Oracle NoSQL Database

Master Oracle NoSQL Database and enable highly reliable, scalable, and available data. Oracle NoSQL Database: Real-Time Big Data Management for the Enterprise shows you how to take full advantage of this cost-effective solution for storing, retrieving, and updating high-volume, unstructured data. The book covers installation, configuration, application development, capacity planning and sizing, and integration with other enterprise data center products. Real-world examples illustrate the concepts presented in this Oracle Press guide.

- Understand Oracle NoSQL Database architecture and the underlying data storage engine, Oracle Berkeley DB
- Install and configure Oracle NoSQL Database for optimal performance
- Develop complex, distributed applications using a rich set of APIs
- Read and write data into the Oracle NoSQL Database key-value store
- Apply an Avro schema to the value portion of the key-value pair using Avro bindings
- Learn best practices for capacity planning and sizing an enterprise-level Oracle NoSQL Database deployment
- Integrate Oracle NoSQL Database with Oracle Database, Oracle Event Processing, and Hadoop

Code examples from the book are available for download at www.OraclePressBooks.com.

Big Data Computing

Novel approaches and tools have emerged to tackle the challenges of Big Data. Moreover, the technology required for Big Data computing is developing at a satisfactory rate due to market forces and technological evolution. This book presents a mix of theory and industry cases that discuss the technical and practical issues related to Big Data in intelligent information management. It emphasizes the adoption and diffusion of Big Data tools and technologies in real practical applications.

Oracle® 12c For Dummies®

Demystifying the power of the Oracle 12c database

The Oracle database is the industry-leading relational database management system (RDBMS), used by everyone from small companies to the world's largest enterprises for their most critical business and analytical processing. Oracle 12c includes industry-leading enhancements to enable cloud computing and empowers users to manage both Big Data and traditional data structures faster and cheaper than ever before. Oracle 12c For Dummies is the perfect guide for a novice database administrator or an Oracle DBA who is new to Oracle 12c. The book covers what you need to know about Oracle 12c architecture, software tools, and how to successfully manage Oracle databases in the real world.

- Highlights the important features of Oracle 12c
- Explains how to create, populate, protect, tune, and troubleshoot a new Oracle database
- Covers advanced Oracle 12c technologies including Oracle Multitenant—the “pluggable database” concept—as well as several other key changes in this release

Make the most of Oracle 12c's improved efficiency, stronger security, and simplified management capabilities with Oracle 12c For Dummies.

Securing Hadoop

"Securing Hadoop" provides a comprehensive guide to implementing and understanding security within a Hadoop-based Big Data ecosystem. The book explores key topics like authentication, authorization, and data encryption, ensuring you gain practical insights on how to protect sensitive information effectively and integrate security measures into your Hadoop platform. What this Book will help me do Understand the key security challenges associated with Hadoop and Big Data platforms. Learn how to implement Kerberos authentication and integrate it with Hadoop. Master the configuration of authorization mechanisms for a secure Hadoop ecosystem. Gain knowledge about security event auditing and monitoring techniques specifically for Hadoop. Get a detailed overview of tools and protocols to build and secure a Hadoop infrastructure effectively. Author(s) Sudheesh Narayan is an experienced professional in the fields of Hadoop and enterprise security. With years of expertise in designing and implementing secure distributed data platforms, Sudheesh brings practical insights and step-by-step solutions to Hadoop practitioners. His teaching approach is hands-on, ensuring readers can directly apply theoretical concepts to real-world scenarios. Who is it for? This book is ideal for Hadoop practitioners including solution architects, administrators, and developers seeking to enhance their understanding of security mechanisms for Hadoop. It assumes a foundational knowledge of Hadoop and requires familiarity with basic security concepts. Readers aiming to implement secure Hadoop systems for enterprise-level applications will find this book especially beneficial.

The Definitive Guide to MongoDB: A complete guide to dealing with Big Data using MongoDB, Second Edition

The Definitive Guide to MongoDB, Second Edition, is updated for the latest version and includes all of the latest MongoDB features, including the aggregation framework introduced in version 2.2 and hashed indexes in version 2.4. MongoDB is the most popular of the "Big Data" NoSQL database technologies, and it's still growing. David Hows from 10gen, along with experienced MongoDB authors Peter Membrey and Eelco Plugge, provide their expertise and experience in teaching you everything you need to know to become a MongoDB pro.

The Definitive Guide to MongoDB, Second Edition, starts with the basics, including how to install on Windows, Linux, and OS X, and how MongoDB handles your data. Then you'll learn how to develop with MongoDB with both PHP and Python, including an example application using a PHP driver to create a blog application. Finally, you'll dig into more advanced but extremely important MongoDB features, including optimization, replication, and sharding -- load-balancing that makes MongoDB ideal for dealing with Big Data. If you're dealing with data, MongoDB should be on your must-learn list. The Definitive Guide to MongoDB, Second Edition, is just the book you need.

What you'll learn
- Set up MongoDB on all major server platforms, including Windows, Linux, OS X, and cloud platforms like Rackspace, Azure, and Amazon EC2
- Work with GridFS and the new aggregation framework
- Work with your data using non-SQL commands
- Write applications using either PHP or Python
- Optimize MongoDB
- Master MongoDB administration, including replication, replication tagging, and tag-aware sharding

Who this book is for

Database admins and developers who need to get up to speed on MongoDB and its Big Data, NoSQL approach to dealing with data management.
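
The aggregation framework mentioned above pushes documents through a pipeline of stages. Here is a minimal sketch with the pymongo driver, loosely echoing the book's blog example; the database, collection, and field names are assumptions for illustration.

```python
from pymongo import MongoClient

# Group blog posts by author and count them with the aggregation framework.
# Assumes a local mongod; the "blog" database and field names are invented.
client = MongoClient("mongodb://localhost:27017")
posts = client.blog.posts

posts.insert_many([
    {"author": "peter", "title": "Intro to GridFS"},
    {"author": "eelco", "title": "Sharding basics"},
    {"author": "peter", "title": "Replica sets"},
])

# A two-stage pipeline: $group accumulates per-author counts, $sort orders them.
pipeline = [
    {"$group": {"_id": "$author", "post_count": {"$sum": 1}}},
    {"$sort": {"post_count": -1}},
]
for row in posts.aggregate(pipeline):
    print(row["_id"], row["post_count"])
```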

The Culture of Big Data

Technology does not exist in a vacuum. In the same way that a plant needs water and nourishment to grow, technology needs people and process to thrive and succeed. Culture (i.e., people and process) is integral and critical to the success of any new technology deployment or implementation. Big data is not just a technology phenomenon. It has a cultural dimension. It's vitally important to remember that most people have not considered the immense difference between a world seen through the lens of a traditional relational database system and a world seen through the lens of a Hadoop Distributed File System. This paper broadly describes the cultural challenges that accompany efforts to create and sustain big data initiatives in an evolving world whose data management processes are rooted firmly in traditional data warehouse architectures.

DB2 10.5 with BLU Acceleration

UPGRADE TO THE NEW GENERATION OF DATABASE SOFTWARE FOR THE ERA OF BIG DATA!

If big data is an untapped natural resource, how do you find the gold hidden within? Leaders realize that big data means all data, and are moving quickly to extract more value from both structured and unstructured application data. However, analyzing this data can prove costly and complex, especially while protecting the availability, performance, and reliability of essential business applications. In the new era of big data, businesses require data systems that can blend always-available transactions with speed-of-thought analytics. DB2 10.5 with BLU Acceleration provides this speed, simplicity, and affordability while making it easier to build next-generation applications with NoSQL features, such as a mongo-styled JSON document store, a graph store, and more. Dynamic in-memory columnar processing and other innovations deliver faster insights from more data, and enhanced pureScale clustering technology delivers high-availability transactions with application-transparent scalability for business continuity.

With this book, you'll learn about the power and flexibility of multi-workload, multi-platform database software. Use the comprehensive knowledge from a team of DB2 developers and experts to get started with the latest DB2 trial version, available for download at ibm.com/developerworks/downloads/im/db2/. Stay up to date on DB2 by visiting ibm.com/db2/.

Joe Celko’s Complete Guide to NoSQL

Joe Celko's Complete Guide to NoSQL provides a complete overview of non-relational technologies so that you can become more nimble to meet the needs of your organization. As data continues to explode and grow more complex, SQL is becoming less useful for querying data and extracting meaning. In this new world of bigger and faster data, you will need to leverage non-relational technologies to get the most out of the information you have. Learn where, when, and why the benefits of NoSQL outweigh those of SQL with Joe Celko's Complete Guide to NoSQL.

This book covers three areas that make today's new data different from the data of the past: velocity, volume, and variety. When information is changing faster than you can collect and query it, it simply cannot be treated the same as static data. Celko will help you understand velocity, to equip you with the tools to drink from a fire hose. Old storage and access models do not work for big data. Celko will help you understand volume, as well as different ways to store and access data such as petabytes and exabytes. Not all data can fit into a relational model, including genetic data, semantic data, and data generated by social networks. Celko will help you understand variety, as well as the alternative storage, query, and management frameworks needed by certain kinds of data.

- Gain a complete understanding of the situations in which SQL has more drawbacks than benefits so that you can better determine when to utilize NoSQL technologies for maximum benefit
- Recognize the pros and cons of columnar, streaming, and graph databases
- Make the transition to NoSQL with the expert guidance of best-selling SQL expert Joe Celko

Oracle Big Data Handbook

Transform Big Data into Insight

"In this book, some of Oracle's best engineers and architects explain how you can make use of big data. They'll tell you how you can integrate your existing Oracle solutions with big data systems, using each where appropriate and moving data between them as needed." -- Doug Cutting, co-creator of Apache Hadoop

Cowritten by members of Oracle's big data team, Oracle Big Data Handbook provides complete coverage of Oracle's comprehensive, integrated set of products for acquiring, organizing, analyzing, and leveraging unstructured data. The book discusses the strategies and technologies essential for a successful big data implementation, including Apache Hadoop, Oracle Big Data Appliance, Oracle Big Data Connectors, Oracle NoSQL Database, Oracle Endeca, Oracle Advanced Analytics, and Oracle's open source R offerings. Best practices for migrating from legacy systems and integrating existing data warehousing and analytics solutions into an enterprise big data infrastructure are also included in this Oracle Press guide.

- Understand the value of a comprehensive big data strategy
- Maximize the distributed processing power of the Apache Hadoop platform
- Discover the advantages of using Oracle Big Data Appliance as an engineered system for Hadoop and Oracle NoSQL Database
- Configure, deploy, and monitor Hadoop and Oracle NoSQL Database using Oracle Big Data Appliance
- Integrate your existing data warehousing and analytics infrastructure into a big data architecture
- Share data among Hadoop and relational databases using Oracle Big Data Connectors
- Understand how Oracle NoSQL Database integrates into the Oracle Big Data architecture
- Deliver faster time to value using in-database analytics
- Analyze data with Oracle Advanced Analytics (Oracle R Enterprise and Oracle Data Mining), Oracle R Distribution, ROracle, and Oracle R Connector for Hadoop
- Analyze disparate data with Oracle Endeca Information Discovery
- Plan and implement a big data governance strategy and develop an architecture and roadmap

Professional Hadoop Solutions

The go-to guidebook for deploying Big Data solutions with Hadoop

Today's enterprise architects need to understand how the Hadoop frameworks and APIs fit together, and how they can be integrated to deliver real-world solutions. This book is a practical, detailed guide to building and implementing those solutions, with code-level instruction in the popular Wrox tradition. It covers storing data with HDFS and HBase, processing data with MapReduce, and automating data processing with Oozie. Hadoop security, running Hadoop with Amazon Web Services, best practices, and automating Hadoop processes in real time are also covered in depth. With in-depth code examples in Java and XML and the latest on recent additions to the Hadoop ecosystem, this complete resource also covers the use of APIs, exposing their inner workings and allowing architects and developers to better leverage and customize them.

- The ultimate guide for developers, designers, and architects who need to build and deploy Hadoop applications
- Covers storing and processing data with various technologies, automating data processing, Hadoop security, and delivering real-time solutions
- Includes detailed, real-world examples and code-level guidelines
- Explains when, why, and how to use these tools effectively
- Written by a team of Hadoop experts in the programmer-to-programmer Wrox style

Professional Hadoop Solutions is the reference enterprise architects and developers need to maximize the power of Hadoop.

Oracle Essentials, 5th Edition

Written by Oracle insiders, this indispensable guide distills an enormous amount of information about the Oracle Database into one compact volume. Ideal for novice and experienced DBAs, developers, managers, and users, Oracle Essentials walks you through technologies and features in Oracle’s product line, including its architecture, data structures, networking, concurrency, and tuning. Complete with illustrations and helpful hints, this fifth edition provides a valuable one-stop overview of Oracle Database 12c, including an introduction to Oracle and cloud computing. Oracle Essentials provides the conceptual background you need to understand how Oracle truly works.

Topics include:
- A complete overview of Oracle databases and data stores, and Fusion Middleware products and features
- Core concepts and structures in Oracle’s architecture, including pluggable databases
- Oracle objects and the various datatypes Oracle supports
- System and database management, including Oracle Enterprise Manager 12c
- Security options, basic auditing capabilities, and options for meeting compliance needs
- Performance characteristics of disk, memory, and CPU tuning
- Basic principles of multiuser concurrency
- Oracle’s online transaction processing (OLTP)
- Data warehouses, Big Data, and Oracle’s business intelligence tools
- Backup and recovery, and high availability and failover solutions

Making Sense of NoSQL

Making Sense of NoSQL clearly and concisely explains the concepts, features, benefits, potential, and limitations of NoSQL technologies. Using examples and use cases, illustrations, and plain, jargon-free writing, this guide shows how you can effectively assemble a NoSQL solution to replace or augment the traditional RDBMS you have now.

About the Book

If you want to understand and perhaps start using the new data storage and analysis technologies that go beyond the SQL database model, this book is for you. Written in plain language suitable for technical managers and developers, and using many examples, use cases, and illustrations, this book explains the concepts, features, benefits, potential, and limitations of NoSQL. Making Sense of NoSQL starts by comparing familiar database concepts to the new NoSQL patterns that augment or replace them. Then, you'll explore case studies on big data, search, reliability, and business agility that apply these new patterns to today's business problems. You'll see how NoSQL systems can leverage the resources of modern cloud computing and multiple-CPU data centers. The final chapters show you how to choose the right NoSQL technologies for your own needs.

What's Inside
- NoSQL data architecture patterns
- NoSQL for big data
- Search, high availability, and security
- Choosing an architecture

About the Reader

Managers and developers will welcome this lucid overview of the potential and capabilities of NoSQL technologies.

About the Authors

Dan McCreary and Ann Kelly lead an independent training and consultancy firm focused on NoSQL solutions and are cofounders of the NoSQL Now! Conference.

Quotes
- "Easily digestible, practical advice for technical managers, architects, and developers." - From the Foreword by Tony Shaw, CEO of DATAVERSITY
- "Cuts through the jargon and gives you the information you need to know." - Craig Smith, Unbound DNA
- "A concise yet thorough description of the many facets of NoSQL, from big data to search." - John Guthrie, Pivotal
- "Brings common sense to the world of NoSQL." - Ignacio Lopez Vellon, Atos Worldgrid
- "Get ahead of your peers ... fast-track to NoSQL now!" - Ian Stirk, Stirk Consultancy, Ltd

Enterprise Data Workflows with Cascading

There is an easier way to build Hadoop applications. With this hands-on book, you’ll learn how to use Cascading, the open source abstraction framework for Hadoop that lets you easily create and manage powerful enterprise-grade data processing applications—without having to learn the intricacies of MapReduce. Working with sample apps based on Java and other JVM languages, you’ll quickly learn Cascading’s streamlined approach to data processing, data filtering, and workflow optimization. This book demonstrates how this framework can help your business extract meaningful information from large amounts of distributed data.

- Start working on Cascading example projects right away
- Model and analyze unstructured data in any format, from any source
- Build and test applications with familiar constructs and reusable components
- Work with the Scalding and Cascalog Domain-Specific Languages
- Easily deploy applications to Hadoop, regardless of cluster location or data size
- Build workflows that integrate several big data frameworks and processes
- Explore common use cases for Cascading, including features and tools that support them
- Examine a case study that uses a dataset from the Open Data Initiative

IBM Information Server: Integration and Governance for Emerging Data Warehouse Demands

This IBM® Redbooks® publication is intended for business leaders and IT architects who are responsible for building and extending their data warehouse and Business Intelligence infrastructure. It provides an overview of powerful new capabilities of Information Server in the areas of big data, statistical models, data governance and data quality. The book also provides key technical details that IT professionals can use in solution planning, design, and implementation.

Learning SPARQL, 2nd Edition

Gain hands-on experience with SPARQL, the RDF query language that’s bringing new possibilities to semantic web, linked data, and big data projects. This updated and expanded edition shows you how to use SPARQL 1.1 with a variety of tools to retrieve, manipulate, and federate data from the public web as well as from private sources. Author Bob DuCharme has you writing simple queries right away before providing background on how SPARQL fits into RDF technologies. Using short examples that you can run yourself with open source software, you’ll learn how to update, add to, and delete data in RDF datasets.

- Get the big picture on RDF, linked data, and the semantic web
- Use SPARQL to find bad data and create new data from existing data
- Use datatype metadata and functions in your queries
- Learn techniques and tools to help your queries run more efficiently
- Use RDF Schemas and OWL ontologies to extend the power of your queries
- Discover the roles that SPARQL can play in your applications
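
For a flavor of the queries the book teaches, the sketch below runs a SPARQL 1.1 SELECT against the public DBpedia endpoint via the SPARQLWrapper library; the endpoint and query are illustrative choices, not examples from the book.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Run a simple SPARQL 1.1 SELECT against the public DBpedia endpoint.
# The endpoint and query are illustrative, not taken from the book.
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?pl ?name WHERE {
        ?pl a dbo:ProgrammingLanguage ;
            rdfs:label ?name .
        FILTER (lang(?name) = "en")
    }
    LIMIT 5
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["name"]["value"])
```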

Apache Sqoop Cookbook

Integrating data from multiple sources is essential in the age of big data, but it can be a challenging and time-consuming task. This handy cookbook provides dozens of ready-to-use recipes for using Apache Sqoop, the command-line interface application that optimizes data transfers between relational databases and Hadoop. Sqoop is both powerful and bewildering, but with this cookbook’s problem-solution-discussion format, you’ll quickly learn how to deploy and then apply Sqoop in your environment. The authors provide MySQL, Oracle, and PostgreSQL database examples on GitHub that you can easily adapt for SQL Server, Netezza, Teradata, or other relational systems.

- Transfer data from a single database table into your Hadoop ecosystem
- Keep table data and Hadoop in sync by importing data incrementally
- Import data from more than one database table
- Customize transferred data by calling various database functions
- Export generated, processed, or backed-up data from Hadoop to your database
- Run Sqoop within Oozie, Hadoop’s specialized workflow scheduler
- Load data into Hadoop’s data warehouse (Hive) or database (HBase)
- Handle installation, connection, and syntax issues common to specific database vendors
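
The incremental-import recipe pattern, for example, reduces to a single sqoop import invocation. Here is a hedged sketch, wrapped in Python for consistency with the other examples on this page; the JDBC URL, credentials, table, and paths are placeholders.

```python
import subprocess

# Incrementally import new rows from a MySQL table into HDFS with Sqoop.
# The JDBC URL, credentials, table, and paths are placeholders.
cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://db.example.com/shop",
    "--username", "sqoop_user",
    "--password-file", "/user/sqoop/.password",  # avoid plaintext passwords
    "--table", "orders",
    "--target-dir", "/data/orders",
    # Only fetch rows whose id is greater than the last imported value.
    "--incremental", "append",
    "--check-column", "id",
    "--last-value", "42",
]
subprocess.run(cmd, check=True)
```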

The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Edition

Updated new edition of Ralph Kimball's groundbreaking book on dimensional modeling for data warehousing and business intelligence! The first edition of Ralph Kimball's The Data Warehouse Toolkit introduced the industry to dimensional modeling, and now his books are considered the most authoritative guides in this space. This new third edition is a complete library of updated dimensional modeling techniques, the most comprehensive collection ever. It covers new and enhanced star schema dimensional modeling patterns, adds two new chapters on ETL techniques, includes new and expanded business matrices for 12 case studies, and more.

- Authored by Ralph Kimball and Margy Ross, known worldwide as educators, consultants, and influential thought leaders in data warehousing and business intelligence
- Begins with fundamental design recommendations and progresses through increasingly complex scenarios
- Presents unique modeling techniques for business applications such as inventory management, procurement, invoicing, accounting, customer relationship management, big data analytics, and more
- Draws real-world case studies from a variety of industries, including retail sales, financial services, telecommunications, education, health care, insurance, e-commerce, and more

Design dimensional databases that are easy to understand and provide fast query response with The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Edition.
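
To ground the core idea: dimensional modeling puts measurements in a fact table surrounded by descriptive dimension tables (a star schema). A minimal sketch using Python's built-in sqlite3 module; the retail-sales tables are a textbook-style illustration, not Kimball's exact design.

```python
import sqlite3

# A tiny star schema: one fact table joined to two dimension tables.
# The retail-sales design is a classic illustration, not Kimball's exact schema.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE fact_sales (
        date_key    INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity    INTEGER,
        amount      REAL
    );
    INSERT INTO dim_date    VALUES (20140101, '2014-01-01', '2014-01');
    INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware');
    INSERT INTO fact_sales  VALUES (20140101, 1, 3, 29.97);
""")

# A typical dimensional query: slice the facts by dimension attributes.
for row in con.execute("""
    SELECT d.month, p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_date d    ON f.date_key = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY d.month, p.category
"""):
    print(row)
```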