talk-data.com

Event

O'Reilly Data Engineering Books

2001-10-19 – 2027-05-25 · O'Reilly

Activities tracked

299

Collection of O'Reilly books on Data Engineering.

Filtering by: Big Data

Sessions & talks

Showing 226–250 of 299 · Newest first

Big Data Analytics

With this book, managers and decision makers are given the tools to make more informed decisions about big data purchasing initiatives. Big Data Analytics: A Practical Guide for Managers not only supplies descriptions of common tools, but also surveys the various products and vendors that supply the big data market. Comparing and contrasting the different types of analysis commonly conducted with big data, this accessible reference presents clear-cut explanations of the general workings of big data tools. Instead of spending time on HOW to install specific packages, it focuses on the reasons WHY readers would install a given package. The book provides authoritative guidance on a range of tools, including open source and proprietary systems. It details the strengths and weaknesses of incorporating big data analysis into decision-making and explains how to leverage the strengths while mitigating the weaknesses. It:

- Describes the benefits of distributed computing in simple terms
- Includes substantial vendor/tool material, especially for open source decisions
- Covers prominent software packages, including Hadoop and Oracle Endeca
- Examines GIS and machine learning applications
- Considers privacy and surveillance issues

The book further explores basic statistical concepts that, when misapplied, can be the source of errors. Time and again, big data is treated as an oracle that discovers results nobody would have imagined. While big data can serve this valuable function, all too often these results are incorrect, yet are still reported unquestioningly. The probability of obtaining erroneous results increases as more variables are compared, unless preventative measures are taken. The authors' approach is to explain these concepts so that managers can ask better questions of their analysts and vendors about the appropriateness of the methods used to arrive at a conclusion. Because the world of science and medicine has been grappling with similar issues in the publication of studies, the authors draw on those efforts and apply them to big data.
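
The multiple-comparisons risk the authors flag is easy to quantify. As a minimal Python sketch (the 0.05 significance threshold and independence of tests are assumptions for illustration): testing n independent variables at level alpha gives a 1 - (1 - alpha)^n chance of at least one spurious "discovery".

    # Chance of at least one false positive when testing n
    # independent variables at significance level alpha.
    alpha = 0.05  # conventional threshold; an assumption for illustration

    for n in (1, 10, 50, 100):
        p_spurious = 1 - (1 - alpha) ** n
        print(f"{n:>3} comparisons -> {p_spurious:.0%} chance of a false positive")

    # Roughly: 1 -> 5%, 10 -> 40%, 50 -> 92%, 100 -> 99%

This is why corrections such as Bonferroni exist: without them, comparing enough variables all but guarantees an "insight" that isn't there.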

ElasticSearch Cookbook - Second Edition

The "ElasticSearch Cookbook - Second Edition" is a hands-on guide featuring over 130 advanced recipes to help you harness the power of ElasticSearch, a leading search and analytics engine. Through insightful examples and practical guidance, you'll learn to implement efficient search solutions, optimize queries, and manage ElasticSearch clusters effectively. What this Book will help me do Design and configure ElasticSearch topologies optimized for your specific deployment needs. Develop and utilize custom mappings to optimize your data indexes. Execute advanced queries and filters to refine and retrieve search results effectively. Set up and monitor ElasticSearch clusters for optimal performance. Extend ElasticSearch capabilities through plugin development and integrations using Java and Python. Author(s) Alberto Paro is a technology expert with years of experience working with ElasticSearch, Big Data solutions, and scalable cloud architecture. He has authored multiple books and technical articles on ElasticSearch, leveraging his extensive knowledge to provide practical insights. His approachable and detail-oriented style makes complex concepts accessible to technical professionals. Who is it for? This book is best suited for software developers and IT professionals looking to use ElasticSearch in their projects. Readers should be familiar with JSON, as well as basic programming skills in Java. It is ideal for those who have an understanding of search applications and want to deepen their expertise. Whether you're integrating ElasticSearch into a web application or optimizing your system's search capabilities, this book will provide the skills and knowledge you need.

Implementing the IBM Storwize V7000 Gen2

Data is the new currency of business, the most critical asset of the modern organization. In fact, enterprises that can gain business insights from their data are twice as likely to outperform their competitors. Nevertheless, 72% of them have not started, or are only planning, big data activities. In addition, organizations often spend too much money and time managing where their data is stored. The average firm purchases 24% more storage every year, but uses less than half of the capacity that it already has.

The IBM® Storwize® family, including the IBM SAN Volume Controller Data Platform, is a storage virtualization system that enables a single point of control for storage resources. This functionality helps support improved business application availability and greater resource use. The business objectives of this system are:

- To manage storage resources in your information technology (IT) infrastructure
- To make sure that those resources are used to the advantage of your business
- To do it quickly, efficiently, and in real time, while avoiding increases in administrative costs

Storwize functions benefit all virtualized storage. For example, IBM Easy Tier® optimizes use of flash memory. In addition, IBM Real-time Compression™ enhances efficiency even further by enabling the storage of up to five times as much active primary data in the same physical disk space. Finally, high-performance thin provisioning helps automate provisioning. These benefits can help extend the useful life of existing storage assets, reducing costs. Integrating these functions into Storwize also means that they are designed to operate smoothly together, reducing management effort.

This IBM Redbooks® publication provides information about the latest features and functions of the Storwize V7000 Gen2 and software version 7.3, covering implementation, architectural improvements, and Easy Tier.

Data Driven

Succeeding with data isn’t just a matter of putting Hadoop in your machine room, or hiring some physicists with crazy math skills. It requires you to develop a data culture that involves people throughout the organization. In this O’Reilly report, DJ Patil and Hilary Mason outline the steps you need to take if your company is to be truly data-driven—including the questions you should ask and the methods you should adopt. You’ll not only learn examples of how Google, LinkedIn, and Facebook use their data, but also how Walmart, UPS, and other organizations took advantage of this resource long before the advent of Big Data. No matter how you approach it, building a data culture is the key to success in the 21st century.

You’ll explore:

- Data scientist skills—and why every company needs a Spock
- How the benefits of giving company-wide access to data outweigh the costs
- Why data-driven organizations use the scientific method to explore and solve data problems
- Key questions to help you develop a research-specific process for tackling important issues
- What to consider when assembling your data team
- Developing processes to keep your data team (and company) engaged
- Choosing technologies that are powerful, support teamwork, and are easy to use and learn

Getting a Big Data Job For Dummies

Hone your analytic talents and become part of the next big thing. Getting a Big Data Job For Dummies is the ultimate guide to landing a position in one of the fastest-growing fields in the modern economy. Learn exactly what "big data" means, why it's so important across all industries, and how you can obtain one of the most sought-after skill sets of the decade. This book walks you through the process of identifying your ideal big data job, shaping the perfect resume, and nailing the interview, all in one easy-to-read guide.

Companies from all industries, including finance, technology, medicine, and defense, are harnessing massive amounts of data to reap a competitive advantage. The demand for big data professionals is growing every year, and experts forecast an estimated 1.9 million additional U.S. jobs in big data by 2015. Whether your niche is developing the technology, handling the data, or analyzing the results, turning your attention to a career in big data can lead to a more secure, more lucrative career path. Getting a Big Data Job For Dummies provides an overview of the big data career arc, and then shows you how to get your foot in the door with topics like:

- The education you need to succeed
- The range of big data career path options
- An overview of major big data employers
- A plan to develop your job-landing strategy

Your analytic inclinations may be your ticket to long-lasting success. In a highly competitive job market, developing your data skills can create a situation where you pick your employer rather than the other way around. If you're ready to get in on the ground floor of the next big thing, Getting a Big Data Job For Dummies will teach you everything you need to know to get started today.

Practical Neo4j

" Why have developers at places like Facebook and Twitter increasingly turned to graph databases to manage their highly connected big data? The short answer is that graphs offer superior speed and flexibility to get the job done. It’s time you added skills in graph databases to your toolkit. In Practical Neo4j, database expert Greg Jordan guides you through the background and basics of graph databases and gets you quickly up and running with Neo4j, the most prominent graph database on the market today. Jordan walks you through the data modeling stages for projects such as social networks, recommendation engines, and geo-based applications. The book also dives into the configuration steps as well as the language options used to create your Neo4j-backed applications. Neo4j runs some of the largest connected datasets in the world, and developing with it offers you a fast, proven NoSQL database option. Besides those working for social media, database, and networking companies of all sizes, academics and researchers will find Neo4j a powerful research tool that can help connect large sets of diverse data and provide insights that would otherwise remain hidden. Using Practical Neo4j, you will learn how to harness that power and create elegant solutions that address complex data problems. This book: Explains the basics of graph databases Demonstrates how to configure and maintain Neo4j Shows how to import data into Neo4j from a variety of sources Provides a working example of a Neo4j-based application using an array of language of options including Java, .Net, PHP, Python, Spring, and Ruby As you’ll discover, Neo4j offers a blend of simplicity and speed while allowing data relationships to maintain first-class status. That’s one reason among many that such a wide range of industries and fields have turned to graph databases to analyze deep, dense relationships. After reading this book, you’ll have a potent, elegant tool you can use to develop projects profitably and improve your career options.

Big Data Made Easy: A Working Guide to the Complete Hadoop Toolset

Many corporations are finding that the sizes of their data sets are outgrowing the capability of their systems to store and process them. The data is becoming too big to manage and use with traditional tools. The solution: implementing a big data system.

As Big Data Made Easy: A Working Guide to the Complete Hadoop Toolset shows, Apache Hadoop offers a scalable, fault-tolerant system for storing and processing data in parallel. It has a very rich toolset that allows for storage (Hadoop), configuration (YARN and ZooKeeper), collection (Nutch and Solr), processing (Storm, Pig, and MapReduce), scheduling (Oozie), moving (Sqoop and Avro), monitoring (Chukwa, Ambari, and Hue), testing (Bigtop), and analysis (Hive).

The problem is that the Internet offers IT pros wading into big data many versions of the truth and some outright falsehoods born of ignorance. What is needed is a book just like this one: a wide-ranging but easily understood set of instructions to explain where to get Hadoop tools, what they can do, how to install them, how to configure them, how to integrate them, and how to use them successfully. And you need an expert who has worked in this area for a decade—someone just like author and big data expert Mike Frampton.

Big Data Made Easy approaches the problem of managing massive data sets from a systems perspective, and it explains the roles for each project (like architect and tester, for example) and shows how the Hadoop toolset can be used at each system stage. It explains, in an easily understood manner and through numerous examples, how to use each tool. The book also explains the sliding scale of tools available depending upon data size and when and how to use them. Big Data Made Easy shows developers and architects, as well as testers and project managers, how to:

- Store big data
- Configure big data
- Process big data
- Schedule processes
- Move data among SQL and NoSQL systems
- Monitor data
- Perform big data analytics
- Report on big data processes and projects
- Test big data systems

Big Data Made Easy also explains the best part, which is that this toolset is free. Anyone can download it and—with the help of this book—start to use it within a day. With the skills this book will teach you under your belt, you will add value to your company or client immediately, not to mention your career.

Big Data Now: 2014 Edition

In the four years that O'Reilly Media, Inc. has produced its annual Big Data Now report, the data field has grown from infancy into young adulthood. Data is now a leader in some fields and a driver of innovation in others, and companies that use data and analytics to drive decision-making are outperforming their peers. And while access to big data tools and techniques once required significant expertise, today many tools have improved and communities have formed to share best practices. Companies have also started to emphasize the importance of processes, culture, and people. The topics in Big Data Now: 2014 Edition represent the major forces currently shaping the data world:

- Cognitive augmentation: predictive APIs, graph analytics, and Network Science dashboards
- Intelligence matters: defining AI, modeling intelligence, deep learning, and "summoning the demon"
- Cheap sensors, fast networks, and distributed computing: stream processing, hardware data flows, and computing at the edge
- Data (science) pipelines: broadening the coverage of analytic pipelines with specialized tools
- Evolving marketplace of big data components: SSDs, Hadoop 2, Spark; and why datacenters need operating systems
- Design and social science: human-centered design, wearables and real-time communications, and wearable etiquette
- Building a data culture: moving from prediction to real-time adaptation; and why you need to become a data skeptic
- Perils of big data: data redlining, intrusive data analysis, and the state of big data ethics

Data Architecture: A Primer for the Data Scientist

Today, the world is trying to create and educate data scientists because of the phenomenon of Big Data. And everyone is looking deeply into this technology. But no one is looking at the larger architectural picture of how Big Data needs to fit within the existing systems (data warehousing systems). Taking a look at the larger picture into which Big Data fits gives the data scientist the necessary context for how the pieces of the puzzle should fit together. Most references on Big Data look at only one tiny part of a much larger whole. Until the data gathered can be put into an existing framework or architecture, it can’t be used to its full potential.

Data Architecture: A Primer for the Data Scientist addresses the larger architectural picture of how Big Data fits with the existing information infrastructure, an essential topic for the data scientist. Drawing upon years of practical experience, and using numerous examples and an easy-to-understand framework, W.H. Inmon and Daniel Linstedt define the importance of data architecture and how it can be used effectively to harness big data within existing systems.

You’ll be able to:

- Turn textual information into a form that can be analyzed by standard tools
- Make the connection between analytics and Big Data
- Understand how Big Data fits within an existing systems environment
- Conduct analytics on repetitive and non-repetitive data

The book:

- Discusses the value in Big Data that is often overlooked, non-repetitive data, and why there is significant business value in using it
- Shows how to turn textual information into a form that can be analyzed by standard tools
- Explains how Big Data fits within an existing systems environment
- Presents new opportunities that are afforded by the advent of Big Data
- Demystifies the murky waters of repetitive and non-repetitive data in Big Data

Learning HBase

In "Learning HBase", you'll dive deep into the core functionalities of Apache HBase and understand its applications in handling Big Data environments. By exploring both theoretical concepts and practical scenarios, you'll acquire the skills to set up, manage, and optimize HBase clusters. What this Book will help me do Understand and explain the components of the HBase ecosystem. Install and configure HBase clusters for optimized performance. Develop and maintain applications using HBase's structured storage model. Troubleshoot and resolve common issues in HBase deployments. Leverage Hadoop tools and advanced techniques to enhance HBase capabilities. Author(s) None Shriparv is a skilled technologist with a robust background in Big Data tools and application development. With hands-on expertise in distributed storage systems and data analytics, they lend exceptional insights into managing HBase environments. Their approach combines clarity, practicality, and a focus on real-world applicability. Who is it for? This book is ideal for system administrators and developers who are starting their journey in Big Data technology. With clear explanations and hands-on scenarios, it suits those seeking foundational and intermediate knowledge of the HBase ecosystem. Suitably designed, it helps students, early-career professionals, and mid-level technologists enhance their expertise. If you work in Big Data and want to grow your skill set in distributed storage systems, this book is for you.

The Big Data-Driven Business

Get the expert perspective and practical advice on big data. The Big Data-Driven Business: How to Use Big Data to Win Customers, Beat Competitors, and Boost Profits makes the case that big data is for real, and more than just big hype. The book uses real-life examples—from Nate Silver to Copernicus, and Apple to Blackberry—to demonstrate how the winners of the future will use big data to seek the truth. Written by a marketing journalist and the CEO of a multi-million-dollar B2B marketing platform that reaches more than 90% of the U.S. business population, this book is a comprehensive and accessible guide on how to win customers, beat competitors, and boost the bottom line with big data.

The marketplace has entered an era where the customer holds all the cards. With unprecedented choice in both the consumer world and the B2B world, it's imperative that businesses gain a greater understanding of their customers and prospects. Big data is the key to this insight, because it provides a comprehensive view of a company's customers—who they are, and who they may be tomorrow. The Big Data-Driven Business is a complete guide to the future of business as seen through the lens of big data, with expert advice on real-world applications:

- Learn what big data is, and how it will transform the enterprise
- Explore why major corporations are betting their companies on marketing technology
- Read case studies of big data winners and losers
- Discover how to change privacy and security, and remodel marketing

Better information allows for better decisions, better targeting, and better reach. Big data has become an indispensable tool for the most effective marketers in the business, and it's becoming less of a competitive advantage and more like an industry standard. Remaining relevant as the marketplace evolves requires a full understanding and application of big data, and The Big Data-Driven Business provides the practical guidance businesses need.

HBase Essentials

HBase Essentials provides a hands-on introduction to HBase, a distributed database built on top of the Hadoop ecosystem. Through practical examples and clear explanations, you will learn how to set up, use, and administer HBase to manage high-volume, high-velocity data efficiently.

What this book will help me do:

- Understand the importance and use cases of HBase for managing Big Data
- Successfully set up and configure an HBase cluster in your environment
- Develop data models in HBase and perform CRUD operations effectively
- Learn advanced HBase features like counters, coprocessors, and integration with MapReduce
- Master cluster management and performance tuning for optimal HBase operations

Author(s): Garg is a seasoned Big Data engineer with extensive experience in distributed databases and the Hadoop ecosystem. Having worked on complex data systems, the author brings practical insights to understanding and implementing HBase. Known for a clear and approachable writing style, the author aims to make learning technical subjects accessible.

Who is it for? HBase Essentials is ideal for developers and Big Data engineers keen to build expertise in distributed databases. If you have a basic understanding of HDFS or MapReduce, or have experience with NoSQL databases, this book will accelerate your knowledge of HBase. It's tailored for those seeking to leverage HBase for scalable and reliable data solutions. Whether you're starting with HBase or expanding your Big Data skillset, this guide provides the tools to succeed.
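
As a minimal sketch of the CRUD operations such a book covers, the snippet below uses the happybase Python client, which talks to HBase through its Thrift gateway. The table name, column family, and connection details are illustrative assumptions, not examples from the book.

    import happybase

    # Assumes an HBase Thrift server on localhost (default port 9090).
    connection = happybase.Connection("localhost")
    table = connection.table("users")  # hypothetical table with family "info"

    # Create / update: put a row keyed by user id.
    table.put(b"user:1", {b"info:name": b"Ada", b"info:city": b"London"})

    # Read: fetch the row back.
    print(table.row(b"user:1"))

    # Scan: iterate over rows sharing a key prefix.
    for key, data in table.scan(row_prefix=b"user:"):
        print(key, data)

    # Delete the row, then close the connection.
    table.delete(b"user:1")
    connection.close()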

IBM SAN Volume Controller 2145-DH8 Introduction and Implementation

Data is the new currency of business, the most critical asset of the modern organization. In fact, enterprises that can gain business insights from their data are twice as likely to outperform their competitors; yet, 72 percent of them have not started or are only planning big data activities. In addition, organizations often spend too much money and time managing where their data is stored. The average firm purchases 24% more storage every year, but uses less than half of the capacity it already has.

A member of the IBM® Storwize® family, IBM SAN Volume Controller (SVC) Data Platform is a storage virtualization system that enables a single point of control for storage resources to help support improved business application availability and greater resource utilization. The objective is to manage storage resources in your IT infrastructure and to make sure they are used to the advantage of your business, and do it quickly, efficiently, and in real time, while avoiding increases in administrative costs.

Virtualizing storage with SVC Data Platform helps make new and existing storage more effective. SVC Data Platform includes many functions traditionally deployed separately in disk systems. By including these in a virtualization system, SVC Data Platform standardizes functions across virtualized storage for greater flexibility and potentially lower costs.

SVC Data Platform functions benefit all virtualized storage. For example, IBM Easy Tier® optimizes use of flash storage. And IBM Real-time Compression™ enhances efficiency even further by enabling the storage of up to five times as much active primary data in the same physical disk space. Finally, high-performance thin provisioning helps automate provisioning. These benefits can help extend the useful life of existing storage assets, reducing costs. Integrating these functions into SVC Data Platform also means that they are designed to operate smoothly together, reducing management effort.

In this IBM Redbooks® publication, we discuss the latest features and functions of the SVC 2145-DH8 and software version 7.3, implementation, architectural improvements, and Easy Tier.

Hadoop in Practice, Second Edition

Hadoop in Practice, Second Edition provides over 100 tested, instantly useful techniques that will help you conquer big data using Hadoop. This revised new edition covers changes and new features in the Hadoop core architecture, including MapReduce 2. Brand new chapters cover YARN and integrating Kafka, Impala, and Spark SQL with Hadoop. You'll also get new and updated techniques for Flume, Sqoop, and Mahout, all of which have seen major new versions recently. In short, this is the most practical, up-to-date coverage of Hadoop available anywhere.

About the book: It's always a good time to upgrade your Hadoop skills! Hadoop in Practice, Second Edition provides a collection of 104 tested, instantly useful techniques for analyzing real-time streams, moving data securely, machine learning, managing large-scale clusters, and taming big data using Hadoop. This completely revised edition covers changes and new features in Hadoop core, including MapReduce 2 and YARN. You'll pick up hands-on best practices for integrating Spark, Kafka, and Impala with Hadoop, and get new and updated techniques for the latest versions of Flume, Sqoop, and Mahout. In short, this is the most practical, up-to-date coverage of Hadoop available. Readers need to know a programming language like Java and have basic familiarity with Hadoop.

What's inside:

- Thoroughly updated for Hadoop 2
- How to write YARN applications
- Integrate real-time technologies like Storm, Impala, and Spark
- Predictive analytics using Mahout and R

About the author: Alex Holmes works on tough big-data problems. He is a software engineer, author, speaker, and blogger specializing in large-scale Hadoop projects.

Quotes:

- "Very insightful. A deep dive into the Hadoop world." - Andrea Tarocchi, Red Hat, Inc.
- "The most complete material on Hadoop and its ecosystem known to mankind!" - Arthur Zubarev, Vital Insights
- "Clear and concise, full of insights and highly applicable information." - Edward de Oliveira Ribeiro, DataStax, Inc.
- "Comprehensive up-to-date coverage of Hadoop 2." - Muthusamy Manigandan, OzoneMedia
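
The Kafka integration mentioned above concerns moving streams of records toward Hadoop. As a minimal, book-independent sketch of the producer side using the kafka-python library (the broker address and topic name are assumptions):

    from kafka import KafkaProducer

    # Assumes a Kafka broker on localhost:9092 and an existing topic.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    # Publish a few log lines; a downstream job could land these in HDFS.
    for line in ("event one", "event two", "event three"):
        producer.send("hadoop-ingest", line.encode("utf-8"))

    producer.flush()  # block until all queued messages are delivered
    producer.close()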

Getting Started with Impala

Learn how to write, tune, and port SQL queries and other statements for a Big Data environment, using Impala—the massively parallel processing SQL query engine for Apache Hadoop. The best practices in this practical guide help you design database schemas that not only interoperate with other Hadoop components, and are convenient for administrators to manage and monitor, but also accommodate future expansion in data size and evolution of software capabilities.

Written by John Russell, documentation lead for the Cloudera Impala project, this book gets you working with the most recent Impala releases quickly. Ideal for database developers and business analysts, the latest revision covers analytics functions, complex types, incremental statistics, subqueries, and submission to the Apache incubator. Getting Started with Impala includes advice from Cloudera’s development team, as well as insights from its consulting engagements with customers.

- Learn how Impala integrates with a wide range of Hadoop components
- Attain high performance and scalability for huge data sets on production clusters
- Explore common developer tasks, such as porting code to Impala and optimizing performance
- Use tutorials for working with billion-row tables, date- and time-based values, and other techniques
- Learn how to transition from rigid schemas to a flexible model that evolves as needs change
- Take a deep dive into joins and the roles of statistics
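
As a minimal sketch of issuing an analytic query to Impala from Python, here is the impyla DB-API client; the host name, table, and query are illustrative assumptions (21050 is the Impala daemon's default HiveServer2-protocol port).

    from impala.dbapi import connect

    # Connection details are assumptions; adjust for your cluster.
    conn = connect(host="impala-host.example.com", port=21050)
    cursor = conn.cursor()

    # A group-by query over a hypothetical billion-row "taxi_rides" table.
    cursor.execute(
        "SELECT passenger_count, COUNT(*) AS rides "
        "FROM taxi_rides GROUP BY passenger_count ORDER BY rides DESC"
    )
    for row in cursor.fetchall():
        print(row)

    cursor.close()
    conn.close()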

Master Competitive Analytics with Oracle Endeca Information Discovery

Oracle Endeca Information Discovery best practices: maximize the powerful capabilities of this self-service enterprise data discovery platform. Master Competitive Analytics with Oracle Endeca Information Discovery reveals how to unlock insights from any type of data, regardless of structure. The first part of the book is a complete technical guide to the product's architecture, components, and implementation. The second part presents a comprehensive collection of business analytics use cases in various industries, including financial services, healthcare, research, manufacturing, retail, consumer packaged goods, and public sector. Step-by-step instructions on implementing some of these use cases are included in this Oracle Press book.

- Install and manage Oracle Endeca Server
- Design Oracle Endeca Information Discovery Studio visualizations to facilitate user-driven data exploration and discovery
- Enable enterprise-driven data exploration with Oracle Endeca Information Discovery Integrator
- Develop and implement a fraud detection and analysis application
- Build a healthcare correlation application that integrates claims, patient, and operations analysis; partners; clinical research; and remote monitoring
- Use an enterprise architecture approach to incrementally establish big data and analytical capabilities

Sams Teach Yourself NoSQL with MongoDB in 24 Hours

NoSQL database usage is growing at a stunning 50% per year, as organizations discover NoSQL's potential to address even the most challenging Big Data and real-time database problems. Every NoSQL database is different, but one is the most popular by far: MongoDB. Now, in just 24 lessons of one hour or less, you can learn how to leverage MongoDB's immense power. Each short, easy lesson builds on all that's come before, teaching NoSQL concepts and MongoDB techniques from the ground up. Sams Teach Yourself NoSQL with MongoDB in 24 Hours covers all this, and much more:

- Learning how NoSQL is different, when to use it, and when to use traditional RDBMSes instead
- Designing and implementing MongoDB databases of diverse types and sizes
- Storing and interacting with data via Java, PHP, Python, and Node.js/Mongoose
- Choosing the right NoSQL distribution model for your application
- Installing and configuring MongoDB
- Designing MongoDB data models, including collections, indexes, and GridFS
- Balancing consistency, performance, and durability
- Leveraging the immense power of Map-Reduce
- Administering, monitoring, securing, backing up, and repairing MongoDB databases
- Mastering advanced techniques such as sharding and replication
- Optimizing performance
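
As a minimal sketch of the Python route listed above, here is basic CRUD with the pymongo driver; the database and collection names are illustrative assumptions, not examples from the book.

    from pymongo import MongoClient

    # Assumes a mongod instance on localhost:27017.
    client = MongoClient("mongodb://localhost:27017")
    words = client.teach_yourself.words  # hypothetical db and collection

    # Create, update, read, delete.
    words.insert_one({"word": "big", "count": 1})
    words.update_one({"word": "big"}, {"$inc": {"count": 1}})
    print(words.find_one({"word": "big"}))
    words.delete_one({"word": "big"})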

Pro Apache Hadoop, Second Edition

Pro Apache Hadoop, Second Edition brings you up to speed on Hadoop, the framework of big data. Revised to cover Hadoop 2.0, the book covers the very latest developments such as YARN (aka MapReduce 2.0), new HDFS high-availability features, and increased scalability in the form of HDFS Federations. All the old content has been revised too, giving the latest on the ins and outs of MapReduce, cluster design, the Hadoop Distributed File System, and more.

This book covers everything you need to build your first Hadoop cluster and begin analyzing and deriving value from your business and scientific data. Learn to solve big-data problems the MapReduce way, by breaking a big problem into chunks and creating small-scale solutions that can be flung across thousands upon thousands of nodes to analyze large data volumes in a short amount of wall-clock time. Learn how to let Hadoop take care of distributing and parallelizing your software: you just focus on the code, and Hadoop takes care of the rest. The book:

- Covers all that is new in Hadoop 2.0
- Written by a professional involved in Hadoop since day one
- Takes you quickly to the seasoned pro level on the hottest cloud-computing framework
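
The "MapReduce way" described above (per-record map steps, a shuffle, then per-key reduce steps) can be sketched in plain Python. This is a local simulation of the classic word count, not code from the book; on a real cluster the framework performs the shuffle and sort across nodes.

    import sys
    from itertools import groupby

    def mapper(lines):
        # Map: emit a (word, 1) pair for every word in every input line.
        for line in lines:
            for word in line.split():
                yield word.lower(), 1

    def reducer(pairs):
        # Reduce: sum the counts per word (pairs arrive sorted by key).
        for word, group in groupby(pairs, key=lambda kv: kv[0]):
            yield word, sum(count for _, count in group)

    if __name__ == "__main__":
        # Simulate map -> shuffle/sort -> reduce over stdin:
        #   cat input.txt | python wordcount.py
        shuffled = sorted(mapper(sys.stdin))
        for word, total in reducer(shuffled):
            print(word, total)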

Cloudera Administration Handbook

Discover how to effectively administer large Apache Hadoop clusters with the Cloudera Administration Handbook. This guide offers step-by-step instructions and practical examples, enabling you to confidently set up and manage Hadoop environments using Cloudera Manager and CDH5 tools. Through this book, administrators or aspiring experts can unlock the power of distributed computing and streamline cluster operations.

What this book will help me do:

- Gain an in-depth understanding of Apache Hadoop architecture and its operational framework
- Master the setup, configuration, and management of Hadoop clusters using Cloudera tools
- Implement robust security measures in your cluster, including Kerberos authentication
- Optimize for reliability with advanced HDFS features like High Availability and Federation
- Streamline cluster management and address troubleshooting effectively using best practices

Author(s): Menon is an experienced technologist specializing in distributed computing and data infrastructure. With a strong background in big data platforms and certifications in Hadoop administration, the author has helped enterprises optimize their cluster deployments. Their instructional approach combines clarity, practical insights, and a hands-on focus.

Who is it for? This book is ideal for systems administrators, data engineers, and IT professionals keen on mastering Hadoop environments. It serves both beginners getting started with cluster setup and seasoned administrators seeking advanced configurations. If you're aiming to efficiently manage Hadoop clusters using Cloudera solutions, this guide provides the knowledge and tools you need.

Understanding Big Data Scalability: Big Data Scalability Series, Part I

Get started scaling your database infrastructure for high-volume Big Data applications.

“Understanding Big Data Scalability presents the fundamentals of scaling databases from a single node to large clusters. It provides a practical explanation of what ‘Big Data’ systems are, and fundamental issues to consider when optimizing for performance and scalability. Cory draws on many years of experience to explain issues involved in working with data sets that can no longer be handled with single, monolithic relational databases.... His approach is particularly relevant now that relational data models are making a comeback via SQL interfaces to popular NoSQL databases and Hadoop distributions.... This book should be especially useful to database practitioners new to scaling databases beyond traditional single node deployments.” —Brian O’Krafka, software architect

Understanding Big Data Scalability presents a solid foundation for scaling Big Data infrastructure and helps you address each crucial factor associated with optimizing performance in scalable and dynamic Big Data clusters.

Database expert Cory Isaacson offers practical, actionable insights for every technical professional who must scale a database tier for high-volume applications. Focusing on today’s most common Big Data applications, he introduces proven ways to manage unprecedented data growth from widely diverse sources and to deliver real-time processing at levels that were inconceivable until recently. Isaacson explains why databases slow down, reviews each major technique for scaling database applications, and identifies the key rules of database scalability that every architect should follow. You’ll find insights and techniques proven with all types of database engines and environments, including SQL, NoSQL, and Hadoop. Two start-to-finish case studies walk you through planning and implementation, offering specific lessons for formulating your own scalability strategy.

Coverage includes:

- Understanding the true causes of database performance degradation in today’s Big Data environments
- Scaling smoothly to petabyte-class databases and beyond
- Defining database clusters for maximum scalability and performance
- Integrating NoSQL or columnar databases that aren’t “drop-in” replacements for RDBMSes
- Scaling application components: solutions and options for each tier
- Recognizing when to scale your data tier—a decision with enormous consequences for your application environment
- Why data relationships may be even more important in non-relational databases
- Why virtually every database scalability implementation still relies on sharding, and how to choose the best approach
- How to set clear objectives for architecting high-performance Big Data implementations

The Big Data Scalability Series is a comprehensive, four-part series containing information on many facets of database performance and scalability. Understanding Big Data Scalability is the first book in the series. Learn more and join the conversation about Big Data scalability at bigdatascalability.com.
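
Isaacson's point that virtually every scalability implementation still relies on sharding comes down to a routing function from key to node. A minimal hash-sharding sketch in Python, where the node names, shard count, and customer-id keys are illustrative assumptions:

    import hashlib

    # Hypothetical shard nodes; four is an arbitrary choice.
    SHARDS = ["db-node-0", "db-node-1", "db-node-2", "db-node-3"]

    def shard_for(key: str) -> str:
        # Hash the key and map it deterministically onto one shard.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return SHARDS[int(digest, 16) % len(SHARDS)]

    # Every read and write for a given customer routes to the same node.
    for customer_id in ("cust-1001", "cust-1002", "cust-1003"):
        print(customer_id, "->", shard_for(customer_id))

Note that this naive modulo scheme remaps nearly every key when the shard count changes, which is one reason production systems often prefer consistent hashing or directory-based sharding.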

Large Scale and Big Data

Large Scale and Big Data: Processing and Management provides readers with a central source of reference on the data management techniques currently available for large-scale data processing. Presenting chapters written by leading researchers, academics, and practitioners, it addresses the fundamental challenges associated with Big Data processing tools and techniques across a range of computing environments.

The book begins by discussing the basic concepts and tools of large-scale Big Data processing and cloud computing. It also provides an overview of different programming models and cloud-based deployment models. The book’s second section examines the usage of advanced Big Data processing techniques in different domains, including semantic web, graph processing, and stream processing. The third section discusses advanced topics of Big Data processing such as consistency management, privacy, and security.

Supplying a comprehensive summary from both the research and applied perspectives, the book covers recent research discoveries and applications, making it an ideal reference for a wide range of audiences, including researchers and academics working on databases, data mining, and web scale data processing. After reading this book, you will gain a fundamental understanding of how to use Big Data-processing tools and techniques effectively across application domains. Coverage includes cloud data management architectures, big data analytics visualization, data management, analytics for vast amounts of unstructured data, clustering, classification, link analysis of big data, scalable data mining, and machine learning techniques.

Big Data Analytics Beyond Hadoop: Real-Time Applications with Storm, Spark, and More Hadoop Alternatives

Master alternative Big Data technologies that can do what Hadoop can't: real-time analytics and iterative machine learning. When most technical professionals think of Big Data analytics today, they think of Hadoop. But there are many cutting-edge applications that Hadoop isn't well suited for, especially real-time analytics and contexts requiring the use of iterative machine learning algorithms. Fortunately, several powerful new technologies have been developed specifically for use cases such as these. Big Data Analytics Beyond Hadoop is the first guide specifically designed to help you take the next steps beyond Hadoop.

Dr. Vijay Srinivas Agneeswaran introduces the breakthrough Berkeley Data Analysis Stack (BDAS) in detail, including its motivation, design, architecture, Mesos cluster management, performance, and more. He presents realistic use cases and up-to-date example code for:

- Spark, the next-generation in-memory computing technology from UC Berkeley
- Storm, the parallel real-time Big Data analytics technology from Twitter
- GraphLab, the next-generation graph processing paradigm from CMU and the University of Washington (with comparisons to alternatives such as Pregel and Piccolo)

He also offers architectural and design guidance and code sketches for scaling machine learning algorithms to Big Data, and then realizing them in real time. He concludes by previewing emerging trends, including real-time video analytics, SDNs, and even Big Data governance, security, and privacy issues. He identifies intriguing startups and new research possibilities, including BDAS extensions and cutting-edge model-driven analytics. Big Data Analytics Beyond Hadoop is an indispensable resource for everyone who wants to reach the cutting edge of Big Data analytics, and stay there: practitioners, architects, programmers, data scientists, researchers, startup entrepreneurs, and advanced students.
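
To make the in-memory contrast with classic MapReduce concrete, here is a minimal PySpark word count; the local master setting and the HDFS input path are illustrative assumptions, not examples from the book.

    from pyspark import SparkContext

    # Local mode for illustration; on a cluster you would point at a master URL.
    sc = SparkContext("local[*]", "wordcount")

    counts = (
        sc.textFile("hdfs:///data/input.txt")   # input path is an assumption
          .flatMap(lambda line: line.split())   # one record per word
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b)      # aggregation happens in memory
    )

    for word, total in counts.take(10):
        print(word, total)

    sc.stop()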

Pig Design Patterns

Discover how to simplify Hadoop programming with Pig Design Patterns, helping you create innovative enterprise-level big data solutions. This book takes you step-by-step through practical design patterns for creating efficient data processing workflows with Apache Pig.

What this book will help me do:

- Understand and implement fundamental data processing patterns with Pig
- Master advanced Pig techniques for Big Data analytics
- Learn to optimize Pig scripts for performance and scalability
- Build end-to-end data processing solutions with real-world examples
- Integrate Pig workflows into the broader Hadoop ecosystem

Author(s): Pradeep Pasupuleti is an experienced data engineer and software developer specializing in Big Data technologies. With extensive expertise in Hadoop and Pig, Pradeep shares valuable insights and practical techniques beginners and experts alike will appreciate.

Who is it for? This book is perfect for software developers and data engineers working with Hadoop who want to streamline their workflow. It is ideal for professionals already familiar with Pig and Hadoop basics looking to advance. It also suits learners aiming to implement optimized data solutions effectively.
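
Pig Latin itself is beyond the scope of a blurb, but the core dataflow pattern its scripts express (load, group, aggregate, order) can be sketched as a rough Python analogue. The CSV file name and its user/url/bytes columns are hypothetical, and this stands in for the Pig pattern rather than reproducing any script from the book.

    import csv
    from collections import defaultdict

    # Rough analogue of: LOAD logs; GROUP logs BY user; FOREACH ... SUM(bytes);
    totals = defaultdict(int)

    with open("access_log.csv", newline="") as f:   # hypothetical input file
        for row in csv.DictReader(f):               # columns: user, url, bytes
            totals[row["user"]] += int(row["bytes"])

    # Analogue of ORDER BY total DESC, then DUMP.
    for user, total in sorted(totals.items(), key=lambda kv: -kv[1]):
        print(user, total)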

Hadoop For Dummies

Let Hadoop For Dummies help harness the power of your data and rein in the information overload. Big data has become big business, and companies and organizations of all sizes are struggling to find ways to retrieve valuable information from their massive data sets without becoming overwhelmed. Enter Hadoop and this easy-to-understand For Dummies guide. Hadoop For Dummies helps readers understand the value of big data, make a business case for using Hadoop, navigate the Hadoop ecosystem, and build and manage Hadoop applications and clusters. The book:

- Explains the origins of Hadoop, its economic benefits, and its functionality and practical applications
- Helps you find your way around the Hadoop ecosystem, program MapReduce, utilize design patterns, and get your Hadoop cluster up and running quickly and easily
- Details how to use Hadoop applications for data mining, web analytics and personalization, large-scale text processing, data science, and problem-solving
- Shows you how to improve the value of your Hadoop cluster, maximize your investment in Hadoop, and avoid common pitfalls when building your Hadoop cluster

From programmers challenged with building and maintaining affordable, scalable data systems to administrators who must deal with huge volumes of information effectively and efficiently, this how-to has something to help you with Hadoop.

Think Bigger
Big data--the enormous amount of data that is created as virtually every movement, transaction, and choice we make becomes digitized--is revolutionizing business. Offering real-world insight and explanations, this book provides a roadmap for organizations looking to develop a profitable big data strategy...and reveals why it's not something they can leave to the I.T. department.

Sharing best practices from companies that have implemented a big data strategy including Walmart, InterContinental Hotel Group, Walt Disney, and Shell, Think Bigger covers the most important big data trends affecting organizations, as well as key technologies like Hadoop and MapReduce, and several crucial types of analyses. In addition, the book offers guidance on how to ensure security, and respect the privacy rights of consumers. It also examines in detail how big data is impacting specific industries--and where opportunities can be found.

Big data is changing the way businesses--and even governments--are operated and managed. Think Bigger is an essential resource for anyone who wants to ensure that their company isn't left in the dust.