talk-data.com

Topic: Hadoop (Apache Hadoop)
Tags: big_data, distributed_computing, data_processing
258 tagged activities

Activity Trend (chart): peaks of 3 activities per quarter, 2020-Q1 through 2026-Q1

Activities

258 activities · Newest first

Hadoop in Practice, Second Edition

Hadoop in Practice, Second Edition provides over 100 tested, instantly useful techniques that will help you conquer big data using Hadoop. This revised new edition covers changes and new features in the Hadoop core architecture, including MapReduce 2. Brand new chapters cover YARN and integrating Kafka, Impala, and Spark SQL with Hadoop. You'll also get new and updated techniques for Flume, Sqoop, and Mahout, all of which have seen major new versions recently. In short, this is the most practical, up-to-date coverage of Hadoop available anywhere. It's always a good time to upgrade your Hadoop skills! Hadoop in Practice, Second Edition provides a collection of 104 tested, instantly useful techniques for analyzing real-time streams, moving data securely, machine learning, managing large-scale clusters, and taming big data using Hadoop. This completely revised edition covers changes and new features in Hadoop core, including MapReduce 2 and YARN. You'll pick up hands-on best practices for integrating Spark, Kafka, and Impala with Hadoop, and get new and updated techniques for the latest versions of Flume, Sqoop, and Mahout. In short, this is the most practical, up-to-date coverage of Hadoop available. Readers need to know a programming language like Java and have basic familiarity with Hadoop. What's inside: thorough updates for Hadoop 2, how to write YARN applications, integrating real-time technologies like Storm, Impala, and Spark, and predictive analytics using Mahout and R. About the author: Alex Holmes works on tough big-data problems. He is a software engineer, author, speaker, and blogger specializing in large-scale Hadoop projects. Quotes: "Very insightful. A deep dive into the Hadoop world." - Andrea Tarocchi, Red Hat, Inc. "The most complete material on Hadoop and its ecosystem known to mankind!" - Arthur Zubarev, Vital Insights. "Clear and concise, full of insights and highly applicable information." - Edward de Oliveira Ribeiro, DataStax, Inc. "Comprehensive, up-to-date coverage of Hadoop 2." - Muthusamy Manigandan, OzoneMedia

Getting Started with Impala

Learn how to write, tune, and port SQL queries and other statements for a Big Data environment, using Impala—the massively parallel processing SQL query engine for Apache Hadoop. The best practices in this practical guide help you design database schemas that not only interoperate with other Hadoop components and are convenient for administrators to manage and monitor, but also accommodate future expansion in data size and evolution of software capabilities. Written by John Russell, documentation lead for the Cloudera Impala project, this book gets you working with the most recent Impala releases quickly. Ideal for database developers and business analysts, the latest revision covers analytics functions, complex types, incremental statistics, subqueries, and submission to the Apache incubator. Getting Started with Impala includes advice from Cloudera’s development team, as well as insights from its consulting engagements with customers. Learn how Impala integrates with a wide range of Hadoop components Attain high performance and scalability for huge data sets on production clusters Explore common developer tasks, such as porting code to Impala and optimizing performance Use tutorials for working with billion-row tables, date- and time-based values, and other techniques Learn how to transition from rigid schemas to a flexible model that evolves as needs change Take a deep dive into joins and the roles of statistics
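
The blurb above centers on writing and tuning SQL statements against Impala. As a rough illustration of what issuing such a query programmatically can look like, here is a minimal sketch using the third-party impyla client from Python; the host, port, and table names are hypothetical, and the book itself works primarily through impala-shell and standard SQL rather than this particular library.

```python
# Minimal sketch: run an Impala query from Python via impyla (assumption --
# not the tooling the book uses). Hostname, port, and table are placeholders.
from impala.dbapi import connect

conn = connect(host="impala-coordinator.example.com", port=21050)
cur = conn.cursor()

# Aggregate query against a hypothetical events table stored in HDFS.
cur.execute("""
    SELECT event_date, COUNT(*) AS events
    FROM analytics.events
    GROUP BY event_date
    ORDER BY event_date
""")

for event_date, events in cur.fetchall():
    print(event_date, events)

cur.close()
conn.close()
```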

Using Flume

How can you get your data from frontend servers to Hadoop in near real time? With this complete reference guide, you’ll learn Flume’s rich set of features for collecting, aggregating, and writing large amounts of streaming data to the Hadoop Distributed File System (HDFS), Apache HBase, SolrCloud, Elasticsearch, and other systems. Using Flume shows operations engineers how to configure, deploy, and monitor a Flume cluster, and teaches developers how to write Flume plugins and custom components for their specific use cases. You’ll learn about Flume’s design and implementation, as well as various features that make it highly scalable, flexible, and reliable. Code examples and exercises are available on GitHub. Learn how Flume provides a steady rate of flow by acting as a buffer between data producers and consumers Dive into key Flume components, including sources that accept data and sinks that write and deliver it Write custom plugins to customize the way Flume receives, modifies, formats, and writes data Explore APIs for sending data to Flume agents from your own applications Plan and deploy Flume in a scalable and flexible way—and monitor your cluster once it’s running

Pro Apache Hadoop, Second Edition

Pro Apache Hadoop, Second Edition brings you up to speed on Hadoop, the framework of big data. Revised to cover Hadoop 2.0, the book covers the very latest developments such as YARN (aka MapReduce 2.0), new HDFS high-availability features, and increased scalability in the form of HDFS Federations. All the old content has been revised too, giving the latest on the ins and outs of MapReduce, cluster design, the Hadoop Distributed File System, and more. This book covers everything you need to build your first Hadoop cluster and begin analyzing and deriving value from your business and scientific data. Learn to solve big-data problems the MapReduce way, by breaking a big problem into chunks and creating small-scale solutions that can be flung across thousands upon thousands of nodes to analyze large data volumes in a short amount of wall-clock time. Learn how to let Hadoop take care of distributing and parallelizing your software: you just focus on the code, and Hadoop takes care of the rest. Covers all that is new in Hadoop 2.0 Written by a professional involved in Hadoop since day one Takes you quickly to the seasoned pro level on the hottest cloud-computing framework
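
The description emphasizes solving problems "the MapReduce way" by breaking work into chunks that run across many nodes. As a hedged sketch of that model, the following word-count mapper and reducer follow Hadoop Streaming conventions, reading stdin and writing tab-separated key/value pairs; the book's own examples target the Java MapReduce API, so treat this only as an illustration of the idea, with the script name and invocation style as assumptions.

```python
#!/usr/bin/env python3
# Word count "the MapReduce way" via Hadoop Streaming (illustrative sketch).
# Typical invocation (hypothetical paths): hadoop jar hadoop-streaming.jar \
#   -input /logs -output /counts -mapper "wordcount.py map" -reducer "wordcount.py reduce"
import sys

def mapper():
    # Emit "word<TAB>1" for every word; Hadoop shuffles and sorts by key.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Input arrives grouped and sorted by key; sum the counts per word.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```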

Cloudera Administration Handbook

Discover how to effectively administer large Apache Hadoop clusters with the Cloudera Administration Handbook. This guide offers step-by-step instructions and practical examples, enabling you to confidently set up and manage Hadoop environments using Cloudera Manager and CDH5 tools. Through this book, administrators or aspiring experts can unlock the power of distributed computing and streamline cluster operations. What this Book will help me do Gain in-depth understanding of Apache Hadoop architecture and its operational framework. Master the setup, configuration, and management of Hadoop clusters using Cloudera tools. Implement robust security measures in your cluster including Kerberos authentication. Optimize for reliability with advanced HDFS features like High Availability and Federation. Streamline cluster management and address troubleshooting effectively using best practices. Author(s) Menon is an experienced technologist specializing in distributed computing and data infrastructure. With a strong background in big data platforms and certifications in Hadoop administration, Menon has helped enterprises optimize their cluster deployments. Their instructional approach combines clarity, practical insights, and a hands-on focus. Who is it for? This book is ideal for systems administrators, data engineers, and IT professionals keen on mastering Hadoop environments. It serves both beginners getting started with cluster setup and seasoned administrators seeking advanced configurations. If you're aiming to efficiently manage Hadoop clusters using Cloudera solutions, this guide provides the knowledge and tools you need.

Understanding Big Data Scalability: Big Data Scalability Series, Part I

Get Started Scaling Your Database Infrastructure for High-Volume Big Data Applications “Understanding Big Data Scalability presents the fundamentals of scaling databases from a single node to large clusters. It provides a practical explanation of what ‘Big Data’ systems are, and fundamental issues to consider when optimizing for performance and scalability. Cory draws on many years of experience to explain issues involved in working with data sets that can no longer be handled with single, monolithic relational databases.... His approach is particularly relevant now that relational data models are making a comeback via SQL interfaces to popular NoSQL databases and Hadoop distributions.... This book should be especially useful to database practitioners new to scaling databases beyond traditional single node deployments.” —Brian O’Krafka, software architect. Understanding Big Data Scalability presents a solid foundation for scaling Big Data infrastructure and helps you address each crucial factor associated with optimizing performance in scalable and dynamic Big Data clusters. Database expert Cory Isaacson offers practical, actionable insights for every technical professional who must scale a database tier for high-volume applications. Focusing on today’s most common Big Data applications, he introduces proven ways to manage unprecedented data growth from widely diverse sources and to deliver real-time processing at levels that were inconceivable until recently. Isaacson explains why databases slow down, reviews each major technique for scaling database applications, and identifies the key rules of database scalability that every architect should follow. You’ll find insights and techniques proven with all types of database engines and environments, including SQL, NoSQL, and Hadoop. Two start-to-finish case studies walk you through planning and implementation, offering specific lessons for formulating your own scalability strategy. Coverage includes Understanding the true causes of database performance degradation in today’s Big Data environments Scaling smoothly to petabyte-class databases and beyond Defining database clusters for maximum scalability and performance Integrating NoSQL or columnar databases that aren’t “drop-in” replacements for RDBMSes Scaling application components: solutions and options for each tier Recognizing when to scale your data tier—a decision with enormous consequences for your application environment Why data relationships may be even more important in non-relational databases Why virtually every database scalability implementation still relies on sharding, and how to choose the best approach How to set clear objectives for architecting high-performance Big Data implementations The Big Data Scalability Series is a comprehensive, four-part series containing information on many facets of database performance and scalability. Understanding Big Data Scalability is the first book in the series. Learn more and join the conversation about Big Data scalability at bigdatascalability.com.
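
Since the coverage list singles out sharding as the technique behind virtually every database scalability implementation, here is a minimal, illustrative sketch of hash-based shard routing in Python. The shard names and key format are hypothetical, and production systems typically layer consistent hashing, directory services, or range partitioning on top of this basic idea.

```python
# Minimal sketch of hash-based shard routing (illustrative assumption).
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

def shard_for(key: str) -> str:
    # Hash the sharding key and map it onto a fixed set of shards,
    # so reads and writes for the same key always land on the same node.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("customer:42"))
print(shard_for("customer:1138"))
```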

Google BigQuery Analytics

How to effectively use BigQuery, avoid common mistakes, and execute sophisticated queries against large datasets Google BigQuery Analytics is the perfect guide for business and data analysts who want the latest tips on running complex queries and writing code to communicate with the BigQuery API. The book uses real-world examples to demonstrate current best practices and techniques, and also explains and demonstrates streaming ingestion, transformation via Hadoop in Google Compute Engine, App Engine Datastore integration, and using GViz with Tableau to generate charts of query results. In addition to the mechanics of BigQuery, the book also covers the architecture of the underlying Dremel query engine, providing a thorough understanding that leads to better query results. Features a companion website that includes all code and data sets from the book Uses real-world examples to explain everything analysts need to know to effectively use BigQuery Includes web application examples coded in Python
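
The book walks through communicating with the BigQuery API from code. As a rough modern equivalent, the sketch below runs a query with the google-cloud-bigquery Python client; note that this client library postdates the book, which targets the REST API of its era, and the project name here is a placeholder.

```python
# Minimal sketch: run a BigQuery query with the google-cloud-bigquery client
# (an assumption -- not the API vintage the book covers). Project is a placeholder.
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")

query = """
    SELECT word, SUM(word_count) AS total
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY word
    ORDER BY total DESC
    LIMIT 10
"""

# client.query() submits the job; result() waits and returns an iterator of rows.
for row in client.query(query).result():
    print(row.word, row.total)
```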

Big Data Analytics Beyond Hadoop: Real-Time Applications with Storm, Spark, and More Hadoop Alternatives

Master alternative Big Data technologies that can do what Hadoop can't: real-time analytics and iterative machine learning. When most technical professionals think of Big Data analytics today, they think of Hadoop. But there are many cutting-edge applications that Hadoop isn't well suited for, especially real-time analytics and contexts requiring the use of iterative machine learning algorithms. Fortunately, several powerful new technologies have been developed specifically for use cases such as these. Big Data Analytics Beyond Hadoop is the first guide specifically designed to help you take the next steps beyond Hadoop. Dr. Vijay Srinivas Agneeswaran introduces the breakthrough Berkeley Data Analysis Stack (BDAS) in detail, including its motivation, design, architecture, Mesos cluster management, performance, and more. He presents realistic use cases and up-to-date example code for: Spark, the next-generation in-memory computing technology from UC Berkeley Storm, the parallel real-time Big Data analytics technology from Twitter GraphLab, the next-generation graph processing paradigm from CMU and the University of Washington (with comparisons to alternatives such as Pregel and Piccolo) He also offers architectural and design guidance and code sketches for scaling machine learning algorithms to Big Data, and then realizing them in real-time. He concludes by previewing emerging trends, including real-time video analytics, SDNs, and even Big Data governance, security, and privacy issues. He identifies intriguing startups and new research possibilities, including BDAS extensions and cutting-edge model-driven analytics. Big Data Analytics Beyond Hadoop is an indispensable resource for everyone who wants to reach the cutting edge of Big Data analytics, and stay there: practitioners, architects, programmers, data scientists, researchers, startup entrepreneurs, and advanced students.
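
One of the book's central themes is Spark's in-memory model, which keeps a dataset cached so several computations can reuse it without re-reading from disk. The following PySpark sketch illustrates that pattern; the input path and local master setting are assumptions for demonstration, not code from the book.

```python
# Minimal PySpark sketch of in-memory reuse: load once, cache, compute twice.
# Paths and the local[*] master are illustrative placeholders.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("beyond-hadoop-sketch").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Cache the RDD so both computations below reuse the in-memory copy.
lines = sc.textFile("hdfs:///data/events.log").cache()

error_count = lines.filter(lambda l: "ERROR" in l).count()
top_words = (lines.flatMap(lambda l: l.split())
                  .map(lambda w: (w, 1))
                  .reduceByKey(lambda a, b: a + b)
                  .take(10))

print(error_count, top_words)
sc.stop()
```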

Pig Design Patterns

Discover how to simplify Hadoop programming with Pig Design Patterns, helping you create innovative enterprise-level big data solutions. This book takes you step-by-step through practical design patterns for creating efficient data processing workflows with Apache Pig. What this Book will help me do Understand and implement fundamental data processing patterns with Pig. Master advanced Pig techniques for Big Data analytics. Learn to optimize Pig scripts for performance and scalability. Build end-to-end data processing solutions with real-world examples. Integrate Pig workflows into the broader Hadoop ecosystem. Author(s) Pradeep Pasupuleti is an experienced data engineer and software developer specializing in Big Data technologies. With extensive expertise in Hadoop and Pig, Pradeep shares valuable insights and practical techniques beginners and experts alike will appreciate. Who is it for? This book is perfect for software developers and data engineers working with Hadoop who want to streamline their workflow. It is ideal for professionals already familiar with Pig and Hadoop basics looking to advance. It also suits learners aiming to implement optimized data solutions effectively.

Hadoop For Dummies

Let Hadoop For Dummies help harness the power of your data and rein in the information overload. Big data has become big business, and companies and organizations of all sizes are struggling to find ways to retrieve valuable information from their massive data sets without becoming overwhelmed. Enter Hadoop and this easy-to-understand For Dummies guide. Hadoop For Dummies helps readers understand the value of big data, make a business case for using Hadoop, navigate the Hadoop ecosystem, and build and manage Hadoop applications and clusters. Explains the origins of Hadoop, its economic benefits, and its functionality and practical applications Helps you find your way around the Hadoop ecosystem, program MapReduce, utilize design patterns, and get your Hadoop cluster up and running quickly and easily Details how to use Hadoop applications for data mining, web analytics and personalization, large-scale text processing, data science, and problem-solving Shows you how to improve the value of your Hadoop cluster, maximize your investment in Hadoop, and avoid common pitfalls when building your Hadoop cluster From programmers challenged with building and maintaining affordable, scalable data systems to administrators who must deal with huge volumes of information effectively and efficiently, this how-to has something to help you with Hadoop.

Think Bigger
Big data--the enormous amount of data that is created as virtually every movement, transaction, and choice we make becomes digitized--is revolutionizing business. Offering real-world insight and explanations, this book provides a roadmap for organizations looking to develop a profitable big data strategy...and reveals why it's not something they can leave to the I.T. department.

Sharing best practices from companies that have implemented a big data strategy including Walmart, InterContinental Hotel Group, Walt Disney, and Shell, Think Bigger covers the most important big data trends affecting organizations, as well as key technologies like Hadoop and MapReduce, and several crucial types of analyses. In addition, the book offers guidance on how to ensure security, and respect the privacy rights of consumers. It also examines in detail how big data is impacting specific industries--and where opportunities can be found.

Big data is changing the way businesses--and even governments--are operated and managed. Think Bigger is an essential resource for anyone who wants to ensure that their company isn't left in the dust.

Apache Hadoop™ YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop™ 2

“This book is a critically needed resource for the newly released Apache Hadoop 2.0, highlighting YARN as the significant breakthrough that broadens Hadoop beyond the MapReduce paradigm.” —From the Foreword by Raymie Stata, CEO of Altiscale The Insider’s Guide to Building Distributed, Big Data Applications with Apache Hadoop™ YARN Apache Hadoop is helping drive the Big Data revolution. Now, its data processing has been completely overhauled: Apache Hadoop YARN provides resource management at data center scale and easier ways to create distributed applications that process petabytes of data. And now, in Apache Hadoop™ YARN, two Hadoop technical leaders show you how to develop new applications and adapt existing code to fully leverage these revolutionary advances. YARN project founder Arun Murthy and project lead Vinod Kumar Vavilapalli demonstrate how YARN increases scalability and cluster utilization, enables new programming models and services, and opens new options beyond Java and batch processing. They walk you through the entire YARN project lifecycle, from installation through deployment. You’ll find many examples drawn from the authors’ cutting-edge experience—first as Hadoop’s earliest developers and implementers at Yahoo! and now as Hortonworks developers moving the platform forward and helping customers succeed with it. Coverage includes YARN’s goals, design, architecture, and components—how it expands the Apache Hadoop ecosystem Exploring YARN on a single node Administering YARN clusters and Capacity Scheduler Running existing MapReduce applications Developing a large-scale clustered YARN application Discovering new open source frameworks that run under YARN
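
A full YARN application, as the book develops it, means writing an ApplicationMaster against the Java APIs. For a lighter-weight taste of the resource-management layer it describes, the sketch below polls the ResourceManager's REST endpoints (/ws/v1/cluster/metrics and /ws/v1/cluster/apps) from Python; the hostname is a hypothetical placeholder and this is only a peek at cluster state, not a substitute for the book's application-development walkthrough.

```python
# Minimal sketch: inspect a Hadoop 2 / YARN cluster through the
# ResourceManager REST API using requests. Hostname is a placeholder.
import requests

RM = "http://resourcemanager.example.com:8088"

# Cluster-wide metrics: node counts, running applications, and so on.
metrics = requests.get(f"{RM}/ws/v1/cluster/metrics").json()["clusterMetrics"]
print("active nodes:", metrics["activeNodes"])
print("apps running:", metrics["appsRunning"])

# List applications known to the ResourceManager and show the running ones.
apps = requests.get(f"{RM}/ws/v1/cluster/apps").json()
for app in (apps.get("apps") or {}).get("app", []):
    if app.get("state") == "RUNNING":
        print(app["id"], app["name"], app["queue"])
```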

Microsoft Big Data Solutions

Tap the power of Big Data with Microsoft technologies Big Data is here, and Microsoft's new Big Data platform is a valuable tool to help your company get the very most out of it. This timely book shows you how to use HDInsight along with HortonWorks Data Platform for Windows to store, manage, analyze, and share Big Data throughout the enterprise. Focusing primarily on Microsoft and HortonWorks technologies but also covering open source tools, Microsoft Big Data Solutions explains best practices, covers on-premises and cloud-based solutions, and features valuable case studies. Best of all, it helps you integrate these new solutions with technologies you already know, such as SQL Server and Hadoop. Walks you through how to integrate Big Data solutions in your company using Microsoft's HDInsight Server, HortonWorks Data Platform for Windows, and open source tools Explores both on-premises and cloud-based solutions Shows how to store, manage, analyze, and share Big Data through the enterprise Covers topics such as Microsoft's approach to Big Data, installing and configuring HortonWorks Data Platform for Windows, integrating Big Data with SQL Server, visualizing data with Microsoft and HortonWorks BI tools, and more Helps you build and execute a Big Data plan Includes contributions from the Microsoft and HortonWorks Big Data product teams If you need a detailed roadmap for designing and implementing a fully deployed Big Data solution, you'll want Microsoft Big Data Solutions.

Optimizing Hadoop for MapReduce

"Optimizing Hadoop for MapReduce" is your comprehensive guide to getting the best performance out of your Hadoop-based big data processing jobs. With a focus on practical application rather than theory, this book delves into the nuances of MapReduce job design, execution, and optimization to help you harness the full power of this technology. What this Book will help me do Understand the internal workings of Hadoop MapReduce and how it executes jobs. Master key optimization techniques to improve Hadoop job efficiency and resource use. Learn advanced MapReduce programming concepts to handle complex data processing tasks. Analyze and monitor Hadoop job performance using practical tools and methods. Integrate best practices for scaling production workloads in a Hadoop cluster. Author(s) Khaled Tannir is a seasoned software engineer and an expert in distributed systems, big data, and cloud technologies. He has decades of experience designing and optimizing systems for high-performance data processing. Khaled's hands-on approach to explaining technical concepts ensures readers gain practical, applied knowledge that can be immediately implemented in real-world projects. Who is it for? This book is intended for developers, data engineers, and system architects who work with or are planning to work with Apache Hadoop. Ideal readers should have basic familiarity with Hadoop concepts and a foundational understanding of distributed systems. This book will benefit professionals looking to optimize their Hadoop-based applications or understand advanced usage of MapReduce. Whether you're aiming to improve your existing knowledge or implement high-performance data solutions, this book is tailored for you.

Pro Microsoft HDInsight: Hadoop on Windows

Pro Microsoft HDInsight is a complete guide to deploying and using Apache Hadoop on the Microsoft Windows Azure Platforms. The information in this book enables you to process enormous volumes of structured as well as non-structured data easily using HDInsight, which is Microsoft's own distribution of Apache Hadoop. Furthermore, the blend of Infrastructure as a Service (IaaS) and Platform as a Service (PaaS) offerings available through Windows Azure lets you take advantage of Hadoop's processing power without the worry of creating, configuring, maintaining, or managing your own cluster. With the data explosion that is soon to happen, the open source Apache Hadoop Framework is gaining traction, and it benefits from a huge ecosystem that has risen around the core functionalities of the Hadoop distributed file system (HDFS™) and Hadoop Map Reduce. Pro Microsoft HDInsight equips you with the knowledge, confidence, and technique to configure and manage this ecosystem on Windows Azure. The book is an excellent choice for anyone aspiring to be a data scientist or data engineer, putting you a step ahead in the data mining field. Guides you through installation and configuration of an HDInsight cluster on Windows Azure Provides clear examples of configuring and executing Map Reduce jobs Helps you consume data and diagnose errors from the Windows Azure HDInsight Service What you'll learn Create and Manage HDInsight clusters on Windows Azure Understand the different HDInsight services and configuration files Develop and run Map Reduce jobs using .NET and PowerShell Consume data from client applications like Microsoft Excel and Power View Monitor job executions and logs Troubleshoot common problems Who this book is for Pro Microsoft HDInsight: Hadoop on Windows is an excellent choice for developers in the field of business intelligence and predictive analysis who want that extra edge in technology on Microsoft Windows and Windows Azure platforms. The book is for people who love to slice and dice data, and identify trends and patterns through analysis of data to help in creative and intelligent decision making.

2013 Data Science Salary Survey

What tools do successful data scientists and analysts use, and how much money do they make? We surveyed hundreds of attendees at the O'Reilly Strata Conferences in Santa Clara, California, and New York to find out. Findings from the survey include: Average number of tools and median income for all respondents Distribution of responses by age, location, industry, and position Detailed analysis of tools used by respondents and correlation to their salaries - including by tool clusters (Hadoop, SQL/Excel, and others) Correlation of specialized big data tools usage and salary What tools should you be learning and using? Read this valuable report to gain insight from these potentially career-changing findings.

IBM InfoSphere Streams: Accelerating Deployments with Analytic Accelerators

This IBM® Redbooks® publication describes visual development, visualization, adapters, analytics, and accelerators for IBM InfoSphere® Streams (V3), a key component of the IBM Big Data platform. Streams was designed to analyze data in motion, and can perform analysis on incredibly high volumes with high velocity, using a wide variety of analytic functions and data types. The Visual Development environment extends Streams Studio with drag-and-drop development, provides round tripping with existing text editors, and is ideal for rapid prototyping. Adapters facilitate getting data in and out of Streams, and V3 supports WebSphere MQ, Apache Hadoop Distributed File System, and IBM InfoSphere DataStage. Significant analytics include the native Streams Processing Language, SPSS Modeler analytics, Complex Event Processing, TimeSeries Toolkit for machine learning and predictive analytics, Geospatial Toolkit for location-based applications, and Annotation Query Language for natural language processing applications. Accelerators for Social Media Analysis and Telecommunications Event Data Analysis sample programs can be modified to build production level applications. Want to learn how to analyze high volumes of streaming data or implement systems requiring high performance across nodes in a cluster? Then this book is for you. Please note that the additional material referenced in the text is not available from IBM.

Data Just Right: Introduction to Large-Scale Data & Analytics

Making Big Data Work: Real-World Use Cases and Examples, Practical Code, Detailed Solutions Large-scale data analysis is now vitally important to virtually every business. Mobile and social technologies are generating massive datasets; distributed cloud computing offers the resources to store and analyze them; and professionals have radically new technologies at their command, including NoSQL databases. Until now, however, most books on “Big Data” have been little more than business polemics or product catalogs. Data Just Right is different: It’s a completely practical and indispensable guide for every Big Data decision-maker, implementer, and strategist. Michael Manoochehri, a former Google engineer and data hacker, writes for professionals who need practical solutions that can be implemented with limited resources and time. Drawing on his extensive experience, he helps you focus on building applications, rather than infrastructure, because that’s where you can derive the most value. Manoochehri shows how to address each of today’s key Big Data use cases in a cost-effective way by combining technologies in hybrid solutions. You’ll find expert approaches to managing massive datasets, visualizing data, building data pipelines and dashboards, choosing tools for statistical analysis, and more. Throughout, the author demonstrates techniques using many of today’s leading data analysis tools, including Hadoop, Hive, Shark, R, Apache Pig, Mahout, and Google BigQuery. Coverage includes Mastering the four guiding principles of Big Data success—and avoiding common pitfalls Emphasizing collaboration and avoiding problems with siloed data Hosting and sharing multi-terabyte datasets efficiently and economically “Building for infinity” to support rapid growth Developing a NoSQL Web app with Redis to collect crowd-sourced data Running distributed queries over massive datasets with Hadoop, Hive, and Shark Building a data dashboard with Google BigQuery Exploring large datasets with advanced visualization Implementing efficient pipelines for transforming immense amounts of data Automating complex processing with Apache Pig and the Cascading Java library Applying machine learning to classify, recommend, and predict incoming information Using R to perform statistical analysis on massive datasets Building highly efficient analytics workflows with Python and Pandas Establishing sensible purchasing strategies: when to build, buy, or outsource Previewing emerging trends and convergences in scalable data technologies and the evolving role of the Data Scientist
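
Among the workflows in the coverage list is building analytics pipelines with Python and Pandas on extracts of larger datasets. The sketch below is an illustrative example of that step, assuming a hypothetical CSV export (for instance, the output of a Hadoop or BigQuery job) with timestamp, country, and views columns.

```python
# Minimal pandas workflow sketch over a hypothetical extract file.
import pandas as pd

# Load a modest extract produced upstream (filename and columns are assumptions).
df = pd.read_csv("pageviews_extract.csv", parse_dates=["timestamp"])

# Summarize views per day and country, largest first.
summary = (df.assign(day=df["timestamp"].dt.date)
             .groupby(["day", "country"], as_index=False)["views"]
             .sum()
             .sort_values("views", ascending=False))

print(summary.head(10))
```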

Programming Elastic MapReduce

Although you don’t need a large computing infrastructure to process massive amounts of data with Apache Hadoop, it can still be difficult to get started. This practical guide shows you how to quickly launch data analysis projects in the cloud by using Amazon Elastic MapReduce (EMR), the hosted Hadoop framework in Amazon Web Services (AWS). Authors Kevin Schmidt and Christopher Phillips demonstrate best practices for using EMR and various AWS and Apache technologies by walking you through the construction of a sample MapReduce log analysis application. Using code samples and example configurations, you’ll learn how to assemble the building blocks necessary to solve your biggest data analysis problems. Get an overview of the AWS and Apache software tools used in large-scale data analysis Go through the process of executing a Job Flow with a simple log analyzer Discover useful MapReduce patterns for filtering and analyzing data sets Use Apache Hive and Pig instead of Java to build a MapReduce Job Flow Learn the basics for using Amazon EMR to run machine learning algorithms Develop a project cost model for using Amazon EMR and other AWS tools
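
The book constructs its sample log-analysis application through EMR job flows. As a hedged illustration of what launching such a job flow looks like today, here is a minimal boto3 sketch that starts a small cluster and runs a Hive script step; the region, instance types, IAM roles, and S3 paths are placeholders, and the book itself works with the EMR console, CLI, and the earlier boto library rather than boto3.

```python
# Minimal sketch: launch an EMR cluster with a Hive step via boto3
# (an assumption for illustration; names, roles, and paths are placeholders).
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="log-analysis-sketch",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate when the step finishes
    },
    Steps=[{
        "Name": "hive-log-report",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script",
                     "--args", "-f", "s3://my-bucket/scripts/report.q"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

print("cluster id:", response["JobFlowId"])
```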

Big Data Application Architecture Q&A: A Problem - Solution Approach

Big Data Application Architecture Pattern Recipes provides an insight into heterogeneous infrastructures, databases, and visualization and analytics tools used for realizing the architectures of big data solutions. Its problem-solution approach helps in selecting the right architecture to solve the problem at hand. In the process of reading through these problems, you will learn to harness the power of new big data opportunities which various enterprises use to attain real-time profits. Big Data Application Architecture Pattern Recipes answers one of the most critical questions of this time: 'How do you select the best end-to-end architecture to solve your big data problem?'. The book deals with various mission-critical problems encountered by solution architects, consultants, and software architects while dealing with the myriad options available for implementing a typical solution, trying to extract insight from huge volumes of data in real-time and across multiple relational and non-relational data types for clients from industries like retail, telecommunication, banking, and insurance. The patterns in this book provide the strong architectural foundation required to launch your next big data application. The architectures for realizing these opportunities are based on relatively less expensive and heterogeneous infrastructures compared to the traditional monolithic and hugely expensive options that exist currently. This book describes and evaluates the benefits of heterogeneity which brings with it multiple options of solving the same problem, evaluation of trade-offs and validation of 'fitness-for-purpose' of the solution. What you'll learn Major considerations in building a big data solution Big data application architectures problems for specific industries What are the components one needs to build an end-to-end big data solution? Does one really need a real-time big data solution or an off-line analytics batch solution? What are the operations and support architectures for a big data solution? What are the scalability considerations, and options for a Hadoop installation? Who this book is for CIOs, CTOs, enterprise architects, and software architects Consultants, solution architects, and information management (IM) analysts who want to architect a big data solution for their enterprise