talk-data.com talk-data.com

Topic

HDFS

Hadoop Distributed File System (HDFS)

distributed_storage big_data hadoop

46

tagged

Activity Trend

1 peak/qtr
2020-Q1 2026-Q1

Activities

Showing filtered results

Filtering by: O'Reilly Data Engineering Books ×
SQL Server Big Data Clusters: Data Virtualization, Data Lake, and AI Platform

Use this guide to one of SQL Server 2019’s most impactful features—Big Data Clusters. You will learn about data virtualization and data lakes for this complete artificial intelligence (AI) and machine learning (ML) platform within the SQL Server database engine. You will know how to use Big Data Clusters to combine large volumes of streaming data for analysis along with data stored in a traditional database. For example, you can stream large volumes of data from Apache Spark in real time while executing Transact-SQL queries to bring in relevant additional data from your corporate, SQL Server database. Filled with clear examples and use cases, this book provides everything necessary to get started working with Big Data Clusters in SQL Server 2019. You will learn about the architectural foundations that are made up from Kubernetes, Spark, HDFS, and SQL Server on Linux. You then are shown how to configure and deploy Big Data Clusters in on-premises environments or in the cloud. Next, you are taught about querying. You will learn to write queries in Transact-SQL—taking advantage of skills you have honed for years—and with those queries you will be able to examine and analyze data from a wide variety of sources such as Apache Spark. Through the theoretical foundation provided in this book and easy-to-follow example scripts and notebooks, you will be ready to use and unveil the full potential of SQL Server 2019: combining different types of data spread across widely disparate sources into a single view that is useful for business intelligence and machine learning analysis. What You Will Learn Install, manage, and troubleshoot Big Data Clusters in cloud or on-premise environments Analyze large volumes of data directly from SQL Server and/or Apache Spark Manage data stored in HDFS from SQL Server as if it wererelational data Implement advanced analytics solutions through machine learning and AI Expose different data sources as a single logical source using data virtualization Who This Book Is For Data engineers, data scientists, data architects, and database administrators who want to employ data virtualization and big data analytics in their environments

IBM Spectrum Scale CSI Driver for Container Persistent Storage

IBM® Spectrum Scale is a proven, scalable, high-performance data and file management solution. It provides world-class storage management with extreme scalability, flash accelerated performance, automatic policy-based storage that has tiers of flash through disk to tape. It also provides support for various protocols, such as NFS, SMB, Object, HDFS, and iSCSI. Containers can leverage the performance, information lifecycle management (ILM), scalability, and multisite data management to give the full flexibility on storage as they experience on the runtime. Container adoption is increasing in all industries, and they sprawl across multiple nodes on a cluster. The effective management of containers is necessary because their number will probably reach a far greater number than virtual machines today. Kubernetes is the standard container management platform currently being used. Data management is of ultimate importance, and often is forgotten because the first workloads containerized are ephemeral. For data management, many drivers with different specifications were available. A specification named Container Storage Interface (CSI) was created and is now adopted by all major Container Orchestrator Systems available. Although other container orchestration systems exist, Kubernetes became the standard framework for container management. It is a very flexible open source platform used as the base for most cloud providers and software companies' container orchestration systems. Red Hat OpenShift is one of the most reliable enterprise-grade container orchestration systems based on Kubernetes, designed and optimized to easily deploy web applications and services. OpenShift enables developers to focus on the code, while the platform takes care of all of the complex IT operations and processes. This IBM Redbooks® publication describes how the CSI Driver for IBM file storage enables IBM Spectrum® Scale to be used as persistent storage for stateful applications running in Kubernetes clusters. Through the Container Storage Interface Driver for IBM file storage, Kubernetes persistent volumes (PVs) can be provisioned from IBM Spectrum Scale. Therefore, the containers can be used with stateful microservices, such as database applications (MongoDB, PostgreSQL, and so on).

SQL Server Big Data Clusters: Early First Edition Based on Release Candidate 1

Get a head-start on learning one of SQL Server 2019’s latest and most impactful features—Big Data Clusters—that combines large volumes of non-relational data for analysis along with data stored relationally inside a SQL Server database. This book provides a first look at Big Data Clusters based upon SQL Server 2019 Release Candidate 1. Start now and get a jump on your competition in learning this important new feature. Big Data Clusters is a feature set covering data virtualization, distributed computing, and relational databases and provides a complete AI platform across the entire cluster environment. This book shows you how to deploy, manage, and use Big Data Clusters. For example, you will learn how to combine data stored on the HDFS file system together with data stored inside the SQL Server instances that make up the Big Data Cluster. Filled with clear examples and use cases, this book provides everything necessary to get started working with Big Data Clusters in SQL Server 2019 using Release Candidate 1. You will learn about the architectural foundations that are made up from Kubernetes, Spark, HDFS, and SQL Server on Linux. You then are shown how to configure and deploy Big Data Clusters in on-premises environments or in the cloud. Next, you are taught about querying. You will learn to write queries in Transact-SQL—taking advantage of skills you have honed for years—and with those queries you will be able to examine and analyze data from a wide variety of sources such as Apache Spark. Through the theoretical foundation provided in this book and easy-to-follow example scripts and notebooks, you will be ready to use and unveil the full potential of SQL Server 2019: combining different types of data spread across widely disparate sources into a single view that is useful for business intelligence and machine learning analysis. What You Will Learn Install, manage, and troubleshoot Big Data Clusters in cloud or on-premise environments Analyze large volumes of data directly from SQL Server and/or Apache Spark Manage data stored in HDFS from SQL Server as if it were relational data Implement advanced analytics solutions through machine learning and AI Expose different data sources as a single logical source using data virtualization Who This Book Is For For data engineers, data scientists, data architects, and database administrators who want to employ data virtualization and big data analytics in their environment

SQL Server 2019 Revealed: Including Big Data Clusters and Machine Learning

Get up to speed on the game-changing developments in SQL Server 2019. No longer just a database engine, SQL Server 2019 is cutting edge with support for machine learning (ML), big data analytics, Linux, containers, Kubernetes, Java, and data virtualization to Azure. This is not a book on traditional database administration for SQL Server. It focuses on all that is new for one of the most successful modernized data platforms in the industry. It is a book for data professionals who already know the fundamentals of SQL Server and want to up their game by building their skills in some of the hottest new areas in technology. SQL Server 2019 Revealed begins with a look at the project's team goal to integrate the world of big data with SQL Server into a major product release. The book then dives into the details of key new capabilities in SQL Server 2019 using a “learn by example” approach for Intelligent Performance, security, mission-criticalavailability, and features for the modern developer. Also covered are enhancements to SQL Server 2019 for Linux and gain a comprehensive look at SQL Server using containers and Kubernetes clusters. The book concludes by showing you how to virtualize your data access with Polybase to Oracle, MongoDB, Hadoop, and Azure, allowing you to reduce the need for expensive extract, transform, and load (ETL) applications. You will then learn how to take your knowledge of containers, Kubernetes, and Polybase to build a comprehensive solution called Big Data Clusters, which is a marquee feature of 2019. You will also learn how to gain access to Spark, SQL Server, and HDFS to build intelligence over your own data lake and deploy end-to-end machine learning applications. What You Will Learn Implement Big Data Clusters with SQL Server, Spark, and HDFS Create a Data Hub with connections to Oracle, Azure, Hadoop, and other sources Combine SQL and Spark to build a machine learning platform for AI applications Boost your performance with no application changes using Intelligent Performance Increase security of your SQL Server through Secure Enclaves and Data Classification Maximize database uptime through online indexing and Accelerated Database Recovery Build new modern applications with Graph, ML Services, and T-SQL Extensibility with Java Improve your ability to deploy SQL Server on Linux Gain in-depth knowledge to run SQL Server with containers and Kubernetes Know all the new database engine features for performance, usability, and diagnostics Use the latest tools and methods to migrate your database to SQL Server 2019 Apply your knowledge of SQL Server 2019 to Azure Who This Book Is For IT professionals and developers who understand the fundamentals of SQL Server and wish to focus on learning about the new, modern capabilities of SQL Server 2019. The book is for those who want to learn about SQL Server 2019 and the new Big Data Clusters and AI feature set, support for machine learning and Java, how to run SQL Server with containers and Kubernetes, and increased capabilities around Intelligent Performance, advanced security, and high availability.

IBM Spectrum Scale: Big Data and Analytics Solution Brief

This IBM® Redguide™ publication describes big data and analytics deployments that are built on IBM Spectrum Scale™. IBM Spectrum Scale is a proven enterprise-level distributed file system that is a high-performance and cost-effective alternative to Hadoop Distributed File System (HDFS) for Hadoop analytics services. IBM Spectrum Scale includes NFS, SMB, and Object services and meets the performance that is required by many industry workloads, such as technical computing, big data, analytics, and content management. IBM Spectrum Scale provides world-class, web-based storage management with extreme scalability, flash accelerated performance, and automatic policy-based storage tiering from flash through disk to the cloud, which reduces storage costs up to 90% while improving security and management efficiency in cloud, big data, and analytics environments. This Redguide publication is intended for technical professionals (analytics consultants, technical support staff, IT Architects, and IT Specialists) who are responsible for providing Hadoop analytics services and are interested in learning about the benefits of the use of IBM Spectrum Scale as an alternative to HDFS.

Hands-On Big Data Analytics with PySpark

Dive into the exciting world of big data analytics with 'Hands-On Big Data Analytics with PySpark'. This practical guide offers you the tools and knowledge to tackle massive datasets using PySpark. By exploring real-world examples, you'll learn to unleash the power of distributed systems to analyze and manipulate data at scale. What this Book will help me do Master using PySpark to handle large and complex datasets efficiently and effectively. Develop skills to optimize Spark programs using best practices like reducing shuffle operations. Learn to set up a PySpark environment, process data from platforms like HDFS, Hive, and S3. Enhance your data analytics capabilities by implementing powerful SQL queries and data visualizations. Understand testing and debugging techniques to build reliable, production-quality data pipelines. Author(s) Authored by Rudy Lai and Bartłomiej Potaczek, both seasoned data engineers and authors in the big data field. Rudy and Bartłomiej bring their extensive experience working with distributed systems and scalable data architectures into this book. Their approach is hands-on, focusing on real-world applications and best practices. Who is it for? This book is tailored for data scientists, engineers, and developers eager to advance their big data analytics capabilities. Whether you're new to big data or experienced with other analytics frameworks, this book will equip you with practical knowledge to utilize PySpark for scalable data solutions.

Integration of IBM Aspera Sync with IBM Spectrum Scale: Protecting and Sharing Files Globally

Economic globalization requires data to be available globally. With most data stored in file systems, solutions to make this data globally available become more important. Files that are in file systems can be protected or shared by replicating these files to another file system that is in a remote location. The remote location might be just around the corner or in a different country. Therefore, the techniques that are used to protect and share files must account for long distances and slow and unreliable wide area network (WAN) connections. IBM® Spectrum Scale is a scalable clustered file system that can be used to store all kinds of unstructured data. It provides open data access by way of Network File System (NFS); Server Message Block (SMB); POSIX Object Storage APIs, such as S3 and OpenStack Swift; and the Hadoop Distributed File System (HDFS) for accessing and sharing data. The IBM Aspera® file transfer solution (IBM Aspera Sync) provides predictable and reliable data transfer across large distance for small and large files. The combination of both can be used for global sharing and protection of data. This IBM Redpaper™ publication describes how IBM Aspera Sync can be used to protect and share data that is stored in IBM Spectrum™ Scale file systems across large distances of several hundred to thousands of miles. We also explain the integration of IBM Aspera Sync with IBM Spectrum Scale™ and differentiate it from solutions that are built into IBM Spectrum Scale for protection and sharing. We also describe different use cases for IBM Aspera Sync with IBM Spectrum Scale.

Mastering Hadoop 3

"Mastering Hadoop 3" is your in-depth guide to understanding and mastering the advanced features of the Hadoop ecosystem. With a focus on distributed computing and data processing, this book covers essential tools such as YARN, MapReduce, and Apache Spark to help you build scalable, efficient data pipelines. What this Book will help me do Gain a comprehensive understanding of Hadoop Distributed File System (HDFS) and YARN for effective resource management. Master data processing with MapReduce and learn to integrate with real-time processing engines like Spark and Flink. Develop and secure enterprise-grade Hadoop-based data pipelines by implementing robust security and governance measures. Explore techniques for batch data processing, data modeling, and designing applications tailored for Hadoop environments. Understand best practices for optimizing and troubleshooting Hadoop clusters for enhanced performance and reliability. Author(s) The authors, including None Wong, None Singh, and None Kumar, bring together years of experience in big data engineering, distributed systems, and enterprise application development. They aim to provide a clear pathway to mastering Hadoop ecosystem tools. Who is it for? This book is ideal for budding big data professionals who have some familiarity with Java and basic Hadoop concepts and wish to elevate their expertise. If you're a Hadoop career practitioner keen to expand your understanding of the ecosystem's advanced capabilities or a professional looking to implement Hadoop in organizational workflows, this book is well-suited for you.

Apache Hadoop 3 Quick Start Guide

Dive into the world of distributed data processing with the 'Apache Hadoop 3 Quick Start Guide.' This comprehensive resource equips you with the knowledge needed to handle large datasets effectively using Apache Hadoop. Learn how to set up and configure Hadoop, work with its core components, and explore its powerful ecosystem tools. What this Book will help me do Understand the fundamental concepts of Apache Hadoop, including HDFS, MapReduce, and YARN, and use them to store and process large datasets. Set up and configure Hadoop 3 in both developer and production environments to suit various deployment needs. Gain hands-on experience with Hadoop ecosystem tools like Hive, Kafka, and Spark to enhance your big data processing capabilities. Learn to manage, monitor, and troubleshoot Hadoop clusters efficiently to ensure smooth operations. Analyze real-time streaming data with tools like Apache Storm and perform advanced data analytics using Apache Spark. Author(s) The author of this guide, Vijay Karambelkar, brings years of experience working with big data technologies and Apache Hadoop in real-world applications. With a passion for teaching and simplifying complex topics, Vijay has compiled his expertise to help learners confidently approach Hadoop 3. His detailed, example-driven approach makes this book a practical resource for aspiring data professionals. Who is it for? This book is ideal for software developers, data engineers, and IT professionals who aspire to dive into the field of big data. If you're new to Apache Hadoop or looking to upgrade your skills to include version 3, this guide is for you. A basic understanding of Java programming is recommended to make the most of the topics covered. Embark on this journey to enhance your career in data-intensive industries.

Next-Generation Big Data: A Practical Guide to Apache Kudu, Impala, and Spark

Utilize this practical and easy-to-follow guide to modernize traditional enterprise data warehouse and business intelligence environments with next-generation big data technologies. Next-Generation Big Data takes a holistic approach, covering the most important aspects of modern enterprise big data. The book covers not only the main technology stack but also the next-generation tools and applications used for big data warehousing, data warehouse optimization, real-time and batch data ingestion and processing, real-time data visualization, big data governance, data wrangling, big data cloud deployments, and distributed in-memory big data computing. Finally, the book has an extensive and detailed coverage of big data case studies from Navistar, Cerner, British Telecom, Shopzilla, Thomson Reuters, and Mastercard. What You’ll Learn Install Apache Kudu, Impala, and Spark to modernize enterprise data warehouse and business intelligence environments, complete with real-world, easy-to-follow examples, and practical advice Integrate HBase, Solr, Oracle, SQL Server, MySQL, Flume, Kafka, HDFS, and Amazon S3 with Apache Kudu, Impala, and Spark Use StreamSets, Talend, Pentaho, and CDAP for real-time and batch data ingestion and processing Utilize Trifacta, Alteryx, and Datameer for data wrangling and interactive data processing Turbocharge Spark with Alluxio, a distributed in-memory storage platform Deploy big data in the cloud using Cloudera Director Perform real-time data visualization and time series analysis using Zoomdata, Apache Kudu, Impala, and Spark Understand enterprise big data topics such as big data governance, metadata management, data lineage, impact analysis, and policy enforcement, and how to use Cloudera Navigator to perform common data governance tasks Implement big data use cases such as big data warehousing, data warehouse optimization, Internet of Things, real-time data ingestion and analytics, complex event processing, and scalable predictive modeling Study real-world big data case studies from innovative companies, including Navistar, Cerner, British Telecom, Shopzilla, Thomson Reuters, and Mastercard Who This Book Is For BI and big data warehouse professionals interested in gaining practical and real-world insight into next-generation big data processing and analytics using Apache Kudu, Impala, and Spark; and those who want to learn more about other advanced enterprise topics

Big Data Analytics with Hadoop 3

Big Data Analytics with Hadoop 3 is your comprehensive guide to understanding and leveraging the power of Apache Hadoop for large-scale data processing and analytics. Through practical examples, it introduces the tools and techniques necessary to integrate Hadoop with other popular frameworks, enabling efficient data handling, processing, and visualization. What this Book will help me do Understand the foundational components and features of Apache Hadoop 3 such as HDFS, YARN, and MapReduce. Gain the ability to integrate Hadoop with programming languages like Python and R for data analysis. Learn the skills to utilize tools such as Apache Spark and Apache Flink for real-time data analytics within the Hadoop ecosystem. Develop expertise in setting up a Hadoop cluster and performing analytics in cloud environments such as AWS. Master the process of building practical big data analytics pipelines for end-to-end data processing. Author(s) Sridhar Alla is a seasoned big data professional with extensive industry experience in building and deploying scalable big data analytics solutions. Known for his expertise in Hadoop and related ecosystems, Sridhar combines technical depth with clear communication in his writing, providing practical insights and hands-on knowledge. Who is it for? This book is tailored for data professionals, software engineers, and data scientists looking to expand their expertise in big data analytics using Hadoop 3. Whether you're an experienced developer or new to the big data ecosystem, this book provides the step-by-step guidance and practical examples needed to advance your skills and achieve your analytical goals.

IBM Spectrum Scale Best Practices for Genomics Medicine Workloads

Advancing the science of medicine by targeting a disease more precisely with treatment specific to each patient relies on access to that patient's genomics information and the ability to process massive amounts of genomics data quickly. Although genomics data is becoming a critical source for precision medicine, it is expected to create an expanding data ecosystem. Therefore, hospitals, genome centers, medical research centers, and other clinical institutes need to explore new methods of storing, accessing, securing, managing, sharing, and analyzing significant amounts of data. Healthcare and life sciences organizations that are running data-intensive genomics workloads on an IT infrastructure that lacks scalability, flexibility, performance, management, and cognitive capabilities also need to modernize and transform their infrastructure to support current and future requirements. IBM® offers an integrated solution for genomics that is based on composable infrastructure. This solution enables administrators to build an IT environment in a way that disaggregates the underlying compute, storage, and network resources. Such a composable building block based solution for genomics addresses the most complex data management aspect and allows organizations to store, access, manage, and share huge volumes of genome sequencing data. IBM Spectrum™ Scale is software-defined storage that is used to manage storage and provide massive scale, a global namespace, and high-performance data access with many enterprise features. IBM Spectrum Scale™ is used in clustered environments, provides unified access to data via file protocols (POSIX, NFS, and SMB) and object protocols (Swift and S3), and supports analytic workloads via HDFS connectors. Deploying IBM Spectrum Scale and IBM Elastic Storage™ Server (IBM ESS) as a composable storage building block in a Genomics Next Generation Sequencing deployment offers key benefits of performance, scalability, analytics, and collaboration via multiple protocols. This IBM Redpaper™ publication describes a composable solution with detailed architecture definitions for storage, compute, and networking services for genomics next generation sequencing that enable solution architects to benefit from tried-and-tested deployments, to quickly plan and design an end-to-end infrastructure deployment. The preferred practices and fully tested recommendations described in this paper are derived from running GATK Best Practices work flow from the Broad Institute. The scenarios provide all that is required, including ready-to-use configuration and tuning templates for the different building blocks (compute, network, and storage), that can enable simpler deployment and that can enlarge the level of assurance over the performance for genomics workloads. The solution is designed to be elastic in nature, and the disaggregation of the building blocks allows IT administrators to easily and optimally configure the solution with maximum flexibility. The intended audience for this paper is technical decision makers, IT architects, deployment engineers, and administrators who are working in the healthcare domain and who are working on genomics-based workloads.

Hadoop 2.x Administration Cookbook

Gain mastery over managing and maintaining large Apache Hadoop clusters with the Hadoop 2.x Administration Cookbook. This book provides practical step-by-step recipes guiding you to efficiently set up, optimize, and troubleshoot Hadoop clusters, ensuring high availability, security, and optimal performance in your data operations. What this Book will help me do Successfully set up and deploy an operational Hadoop 2.x cluster suitable for large-scale data operations. Effectively monitor and maintain Hadoop's HDFS, YARN, and MapReduce systems for optimized performance. Plan, configure, and enhance cluster availability using Zookeeper and Journal Node strategies. Develop workflows and manage data ingestion processes with tools like Flume and Oozie. Secure, troubleshoot, and optimize Hadoop environments to meet enterprise and operational standards. Author(s) Aman Singh is an experienced Hadoop administrator with years of hands-on experience managing robust and efficient Hadoop clusters. Aman has a deep understanding of the practical challenges faced in this field and a talent for breaking down complex topics into actionable steps. Through clear, problem-oriented language, Aman helps readers achieve fluency in Hadoop administration. Who is it for? This book is ideal for system administrators or IT professionals who have a foundational understanding of Hadoop and aim to strengthen their administrative skills. It is especially beneficial for experienced Hadoop administrators looking for a quick and practical reference guide to master cluster management. Whether you're working in a large enterprise or exploring Hadoop ecosystems for personal development, you'll find this book invaluable.

Sams Teach Yourself Hadoop in 24 Hours

Apache Hadoop is the technology at the heart of the Big Data revolution, and Hadoop skills are in enormous demand. Now, in just 24 lessons of one hour or less, you can learn all the skills and techniques you'll need to deploy each key component of a Hadoop platform in your local environment or in the cloud, building a fully functional Hadoop cluster and using it with real programs and datasets. Each short, easy lesson builds on all that's come before, helping you master all of Hadoop's essentials, and extend it to meet your unique challenges. Apache Hadoop in 24 Hours, Sams Teach Yourself covers all this, and much more: Understanding Hadoop and the Hadoop Distributed File System (HDFS) Importing data into Hadoop, and process it there Mastering basic MapReduce Java programming, and using advanced MapReduce API concepts Making the most of Apache Pig and Apache Hive Implementing and administering YARN Taking advantage of the full Hadoop ecosystem Managing Hadoop clusters with Apache Ambari Working with the Hadoop User Environment (HUE) Scaling, securing, and troubleshooting Hadoop environments Integrating Hadoop into the enterprise Deploying Hadoop in the cloud Getting started with Apache Spark Step-by-step instructions walk you through common questions, issues, and tasks; Q-and-As, Quizzes, and Exercises build and test your knowledge; "Did You Know?" tips offer insider advice and shortcuts; and "Watch Out!" alerts help you avoid pitfalls. By the time you're finished, you'll be comfortable using Apache Hadoop to solve a wide spectrum of Big Data problems.

Practical Data Science with Hadoop® and Spark: Designing and Building Effective Analytics at Scale

The Complete Guide to Data Science with Hadoop—For Technical Professionals, Businesspeople, and Students Demand is soaring for professionals who can solve real data science problems with Hadoop and Spark. is your complete guide to doing just that. Drawing on immense experience with Hadoop and big data, three leading experts bring together everything you need: high-level concepts, deep-dive techniques, real-world use cases, practical applications, and hands-on tutorials. Practical Data Science with Hadoop® and Spark The authors introduce the essentials of data science and the modern Hadoop ecosystem, explaining how Hadoop and Spark have evolved into an effective platform for solving data science problems at scale. In addition to comprehensive application coverage, the authors also provide useful guidance on the important steps of data ingestion, data munging, and visualization. Once the groundwork is in place, the authors focus on specific applications, including machine learning, predictive modeling for sentiment analysis, clustering for document analysis, anomaly detection, and natural language processing (NLP). This guide provides a strong technical foundation for those who want to do practical data science, and also presents business-driven guidance on how to apply Hadoop and Spark to optimize ROI of data science initiatives. Learn What data science is, how it has evolved, and how to plan a data science career How data volume, variety, and velocity shape data science use cases Hadoop and its ecosystem, including HDFS, MapReduce, YARN, and Spark Data importation with Hive and Spark Data quality, preprocessing, preparation, and modeling Visualization: surfacing insights from huge data sets Machine learning: classification, regression, clustering, and anomaly detection Algorithms and Hadoop tools for predictive modeling Cluster analysis and similarity functions Large-scale anomaly detection NLP: applying data science to human language

Expert Hadoop® Administration

The Comprehensive, Up-to-Date Apache Hadoop Administration Handbook and Reference “Sam Alapati has worked with production Hadoop clusters for six years. His unique depth of experience has enabled him to write the go-to resource for all administrators looking to spec, size, expand, and secure production Hadoop clusters of any size.” –Paul Dix, Series Editor In leading Hadoop administrator Sam R. Alapati brings together authoritative knowledge for creating, configuring, securing, managing, and optimizing production Hadoop clusters in any environment. Drawing on his experience with large-scale Hadoop administration, Alapati integrates action-oriented advice with carefully researched explanations of both problems and solutions. He covers an unmatched range of topics and offers an unparalleled collection of realistic examples. Expert Hadoop® Administration, Alapati demystifies complex Hadoop environments, helping you understand exactly what happens behind the scenes when you administer your cluster. You’ll gain unprecedented insight as you walk through building clusters from scratch and configuring high availability, performance, security, encryption, and other key attributes. The high-value administration skills you learn here will be indispensable no matter what Hadoop distribution you use or what Hadoop applications you run. Understand Hadoop’s architecture from an administrator’s standpoint Create simple and fully distributed clusters Run MapReduce and Spark applications in a Hadoop cluster Manage and protect Hadoop data and high availability Work with HDFS commands, file permissions, and storage management Move data, and use YARN to allocate resources and schedule jobs Manage job workflows with Oozie and Hue Secure, monitor, log, and optimize Hadoop Benchmark and troubleshoot Hadoop

Programming Pig, 2nd Edition

For many organizations, Hadoop is the first step for dealing with massive amounts of data. The next step? Processing and analyzing datasets with the Apache Pig scripting platform. With Pig, you can batch-process data without having to create a full-fledged application, making it easy to experiment with new datasets. Updated with use cases and programming examples, this second edition is the ideal learning tool for new and experienced users alike. You’ll find comprehensive coverage on key features such as the Pig Latin scripting language and the Grunt shell. When you need to analyze terabytes of data, this book shows you how to do it efficiently with Pig. Delve into Pig’s data model, including scalar and complex data types Write Pig Latin scripts to sort, group, join, project, and filter your data Use Grunt to work with the Hadoop Distributed File System (HDFS) Build complex data processing pipelines with Pig’s macros and modularity features Embed Pig Latin in Python for iterative processing and other advanced tasks Use Pig with Apache Tez to build high-performance batch and interactive data processing applications Create your own load and store functions to handle data formats and storage mechanisms

Practical Hadoop Ecosystem: A Definitive Guide to Hadoop-Related Frameworks and Tools

Learn how to use the Apache Hadoop projects, including MapReduce, HDFS, Apache Hive, Apache HBase, Apache Kafka, Apache Mahout, and Apache Solr. From setting up the environment to running sample applications each chapter in this book is a practical tutorial on using an Apache Hadoop ecosystem project. While several books on Apache Hadoop are available, most are based on the main projects, MapReduce and HDFS, and none discusses the other Apache Hadoop ecosystem projects and how they all work together as a cohesive big data development platform. What You Will Learn: Set up the environment in Linux for Hadoop projects using Cloudera Hadoop Distribution CDH 5 Run a MapReduce job Store data with Apache Hive, and Apache HBase Index data in HDFS with Apache Solr Develop a Kafka messaging system Stream Logs to HDFS with Apache Flume Transfer data from MySQL database to Hive, HDFS, and HBase with Sqoop Create a Hive table over Apache Solr Develop a Mahout User Recommender System Who This Book Is For: Apache Hadoop developers. Pre-requisite knowledge of Linux and some knowledge of Hadoop is required.

Hadoop: Data Processing and Modelling

Unlock the power of your data with Hadoop 2.X ecosystem and its data warehousing techniques across large data sets About This Book Conquer the mountain of data using Hadoop 2.X tools The authors succeed in creating a context for Hadoop and its ecosystem Hands-on examples and recipes giving the bigger picture and helping you to master Hadoop 2.X data processing platforms Overcome the challenging data processing problems using this exhaustive course with Hadoop 2.X Who This Book Is For This course is for Java developers, who know scripting, wanting a career shift to Hadoop - Big Data segment of the IT industry. So if you are a novice in Hadoop or an expert, this book will make you reach the most advanced level in Hadoop 2.X. What You Will Learn Best practices for setup and configuration of Hadoop clusters, tailoring the system to the problem at hand Integration with relational databases, using Hive for SQL queries and Sqoop for data transfer Installing and maintaining Hadoop 2.X cluster and its ecosystem Advanced Data Analysis using the Hive, Pig, and Map Reduce programs Machine learning principles with libraries such as Mahout and Batch and Stream data processing using Apache Spark Understand the changes involved in the process in the move from Hadoop 1.0 to Hadoop 2.0 Dive into YARN and Storm and use YARN to integrate Storm with Hadoop Deploy Hadoop on Amazon Elastic MapReduce and Discover HDFS replacements and learn about HDFS Federation In Detail As Marc Andreessen has said "Data is eating the world," which can be witnessed today being the age of Big Data, businesses are producing data in huge volumes every day and this rise in tide of data need to be organized and analyzed in a more secured way. With proper and effective use of Hadoop, you can build new-improved models, and based on that you will be able to make the right decisions. The first module, Hadoop beginners Guide will walk you through on understanding Hadoop with very detailed instructions and how to go about using it. Commands are explained using sections called "What just happened" for more clarity and understanding. The second module, Hadoop Real World Solutions Cookbook, 2nd edition, is an essential tutorial to effectively implement a big data warehouse in your business, where you get detailed practices on the latest technologies such as YARN and Spark. Big data has become a key basis of competition and the new waves of productivity growth. Hence, once you get familiar with the basics and implement the end-to-end big data use cases, you will start exploring the third module, Mastering Hadoop. So, now the question is if you need to broaden your Hadoop skill set to the next level after you nail the basics and the advance concepts, then this course is indispensable. When you finish this course, you will be able to tackle the real-world scenarios and become a big data expert using the tools and the knowledge based on the various step-by-step tutorials and recipes. Style and approach This course has covered everything right from the basic concepts of Hadoop till you master the advance mechanisms to become a big data expert. The goal here is to help you learn the basic essentials using the step-by-step tutorials and from there moving toward the recipes with various real-world solutions for you. It covers all the important aspects of Hadoop from system designing and configuring Hadoop, machine learning principles with various libraries with chapters illustrated with code fragments and schematic diagrams. This is a compendious course to explore Hadoop from the basics to the most advanced techniques available in Hadoop 2.X.

Practical Hadoop Migration: How to Integrate Your RDBMS with the Hadoop Ecosystem and Re-Architect Relational Applications to NoSQL

Re-architect relational applications to NoSQL, integrate relational database management systems with the Hadoop ecosystem, and transform and migrate relational data to and from Hadoop components. This book covers the best-practice design approaches to re-architecting your relational applications and transforming your relational data to optimize concurrency, security, denormalization, and performance. Winner of IBM's 2012 Gerstner Award for his implementation of big data and data warehouse initiatives and author of Practical Hadoop Security, author Bhushan Lakhe walks you through the entire transition process. First, he lays out the criteria for deciding what blend of re-architecting, migration, and integration between RDBMS and HDFS best meets your transition objectives. Then he demonstrates how to design your transition model. Lakhe proceeds to cover the selection criteria for ETL tools, the implementation steps for migration with SQOOP- and Flume-based data transfers, and transition optimization techniques for tuning partitions, scheduling aggregations, and redesigning ETL. Finally, he assesses the pros and cons of data lakes and Lambda architecture as integrative solutions and illustrates their implementation with real-world case studies. Hadoop/NoSQL solutions do not offer by default certain relational technology features such as role-based access control, locking for concurrent updates, and various tools for measuring and enhancing performance. Practical Hadoop Migration shows how to use open-source tools to emulate such relational functionalities in Hadoop ecosystem components. What You'll Learn Decide whether you should migrate your relational applications to big data technologies or integrate them Transition your relational applications to Hadoop/NoSQL platforms in terms of logical design and physical implementation Discover RDBMS-to-HDFS integration, data transformation, and optimization techniques Consider when to use Lambda architecture and data lake solutions Select and implement Hadoop-based components and applications to speed transition, optimize integrated performance, and emulate relational functionalities Who This Book Is For Database developers, database administrators, enterprise architects, Hadoop/NoSQL developers, and IT leaders. Its secondary readership is project and program managers and advanced students of database and management information systems.