O'Reilly Data Engineering Books

Big Data Analytics

2016-09-28 O'Reilly Amazon

book

Aravind Nallan , Venkat Ankam

data data-engineering apache-spark AI/ML Analytics Big Data

Dive into the world of big data with "Big Data Analytics: Real Time Analytics Using Apache Spark and Hadoop." This comprehensive guide introduces readers to the fundamentals and practical applications of Apache Spark and Hadoop, covering essential topics like Spark SQL, DataFrames, structured streaming, and more. Learn how to harness the power of real-time analytics and big data tools effectively.
What this Book will help me do
• Master the key components of the Apache Spark and Hadoop ecosystems, including Spark SQL and MapReduce.
• Gain an understanding of DataFrames, Datasets, and structured streaming for seamless data handling.
• Develop skills in real-time analytics using Spark Streaming and technologies like Kafka and HBase.
• Learn to implement machine learning models using Spark's MLlib and ML Pipelines.
• Explore graph analytics with GraphX and leverage data visualization tools like Jupyter and Zeppelin.
Author(s)
Venkat Ankam, an expert in big data technologies, has years of experience working with Apache Hadoop and Spark. As an educator and technical consultant, Venkat has enabled numerous professionals to gain critical insights into big data ecosystems. With a pragmatic approach, his writings aim to guide readers through complex systems in a structured and easy-to-follow manner.
Who is it for?
This book is perfect for data analysts, data scientists, software architects, and programmers aiming to expand their knowledge of big data analytics. Readers should ideally have a basic programming background in languages like Python, Scala, R, or SQL. Prior hands-on experience with big data environments is not necessary but is an added advantage. This guide caters to a range of skill levels, from beginners to intermediate learners.

Hadoop: Data Processing and Modelling

2016-08-31 O'Reilly Amazon

book

Sandeep Karanth , Tanmay Deshpande , Garry Turkington

data data-engineering Hadoop AI/ML Big Data DWH

Unlock the power of your data with the Hadoop 2.X ecosystem and its data warehousing techniques across large data sets.
About This Book
• Conquer the mountain of data using Hadoop 2.X tools
• The authors succeed in creating a context for Hadoop and its ecosystem
• Hands-on examples and recipes give the bigger picture and help you master Hadoop 2.X data processing platforms
• Overcome challenging data processing problems using this exhaustive course with Hadoop 2.X
Who This Book Is For
This course is for Java developers who know scripting and want a career shift to the Hadoop and Big Data segment of the IT industry. Whether you are a novice in Hadoop or an expert, this book will take you to the most advanced level in Hadoop 2.X.
What You Will Learn
• Best practices for setup and configuration of Hadoop clusters, tailoring the system to the problem at hand
• Integration with relational databases, using Hive for SQL queries and Sqoop for data transfer
• Installing and maintaining a Hadoop 2.X cluster and its ecosystem
• Advanced data analysis using Hive, Pig, and MapReduce programs
• Machine learning principles with libraries such as Mahout, and batch and stream data processing using Apache Spark
• Understand the changes involved in the move from Hadoop 1.0 to Hadoop 2.0
• Dive into YARN and Storm and use YARN to integrate Storm with Hadoop
• Deploy Hadoop on Amazon Elastic MapReduce, discover HDFS replacements, and learn about HDFS Federation
In Detail
As Marc Andreessen has said, "Data is eating the world." This can be witnessed today: in the age of Big Data, businesses produce data in huge volumes every day, and this rising tide of data needs to be organized and analyzed in a more secure way. With proper and effective use of Hadoop, you can build new and improved models and, based on them, make the right decisions.
The first module, Hadoop Beginner's Guide, walks you through understanding Hadoop and using it, with very detailed instructions. Commands are explained in sections called "What just happened" for more clarity and understanding. The second module, Hadoop Real World Solutions Cookbook, 2nd edition, is an essential tutorial for effectively implementing a big data warehouse in your business, with detailed practices on the latest technologies such as YARN and Spark. Big data has become a key basis of competition and of new waves of productivity growth. Once you are familiar with the basics and have implemented end-to-end big data use cases, you will start exploring the third module, Mastering Hadoop. If you need to take your Hadoop skill set to the next level after you nail the basics and the advanced concepts, this course is indispensable. When you finish this course, you will be able to tackle real-world scenarios and become a big data expert using the tools and knowledge from the various step-by-step tutorials and recipes.
Style and approach
This course covers everything from the basic concepts of Hadoop to the advanced mechanisms you need to master to become a big data expert. The goal is to help you learn the essentials through step-by-step tutorials and then move toward recipes with various real-world solutions. It covers all the important aspects of Hadoop, from system design and configuration to machine learning principles with various libraries, with chapters illustrated with code fragments and schematic diagrams. This is a compendious course to explore Hadoop from the basics to the most advanced techniques available in Hadoop 2.X.

Practical Hive: A Guide to Hadoop's Data Warehouse System

2016-08-27 O'Reilly Amazon

book

Scott Shaw , David Kjerrumgaard , Andreas François Vermeulen , Ankur Gupta

data data-engineering Hadoop Big Data DWH Hive

Dive into the world of SQL on Hadoop and get the most out of your Hive data warehouses. This book is your go-to resource for using Hive: authors Scott Shaw, Ankur Gupta, David Kjerrumgaard, and Andreas François Vermeulen take you through learning HiveQL, the SQL-like language specific to Hive, to analyze, export, and massage the data stored across your Hadoop environment. From deploying Hive on your hardware or virtual machine and setting up its initial configuration to learning how Hive interacts with Hadoop, MapReduce, Tez and other big data technologies, Practical Hive gives you a detailed treatment of the software. In addition, this book discusses the value of open source software, Hive performance tuning, and how to leverage semi-structured and unstructured data.
What You Will Learn
• Install and configure Hive for new and existing datasets
• Perform DDL operations
• Execute efficient DML operations
• Use tables, partitions, buckets, and user-defined functions
• Discover performance tuning tips and Hive best practices
Who This Book Is For
Developers, companies, and professionals who deal with large amounts of data and could use software that can efficiently manage large volumes of input. It is assumed that readers have the ability to work with SQL.

Big Data War

2016-08-26 O'Reilly Amazon

book

Patrick H. Park

data data-engineering Analytics Big Data Data Analytics

This book focuses on why data analytics fails in business. Instead of abstract criticism of the utility of data analytics, it provides an objective analysis of the phenomenon and its root causes. The author then explains in detail how companies can survive and win the global big data competition, based on actual company cases. Having established an execution- and performance-oriented big data methodology over more than 10 years in the field as an authority on big data strategy, the author identifies core principles of data analytics through case analysis of the failures and successes of actual companies. He also shares the principles by which innovative global companies became successful through their use of big data. This book condenses the author’s know-how from direct and indirect experience into a quintessential guide to big data analytics. How do we survive this big data war, in which Facebook in social networking services (SNS), Amazon in e-commerce, and Google in search expand their platforms into other areas from their respective distinct markets? The answer can be found in this book.

IBM Data Engine for Hadoop and Spark

2016-08-24 O'Reilly Amazon

book

Dino Quintero , Reinaldo Tetsuo Katahira , Aditya Gandakusuma Sutandyo , Nicolas Joly , Luis Bolinches

data data-engineering IBM Analytics Big Data Hadoop

This IBM® Redbooks® publication helps the technical community take advantage of the resilience, scalability, and performance of the IBM Power Systems™ platform to implement or integrate an IBM Data Engine for Hadoop and Spark solution, so that analytics solutions can access, manage, and analyze data sets to improve business outcomes. This book documents topics that demonstrate and take advantage of the analytics strengths of the IBM POWER8® platform, the IBM analytics software portfolio, and selected third-party tools to help meet customers' data analytics workload requirements. This book describes how to plan, prepare, install, integrate, manage, and use the IBM Data Engine for Hadoop and Spark solution to run analytic workloads on IBM POWER8. In addition, this publication delivers documentation that complements available IBM analytics solutions to help with your data analytics needs. This publication strengthens the position of IBM analytics and big data solutions with a well-defined and documented deployment model within an IBM POWER8 virtualized environment so that customers have a planned foundation for security, scaling, capacity, resilience, and optimization for analytics workloads. This book is targeted at technical professionals (analytics consultants, technical support staff, IT Architects, and IT Specialists) who are responsible for delivering analytics solutions and support on IBM Power Systems.

Sams Teach Yourself Apache Spark™ in 24 Hours

2016-08-17 O'Reilly Amazon

book

Jeffrey Aven

data data-engineering apache-spark AI/ML API Big Data

Apache Spark is a fast, scalable, and flexible open source distributed processing engine for big data systems and is one of the most active open source big data projects to date. In just 24 lessons of one hour or less, Sams Teach Yourself Apache Spark in 24 Hours helps you build practical Big Data solutions that leverage Spark’s amazing speed, scalability, simplicity, and versatility. This book’s straightforward, step-by-step approach shows you how to deploy, program, optimize, manage, integrate, and extend Spark, now and for years to come. You’ll discover how to create powerful solutions encompassing cloud computing, real-time stream processing, machine learning, and more. Every lesson builds on what you’ve already learned, giving you a rock-solid foundation for real-world success. Whether you are a data analyst, data engineer, data scientist, or data steward, learning Spark will help you to advance your career or embark on a new career in the booming area of Big Data.
Learn how to
• Discover what Apache Spark does and how it fits into the Big Data landscape
• Deploy and run Spark locally or in the cloud
• Interact with Spark from the shell
• Make the most of the Spark Cluster Architecture
• Develop Spark applications with Scala and functional Python
• Program with the Spark API, including transformations and actions
• Apply practical data engineering/analysis approaches designed for Spark
• Use Resilient Distributed Datasets (RDDs) for caching, persistence, and output
• Optimize Spark solution performance
• Use Spark with SQL (via Spark SQL) and with NoSQL (via Cassandra)
• Leverage cutting-edge functional programming techniques
• Extend Spark with streaming, R, and Sparkling Water
• Start building Spark-based machine learning and graph-processing applications
• Explore advanced messaging technologies, including Kafka
• Preview and prepare for Spark’s next generation of innovations
Instructions walk you through common questions, issues, and tasks; Q-and-As, Quizzes, and Exercises build and test your knowledge; "Did You Know?" tips offer insider advice and shortcuts; and "Watch Out!" alerts help you avoid pitfalls. By the time you're finished, you'll be comfortable using Apache Spark to solve a wide spectrum of Big Data problems.

In Search of Database Nirvana

2016-08-15 O'Reilly Amazon

book

Rohit Jain

data data-engineering search BI Big Data Hadoop

The database pendulum is in full swing. Ten years ago, web-scale companies began moving away from proprietary relational databases to handle big data use cases with NoSQL and Hadoop. Now, for a variety of reasons, the pendulum is swinging back toward SQL-based solutions. What many companies really want is a system that can handle all of their operational, OLTP, BI, and analytic workloads. Could such an all-in-one database exist? This O’Reilly report examines this quest for database nirvana, or what Gartner recently dubbed Hybrid Transaction/Analytical Processing (HTAP). Author Rohit Jain takes an in-depth look at the possibilities and the challenges for companies that long for a single query engine to rule them all.
With this report, you’ll explore:
• The challenges of having one query engine support operational, BI, and analytical workloads
• Efforts to produce a query engine that supports multiple storage engines
• Attempts to support multiple data models with the same query engine
• Why an HTAP database engine needs to provide enterprise-caliber capabilities, including high availability, security, and manageability
• How to assess various options for meeting workload requirements with one database engine, or a combination of query and storage engines

Interactive Spark using PySpark

2016-08-15 O'Reilly Amazon

book

Benjamin Bengfort , Jenny Kim

data data-engineering apache-spark PySpark AI/ML Analytics

Apache Spark is an in-memory framework that allows data scientists to explore and interact with big data much more quickly than with Hadoop. Python users can work with Spark using an interactive shell called PySpark.
Why is it important?
PySpark makes the large-scale data processing capabilities of Apache Spark accessible to data scientists who are more familiar with Python than Scala or Java. This also allows for reuse of a wide variety of Python libraries for machine learning, data visualization, numerical analysis, etc.
What you'll learn, and how you can apply it
• Compare the different components provided by Spark, and what use cases they fit.
• Learn how to use RDDs (resilient distributed datasets) with PySpark.
• Write Spark applications in Python and submit them to the cluster as Spark jobs.
• Get an introduction to the Spark computing framework.
• Apply this approach to a worked example to determine the most frequent airline delays in a specific month and year.
This lesson is for you because…
• You're a data scientist, familiar with Python coding, who needs to get up and running with PySpark
• You're a Python developer who needs to leverage the distributed computing resources available on a Hadoop cluster, without learning Java or Scala first
Prerequisites
• Familiarity with writing Python applications
• Some familiarity with bash command-line operations
• Basic understanding of how to use simple functional programming constructs in Python, such as closures, lambdas, maps, etc.
Materials or downloads needed in advance
• Apache Spark
This lesson is taken from Data Analytics with Hadoop by Jenny Kim and Benjamin Bengfort.

Practical Hadoop Migration: How to Integrate Your RDBMS with the Hadoop Ecosystem and Re-Architect Relational Applications to NoSQL

2016-08-10 O'Reilly Amazon

book

Bhushan Lakhe

data data-engineering Hadoop AWS Lambda Big Data Data Lake

Re-architect relational applications to NoSQL, integrate relational database management systems with the Hadoop ecosystem, and transform and migrate relational data to and from Hadoop components. This book covers the best-practice design approaches to re-architecting your relational applications and transforming your relational data to optimize concurrency, security, denormalization, and performance. Winner of IBM's 2012 Gerstner Award for his implementation of big data and data warehouse initiatives and author of Practical Hadoop Security, author Bhushan Lakhe walks you through the entire transition process. First, he lays out the criteria for deciding what blend of re-architecting, migration, and integration between RDBMS and HDFS best meets your transition objectives. Then he demonstrates how to design your transition model. Lakhe proceeds to cover the selection criteria for ETL tools, the implementation steps for migration with Sqoop- and Flume-based data transfers, and transition optimization techniques for tuning partitions, scheduling aggregations, and redesigning ETL. Finally, he assesses the pros and cons of data lakes and Lambda architecture as integrative solutions and illustrates their implementation with real-world case studies. Hadoop/NoSQL solutions do not offer by default certain relational technology features such as role-based access control, locking for concurrent updates, and various tools for measuring and enhancing performance. Practical Hadoop Migration shows how to use open-source tools to emulate such relational functionalities in Hadoop ecosystem components.
What You'll Learn
• Decide whether you should migrate your relational applications to big data technologies or integrate them
• Transition your relational applications to Hadoop/NoSQL platforms in terms of logical design and physical implementation
• Discover RDBMS-to-HDFS integration, data transformation, and optimization techniques
• Consider when to use Lambda architecture and data lake solutions
• Select and implement Hadoop-based components and applications to speed transition, optimize integrated performance, and emulate relational functionalities
Who This Book Is For
Database developers, database administrators, enterprise architects, Hadoop/NoSQL developers, and IT leaders. Its secondary readership is project and program managers and advanced students of database and management information systems.

The Big Data Market

2016-07-19 O'Reilly Amazon

book

Aman Naimat

data data-engineering Big Data Data Science Hadoop Spark

Which companies have adopted technologies such as Hadoop and Spark, as well as data science in general? And which industries are lagging behind? This O’Reilly report provides the results of a unique, data-driven analysis of the market for big data products and technologies. Using eye-catching charts and visualizations, Spiderbook cofounder Aman Naimat highlights some surprising results from the analysis, such as:
• The relatively small number of companies using big data in production
• Industries that have embraced big data the most, and the least
• The amount of money spent on various big data use cases
• How many companies actually use “fast data”
The results also reveal the geographical locations where companies have been quick to adopt big data, as well as the types of teams that use big data technology. In addition, Naimat takes you through the analysis process with Spiderbook’s graph-based machine-learning model. The company analyzed billions of publicly available documents, canvassed more than 500,000 companies, and searched the entire business internet to compile the most comprehensive results possible.

Pro Spark Streaming: The Zen of Real-Time Analytics Using Apache Spark

2016-06-13 O'Reilly Amazon

book

Zubair Nabi

data data-engineering apache-spark AI/ML Analytics AWS Lambda

Learn the right cutting-edge skills and knowledge to leverage Spark Streaming to implement a wide array of real-time, streaming applications. This book walks you through end-to-end real-time application development using real-world applications, data, and code. Taking an application-first approach, each chapter introduces use cases from a specific industry and uses publicly available datasets from that domain to unravel the intricacies of production-grade design and implementation. The domains covered in Pro Spark Streaming include social media, the sharing economy, finance, online advertising, telecommunication, and IoT. In the last few years, Spark has become synonymous with big data processing. DStreams enhance the underlying Spark processing engine to support streaming analysis with a novel micro-batch processing model. Pro Spark Streaming by Zubair Nabi will enable you to become a specialist in latency-sensitive applications by leveraging the key features of DStreams, micro-batch processing, and functional programming. To this end, the book includes ready-to-deploy examples and actual code. Pro Spark Streaming will act as the bible of Spark Streaming.
What You'll Learn
• Discover Spark Streaming application development and best practices
• Work with the low-level details of discretized streams
• Optimize production-grade deployments of Spark Streaming via configuration recipes and instrumentation using Graphite, collectd, and Nagios
• Ingest data from disparate sources including MQTT, Flume, Kafka, Twitter, and a custom HTTP receiver
• Integrate and couple with HBase, Cassandra, and Redis
• Take advantage of design patterns for side-effects and maintaining state across the Spark Streaming micro-batch model
• Implement real-time and scalable ETL using data frames, SparkSQL, Hive, and SparkR
• Use streaming machine learning, predictive analytics, and recommendations
• Mesh batch processing with stream processing via the Lambda architecture
Who This Book Is For
Data scientists, big data experts, BI analysts, and data architects.

Spark GraphX in Action

2016-06-13 O'Reilly Amazon

book

Michael Malak , Robin East

data data-engineering apache-spark AI/ML Analytics API

Spark GraphX in Action starts out with an overview of Apache Spark and the GraphX graph processing API. This example-based tutorial then teaches you how to configure GraphX and how to use it interactively. Along the way, you'll collect practical techniques for enhancing applications and applying machine learning algorithms to graph data.
About the Technology
GraphX is a powerful graph processing API for the Apache Spark analytics engine that lets you draw insights from large datasets. GraphX gives you unprecedented speed and capacity for running massively parallel and machine learning algorithms.
About the Book
Spark GraphX in Action begins with the big picture of what graphs can be used for. This example-based tutorial teaches you how to use GraphX interactively. You'll start with a crystal-clear introduction to building big data graphs from regular data, and then explore the problems and possibilities of implementing graph algorithms and architecting graph processing pipelines. Along the way, you'll collect practical techniques for enhancing applications and applying machine learning algorithms to graph data.
What's Inside
• Understanding graph technology
• Using the GraphX API
• Developing algorithms for big graphs
• Machine learning with graphs
• Graph visualization
About the Reader
Readers should be comfortable writing code. Experience with Apache Spark and Scala is not required.
About the Authors
Michael Malak has worked on Spark applications for Fortune 500 companies since early 2013. Robin East has worked as a consultant to large organizations for over 15 years and is a data scientist at Worldpay.
Quotes
Learn complex graph processing from two experienced authors…A comprehensive guide. - Gaurav Bhardwaj, 3Pillar Global
The best resource to go from GraphX novice to expert in the least amount of time. - Justin Fister, PaperRater
A must-read for anyone serious about large-scale graph data mining! - Antonio Magnaghi, OpenMail
Reveals the awesome and elegant capabilities of working with linked data for large-scale datasets. - Sumit Pal, Independent consultant

Big Data

2016-06-07 O'Reilly Amazon

book

Rajkumar Buyya , Rodrigo N. Calheiros , Amir Vahid Dastjerdi

data data-engineering AI/ML Big Data Data Management Data Modelling

Big Data: Principles and Paradigms captures the state-of-the-art research on the architectural aspects, technologies, and applications of Big Data. The book identifies potential future directions and technologies that facilitate insight into numerous scientific, business, and consumer applications. To help realize Big Data’s full potential, the book addresses numerous challenges, offering the conceptual and technological solutions for tackling them. These challenges include life-cycle data management, large-scale storage, flexible processing infrastructure, data modeling, scalable machine learning, data analysis algorithms, sampling techniques, and privacy and ethical issues.
• Covers computational platforms supporting Big Data applications
• Addresses key principles underlying Big Data computing
• Examines key developments supporting next generation Big Data platforms
• Explores the challenges in Big Data computing and ways to overcome them
• Contains expert contributors from both academia and industry

Implementing an Optimized Analytics Solution on IBM Power Systems

2016-06-01 O'Reilly Amazon

book

Dino Quintero , Robert Simon , Reinaldo Tetsuo Katahira , Kanako Harada , Brian Yaeger , Antonio Moreira de Oliveira Neto

data data-engineering IBM Analytics Big Data Cyber Security

This IBM® Redbooks® publication addresses topics to use the virtualization strengths of the IBM POWER8® platform to solve clients' system resource utilization challenges and maximize systems' throughput and capacity. This book addresses performance tuning topics that will help answer clients' complex analytic workload requirements, help maximize systems' resources, and provide expert-level documentation to transfer the how-to skills to the worldwide teams. This book strengthens the position of IBM Analytics and Big Data solutions with a well-defined and documented deployment model within a POWER8 virtualized environment, offering clients a planned foundation for security, scaling, capacity, resilience, and optimization for analytics workloads. This book is targeted toward technical professionals (analytics consultants, technical support staff, IT Architects, and IT Specialists) who are responsible for providing analytics solutions and support on IBM Power Systems™.

Apache Spark Machine Learning Blueprints

2016-05-30 O'Reilly Amazon

book

Alex Liu

data data-engineering apache-spark AI/ML Analytics Big Data

In 'Apache Spark Machine Learning Blueprints', you'll explore how to create sophisticated and scalable machine learning projects using Apache Spark. This project-driven guide covers practical applications including fraud detection, customer analysis, and recommendation engines, helping you leverage Spark's capabilities for advanced data science tasks.
What this Book will help me do
• Learn to set up Apache Spark efficiently for machine learning projects, unlocking its powerful processing capabilities.
• Integrate Apache Spark with R for detailed analytical insights, empowering your decision-making processes.
• Create predictive models for use cases including customer scoring, fraud detection, and risk assessment with practical implementations.
• Understand and utilize Spark's parallel computing architecture for large-scale machine learning tasks.
• Develop and refine recommendation systems capable of handling large user bases and datasets using Spark.
Author(s)
Alex Liu is a seasoned data scientist and software developer specializing in machine learning and big data technology. With extensive experience in using Apache Spark for predictive analytics, Alex has successfully built and deployed scalable solutions across industries. Their teaching approach combines theory and practical insights, making cutting-edge technologies accessible and actionable.
Who is it for?
This book is ideal for data analysts, data scientists, and developers with a foundation in machine learning who are eager to apply their knowledge in big data contexts. If you have a basic familiarity with Apache Spark and its ecosystem, and you're looking to enhance your ability to build machine learning applications, this resource is for you. It's particularly valuable for those aiming to utilize Spark for extensive data operations and gain practical, project-based insights.

Streaming Architecture

2016-05-25 O'Reilly Amazon

book

Ellen Friedman , Ted Dunning

data data-engineering streaming-messaging streaming-architecture Analytics Flink

More and more data-driven companies are looking to adopt stream processing and streaming analytics. With this concise ebook, you’ll learn best practices for designing a reliable architecture that supports this emerging big-data paradigm. Authors Ted Dunning and Ellen Friedman (Real World Hadoop) help you explore some of the best technologies to handle stream processing and analytics, with a focus on the upstream queuing or message-passing layer. To illustrate the effectiveness of these technologies, this book also includes specific use cases.
Ideal for developers and non-technical people alike, this book describes:
• Key elements in good design for streaming analytics, focusing on the essential characteristics of the messaging layer
• New messaging technologies, including Apache Kafka and MapR Streams, with links to sample code
• Technology choices for streaming analytics: Apache Spark Streaming, Apache Flink, Apache Storm, and Apache Apex
• How stream-based architectures are helpful to support microservices
• Specific use cases such as fraud detection and geo-distributed data streams
Ted Dunning is Chief Applications Architect at MapR Technologies, and active in the open source community. He currently serves as VP for Incubator at the Apache Foundation, as a champion and mentor for a large number of projects, and as committer and PMC member of the Apache ZooKeeper and Drill projects. Ted is on Twitter as @ted_dunning.
Ellen Friedman, a committer for the Apache Drill and Apache Mahout projects, is a solutions consultant and well-known speaker and author, currently writing mainly about big data topics. With a PhD in Biochemistry, she has years of experience as a research scientist and has written about a variety of technical topics. Ellen is on Twitter as @Ellen_Friedman.

Professional Hadoop

2016-05-23 O'Reilly Amazon

book

Benoy Antony , Cheryl Adams , Cazen Lee , Konstantin Boudnik , Branky Shao , Kai Sasaki

data data-engineering Hadoop Big Data Java Kafka

The professional's one-stop guide to this open-source, Java-based big data framework
Professional Hadoop is the complete reference and resource for experienced developers looking to employ Apache Hadoop in real-world settings. Written by an expert team of certified Hadoop developers, committers, and Summit speakers, this book details every key aspect of Hadoop technology to enable optimal processing of large data sets. Designed expressly for the professional developer, this book skips over the basics of database development to get you acquainted with the framework's processes and capabilities right away. The discussion covers each key Hadoop component individually, culminating in a sample application that brings all of the pieces together to illustrate the cooperation and interplay that make Hadoop a major big data solution. Coverage includes everything from storage and security to computing and user experience, with expert guidance on integrating other software and more. Hadoop is quickly reaching significant market usage, and more and more developers are being called upon to develop big data solutions using the Hadoop framework. This book covers the process from beginning to end, providing a crash course for professionals needing to learn and apply Hadoop quickly.
• Configure storage, UE, and in-memory computing
• Integrate Hadoop with other programs including Kafka and Storm
• Master the fundamentals of Apache Bigtop and Ignite
• Build robust data security with expert tips and advice
Hadoop's popularity is largely due to its accessibility. Open-source and written in Java, the framework offers almost no barrier to entry for experienced database developers already familiar with the skills and requirements real-world programming entails. Professional Hadoop gives you the practical information and framework-specific skills you need quickly.

Big Data in Practice

2016-05-02 O'Reilly Amazon

book

Bernard Marr

data data-engineering Analytics Big Data Microsoft

The best-selling author of Big Data is back, this time with a unique and in-depth insight into how specific companies use big data. Big data is on the tip of everyone's tongue. Everyone understands its power and importance, but many fail to grasp the actionable steps and resources required to utilise it effectively. This book fills the knowledge gap by showing how major companies are using big data every day, from an up-close, on-the-ground perspective. From technology, media and retail, to sport teams, government agencies and financial institutions, learn the actual strategies and processes being used to learn about customers, improve manufacturing, spur innovation, improve safety and so much more. Organised for easy dip-in navigation, each chapter follows the same structure to give you the information you need quickly. For each company profiled, learn what data was used, what problem it solved and the processes put in place to make it practical, as well as the technical details, challenges and lessons learned from each unique scenario. Learn how predictive analytics helps Amazon, Target, John Deere and Apple understand their customers. Discover how big data is behind the success of Walmart, LinkedIn, Microsoft and more. Learn how big data is changing medicine, law enforcement, hospitality, fashion, science and banking. Develop your own big data strategy by accessing additional reading materials at the end of each chapter.

Apache Hive Cookbook

2016-04-29 O'Reilly Amazon

book

Saurabh Chauhan , Hanish Bansal , Shrey Mehrotra

data data-engineering Hadoop apache-hive Analytics Big Data

Apache Hive Cookbook is a comprehensive resource for mastering Apache Hive, a tool that bridges the gap between SQL and Big Data processing. Through guided recipes, you'll acquire essential skills in Hive query development, optimization, and integration with modern big data frameworks. What this Book will help me do Design efficient Hive query structures for big data analytics. Optimize data storage and query execution using partitions and buckets. Integrate Hive seamlessly with frameworks like Spark and Hadoop. Understand and utilize the HiveQL syntax to perform advanced analytical processing. Implement practical solutions to secure, maintain, and scale Hive environments. Author(s) Hanish Bansal, Saurabh Chauhan, and Shrey Mehrotra bring their extensive expertise in big data technologies and Hive to this cookbook. With years of practical experience and deep technical knowledge, they offer a collection of solutions and best practices that reflect real-world use cases. Their commitment to clarity and depth makes this book an invaluable resource for exploring Hive to its fullest potential. Who is it for? This book is perfect for data professionals, engineers, and developers looking to enhance their capabilities in big data analytics using Hive. It caters to those with a foundational understanding of big data frameworks and some familiarity with SQL. Whether you're planning to optimize data handling or integrate Hive with other data tools, this guide helps you achieve your goals. Step into the world of efficient data analytics with Apache Hive through structured learning paths.

Big Data

2016-04-27 O'Reilly Amazon

book

Fei Hu

data data-engineering Big Data Data Management Cyber Security

Big Data: Storage, Sharing, and Security examines Big Data management from an R&D perspective. It covers the 3S designs (storage, sharing, and security) through detailed descriptions of Big Data concepts and implementations. Presenting the contributions of recognized Big Data experts from around the world, the book contains more than 450 pages of technical details on the most important implementation aspects regarding Big Data.

Relational Database Design and Implementation, 4th Edition

2016-04-15 O'Reilly Amazon

book

Jan L. Harrington

data data-engineering relational-databases Big Data Cloud Computing Data Modelling

Relational Database Design and Implementation: Clearly Explained, Fourth Edition, provides the conceptual and practical information necessary to develop a database design and management scheme that ensures data accuracy and user satisfaction while optimizing performance. Database systems underlie the large majority of business information systems. Most of those in use today are based on the relational data model, a way of representing data and data relationships using only two-dimensional tables. This book covers relational database theory as well as providing a solid introduction to SQL, the international standard for the relational database data manipulation language. The book begins by reviewing basic concepts of databases and database design, then turns to creating, populating, and retrieving data using SQL. Topics such as the relational data model, normalization, data entities, and Codd's Rules (and why they are important) are covered clearly and concisely. In addition, the book looks at the impact of big data on relational databases and the option of using NoSQL databases for that purpose. Features updated and expanded coverage of SQL and new material on big data, cloud computing, and object-relational databases. Presents design approaches that ensure data accuracy and consistency and help boost performance. Includes three case studies, each illustrating a different database design challenge. Reviews the basic concepts of databases and database design, then turns to creating, populating, and retrieving data using SQL.

The Hadoop Performance Myth

2016-04-15 O'Reilly Amazon

book

Courtney Webster

data data-engineering Hadoop Big Data

The wish lists of many data-driven organizations seem reasonable enough. They’d like to capitalize on real-time data analysis, move beyond batch processing for time-critical insights, allow multiple users to share cluster resources, and provide predictable service levels. However, fundamental performance limitations of complex distributed systems such as Hadoop prevent much of this from happening. In this report, Courtney Webster examines the root cause of these performance problems and explains why best practices for mitigating them—cluster tuning, provisioning, and even cluster isolation for mission critical jobs—don’t provide viable, scalable, or long-term solutions. Organizations have been pushing Hadoop and other distributed systems to their performance breaking points as they seek to use clusters as shared resources across multiple business units and individual users. Once they hit this performance wall, companies will find it difficult to deliver on the big data promise at scale. Read this report to find out what the implications are for your organization.

Hadoop Real-World Solutions Cookbook - Second Edition

2016-03-31 O'Reilly Amazon

book

Tanmay Deshpande

data data-engineering Hadoop AI/ML Analytics Big Data

Master the full potential of big data processing using Hadoop with this comprehensive guide. Featuring over 90 practical recipes, this book helps you streamline data workflows and implement machine learning models with tools like Spark, Hive, and Pig. By the end, you'll confidently handle complex data problems and optimize big data solutions effectively. What this Book will help me do Install and manage a Hadoop 2.x cluster efficiently to suit your data processing needs. Explore and utilize advanced tools like Hive, Pig, and Flume for seamless big data analysis. Master data import/export processes with Sqoop and workflow automation using Oozie. Implement machine learning and analytics tasks using Mahout and Apache Spark. Store and process data flexibly across formats like Parquet, ORC, RC, and more. Author(s) Tanmay Deshpande is an expert in big data processing and analytics with years of hands-on experience in implementing Hadoop-based solutions for real-world problems. Known for a clear and pragmatic writing style, Deshpande brings actionable wisdom and best practices to the forefront, helping readers excel in managing and utilizing big data systems. Who is it for? Designed for technical enthusiasts and professionals, this book is ideal for those familiar with basic big data concepts. If you are looking to expand your expertise in Hadoop's ecosystem and implement data-driven solutions, this book will guide you through essential skills and advanced techniques to efficiently manage complex big data projects.

MongoDB in Action, Second Edition

2016-03-29 O'Reilly Amazon

book

Douglas Garrett , Shaun Verch , Kyle Banker , Tim Hawkins , Peter Bakkum

data data-engineering nosql-databases MongoDB Analytics Big Data

MongoDB in Action, Second Edition is a completely revised and updated version. It introduces MongoDB 3.0 and the document-oriented database model. This perfectly paced book gives you both the big picture you'll need as a developer and enough low-level detail to satisfy system engineers. About the Technology This document-oriented database was built for high availability, supports rich, dynamic schemas, and lets you easily distribute data across multiple servers. MongoDB 3.0 is flexible, scalable, and very fast, even with big data loads. About the Book Lots of examples will help you develop confidence in the crucial area of data modeling. You'll also love the deep explanations of each feature, including replication, auto-sharding, and deployment. What's Inside Indexes, queries, and standard DB operations. Aggregation and text searching. Map-reduce for custom aggregations and reporting. Deploying for scale and high availability. Updated for Mongo 3.0. About the Reader Written for developers. No previous MongoDB or NoSQL experience is assumed. About the Authors After working at MongoDB, Kyle Banker is now at a startup. Peter Bakkum is a developer with MongoDB expertise. Shaun Verch has worked on the core server team at MongoDB. A Genentech engineer, Doug Garrett is one of the winners of the MongoDB Innovation Award for Analytics. A software architect, Tim Hawkins has led search engineering at Yahoo Europe.
Technical Contributor: Wouter Thielen. Technical Editor: Mihalis Tsoukalos. Quotes A thorough manual for learning, practicing, and implementing MongoDB. - Jeet Marwah, Acer Inc. A must-read to properly use MongoDB and model your data in the best possible way. - Hernan Garcia, Betterez Inc. Provides all the necessary details to get you jump-started with MongoDB. - Gregor Zurowski, Independent Software Development Consultant Awesome! MongoDB in a nutshell. - Hardy Ferentschik, Red Hat

Big Data, Open Data and Data Development

2016-03-28 O'Reilly Amazon

book

Soraya Sedkaoui , Jean-Louis Monino

data data-engineering Big Data Cloud Computing

The world has become digital and technological advances have multiplied circuits with access to data, their processing and their diffusion. New technologies have now reached a certain maturity. Data are available to everyone, anywhere on the planet. The number of Internet users in 2014 was 2.9 billion or 41% of the world population. The need for knowledge is becoming apparent in order to understand this multitude of data. We must educate, inform and train the masses. The development of related technologies, such as the advent of the Internet, social networks, "cloud-computing" (digital factories), has increased the available volumes of data. Currently, each individual creates, consumes, uses digital information: more than 3.4 million e-mails are sent worldwide every second, or 107,000 billion annually with 14,600 e-mails per year per person, but more than 70% are spam. Billions of pieces of content are shared on social networks such as Facebook, more than 2.46 million every minute. We spend more than 4.8 hours a day on the Internet using a computer, and 2.1 hours using a mobile. Data, this new ethereal manna from heaven, is produced in real time. It comes in a continuous stream from a multitude of sources which are generally heterogeneous. This accumulation of data of all types (audio, video, files, photos, etc.) generates new activities, the aim of which is to analyze this enormous mass of information. It is then necessary to adapt and try new approaches, new methods, new knowledge and new ways of working, resulting in new properties and new challenges since SEO logic must be created and implemented. At company level, this mass of data is difficult to manage. Its interpretation is primarily a challenge. This impacts those who are there to "manipulate" the mass and requires a specific infrastructure for creation, storage, processing, analysis and recovery. The biggest challenge lies in "the valuing of data" available in quantity, diversity and access speed.
