talk-data.com

Topic: Hadoop (Apache Hadoop)
Tags: big_data, distributed_computing, data_processing
258 tagged activities

Activity Trend: peak of 3 activities per quarter, 2020-Q1 through 2026-Q1

Activities

258 activities · Newest first

Hadoop Operations

If you’ve been asked to maintain large and complex Hadoop clusters, this book is a must. Demand for operations-specific material has skyrocketed now that Hadoop is becoming the de facto standard for truly large-scale data processing in the data center. Eric Sammer, Principal Solution Architect at Cloudera, shows you the particulars of running Hadoop in production, from planning, installing, and configuring the system to providing ongoing maintenance. Rather than run through all possible scenarios, this pragmatic operations guide calls out what works, as demonstrated in critical deployments.
- Get a high-level overview of HDFS and MapReduce: why they exist and how they work
- Plan a Hadoop deployment, from hardware and OS selection to network requirements
- Learn setup and configuration details with a list of critical properties
- Manage resources by sharing a cluster across multiple groups
- Get a runbook of the most common cluster maintenance tasks
- Monitor Hadoop clusters, and learn troubleshooting with the help of real-world war stories
- Use basic tools and techniques to handle backup and catastrophic failure

R in a Nutshell, 2nd Edition

If you’re considering R for statistical computing and data visualization, this book provides a quick and practical guide to just about everything you can do with the open source R language and software environment. You’ll learn how to write R functions and use R packages to help you prepare, visualize, and analyze data. Author Joseph Adler illustrates each process with a wealth of examples from medicine, business, and sports. Updated for R 2.14 and 2.15, this second edition includes new and expanded chapters on R performance, the ggplot2 data visualization package, and parallel R computing with Hadoop.
- Get started quickly with an R tutorial and hundreds of examples
- Explore R syntax, objects, and other language details
- Find thousands of user-contributed R packages online, including Bioconductor
- Learn how to use R to prepare data for analysis
- Visualize your data with R’s graphics, lattice, and ggplot2 packages
- Use R to calculate statistical tests, fit models, and compute probability distributions
- Speed up intensive computations by writing parallel R programs for Hadoop
- Get a complete desktop reference to R

Hadoop in Practice

Hadoop in Practice collects 85 Hadoop examples and presents them in a problem/solution format. Each technique addresses a specific task you'll face, like querying big data using Pig or writing a log file loader. You'll explore each problem step by step, learning both how to build and deploy that specific solution along with the thinking that went into its design. As you work through the tasks, you'll find yourself growing more comfortable with Hadoop and at home in the world of big data.

About the Technology: Hadoop is an open source MapReduce platform designed to query and analyze data distributed across large clusters. Especially effective for big data systems, Hadoop powers mission-critical software at Apple, eBay, LinkedIn, Yahoo, and Facebook. It offers developers handy ways to store, manage, and analyze data.

About the Book: Hadoop in Practice collects 85 battle-tested examples and presents them in a problem/solution format. It balances conceptual foundations with practical recipes for key problem areas like data ingress and egress, serialization, and LZO compression. You'll explore each technique step by step, learning how to build a specific solution along with the thinking that went into it. As a bonus, the book's examples create a well-structured and understandable codebase you can tweak to meet your own needs.

What's Inside:
- Conceptual overview of Hadoop and MapReduce
- 85 practical, tested techniques
- Real problems, real solutions
- How to integrate MapReduce and R

About the Reader: This book assumes you've already started exploring Hadoop and want concrete advice on how to use it in production.

About the Author: Alex Holmes is a senior software engineer with extensive expertise in solving big data problems using Hadoop. He has presented at JavaOne and Jazoon and is a technical lead at VeriSign.

Quotes:
- "Interesting topics that tickle the creative brain." - Mark Kemna, Brillig
- "Ties together the Hadoop ecosystem technologies." - Ayon Sinha, Britely
- "Comprehensive … high-quality code samples." - Chris Nauroth, The Walt Disney Company
- "Covers all of the variants of Hadoop, not just the Apache distribution." - Ted Dunning, MapR Technologies
- "Charts a path to the future." - Alexey Gayduk, Grid Dynamics

Programming Hive

Need to move a relational database application to Hadoop? This comprehensive guide introduces you to Apache Hive, Hadoop’s data warehouse infrastructure. You’ll quickly learn how to use Hive’s SQL dialect, HiveQL, to summarize, query, and analyze large datasets stored in Hadoop’s distributed filesystem. This example-driven guide shows you how to set up and configure Hive in your environment, provides a detailed overview of Hadoop and MapReduce, and demonstrates how Hive works within the Hadoop ecosystem. You’ll also find real-world case studies that describe how companies have used Hive to solve unique problems involving petabytes of data.
- Use Hive to create, alter, and drop databases, tables, views, functions, and indexes
- Customize data formats and storage options, from files to external databases
- Load and extract data from tables, and use queries, grouping, filtering, joining, and other conventional query methods
- Gain best practices for creating user defined functions (UDFs)
- Learn Hive patterns you should use and anti-patterns you should avoid
- Integrate Hive with other data processing programs
- Use storage handlers for NoSQL databases and other datastores
- Learn the pros and cons of running Hive on Amazon’s Elastic MapReduce
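
Since the blurb leans on HiveQL, here is a minimal, hedged sketch of what issuing such a query from Python can look like. It assumes the third-party PyHive client and a running HiveServer2; the host, port, and the access_logs table are hypothetical placeholders.

```python
# A hedged sketch: running HiveQL from Python over HiveServer2 using the
# third-party PyHive client. Host, port, database, and the access_logs table
# are hypothetical; the point is that HiveQL reads like SQL while Hive
# executes it over data stored in Hadoop's distributed filesystem.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# A typical summarize/group/sort query of the kind the book walks through.
cursor.execute("""
    SELECT page, count(*) AS hits
    FROM access_logs
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")
for page, hits in cursor.fetchall():
    print(page, hits)
conn.close()
```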

HBase Administration Cookbook

The "HBase Administration Cookbook" is your hands-on guide to mastering HBase administration and configuration. Through practical recipes, this book covers the essential tasks like setting up clusters, optimizing performance, and integrating with the Hadoop ecosystem to manage vast amounts of data effectively. What this Book will help me do Set up and administer HBase clusters for scalability and high availability. Perform routine HBase management tasks confidently and efficiently. Optimize HBase and Hadoop ecosystem settings for maximum performance. Understand troubleshooting to address and resolve typical HBase issues. Leverage advanced configurations for specific read/write-heavy use cases. Author(s) Yifeng Jiang is a seasoned software engineer and database expert with deep experience in working with distributed databases like HBase. He is passionate about teaching and conveying complex concepts through approachable explanations and actionable steps. Yifeng's writing style reflects his hands-on expertise and focus on practical application. Who is it for? This book is designed for system administrators, database managers, and developers looking to master HBase administration and configuration. Whether you are relatively new to HBase with basic familiarity with Hadoop or are an experienced Hadoop administrator wanting to enhance your database management skills, this book provides valuable insights and thorough guidance.

Hadoop: The Definitive Guide, 3rd Edition

Ready to unlock the power of your data? With this comprehensive guide, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters. You’ll find illuminating case studies that demonstrate how Hadoop is used to solve specific problems. This third edition covers recent changes to Hadoop, including material on the new MapReduce API, as well as MapReduce 2 and its more flexible execution model (YARN).
- Store large datasets with the Hadoop Distributed File System (HDFS)
- Run distributed computations with MapReduce
- Use Hadoop’s data and I/O building blocks for compression, data integrity, serialization (including Avro), and persistence
- Discover common pitfalls and advanced features for writing real-world MapReduce programs
- Design, build, and administer a dedicated Hadoop cluster, or run Hadoop in the cloud
- Load data from relational databases into HDFS, using Sqoop
- Perform large-scale data processing with the Pig query language
- Analyze datasets with Hive, Hadoop’s data warehousing system
- Take advantage of HBase for structured and semi-structured data, and ZooKeeper for building distributed systems
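
The MapReduce style the book teaches is easiest to grasp from a tiny example. Below is a minimal sketch of the classic word-count pattern as a pair of Hadoop Streaming scripts in Python; the file names (mapper.py, reducer.py) are illustrative.

```python
#!/usr/bin/env python3
# mapper.py -- the map phase via Hadoop Streaming: read raw text on stdin,
# emit one "word<TAB>1" pair per word on stdout.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- the reduce phase. Hadoop Streaming delivers mapper output
# sorted by key, so all counts for one word arrive together and can be
# summed with a single running counter.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, _, count = line.rstrip("\n").partition("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

As a local sanity check the pair can be run as `cat input.txt | python3 mapper.py | sort | python3 reducer.py`; on a cluster the same two scripts are submitted through the Hadoop Streaming jar, whose path varies by distribution.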

Planning for Big Data

In an age where everything is measurable, understanding big data is essential. From creating new data-driven products through to increasing operational efficiency, big data has the potential to make your organization both more competitive and more innovative. As this emerging field transitions from the bleeding edge to enterprise infrastructure, it's vital to understand not only the technologies involved, but the organizational and cultural demands of being data-driven. Written by O'Reilly Radar's experts on big data, this anthology describes:
- The broad industry changes heralded by the big data era
- What big data is, what it means to your business, and how to start solving data problems
- The software that makes up the Hadoop big data stack, and the major enterprise vendors' Hadoop solutions
- The landscape of NoSQL databases and their relative merits
- How visualization plays an important part in data work

Programming Pig

This guide is an ideal learning tool and reference for Apache Pig, the open source engine for executing parallel data flows on Hadoop. With Pig, you can batch-process data without having to create a full-fledged application, making it easy for you to experiment with new datasets. Programming Pig introduces new users to Pig, and provides experienced users with comprehensive coverage on key features such as the Pig Latin scripting language, the Grunt shell, and User Defined Functions (UDFs) for extending Pig. If you need to analyze terabytes of data, this book shows you how to do it efficiently with Pig.
- Delve into Pig’s data model, including scalar and complex data types
- Write Pig Latin scripts to sort, group, join, project, and filter your data
- Use Grunt to work with the Hadoop Distributed File System (HDFS)
- Build complex data processing pipelines with Pig’s macros and modularity features
- Embed Pig Latin in Python for iterative processing and other advanced tasks
- Create your own load and store functions to handle data formats and storage mechanisms
- Get performance tips for running scripts on Hadoop clusters in less time

HBase: The Definitive Guide

If you're looking for a scalable storage solution to accommodate a virtually endless amount of data, this book shows you how Apache HBase can fulfill your needs. As the open source implementation of Google's BigTable architecture, HBase scales to billions of rows and millions of columns, while ensuring that write and read performance remain constant. Many IT executives are asking pointed questions about HBase. This book provides meaningful answers, whether you’re evaluating this non-relational database or planning to put it into practice right away.
- Discover how tight integration with Hadoop makes scalability with HBase easier
- Distribute large datasets across an inexpensive cluster of commodity servers
- Access HBase with native Java clients, or with gateway servers providing REST, Avro, or Thrift APIs
- Get details on HBase’s architecture, including the storage format, write-ahead log, background processes, and more
- Integrate HBase with Hadoop's MapReduce framework for massively parallelized data processing jobs
- Learn how to tune clusters, design schemas, copy tables, import bulk data, decommission nodes, and many other tasks
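
To make the access model concrete, here is a hedged sketch of reading and writing HBase from Python through the Thrift gateway mentioned above, using the third-party happybase client; the host, table name, and column family are hypothetical.

```python
# A hedged sketch: HBase access via its Thrift gateway with the third-party
# happybase client. Assumes an HBase Thrift server is running (default port
# 9090); the "metrics" table and "cf" column family are made up.
import happybase

connection = happybase.Connection("localhost")
table = connection.table("metrics")

# HBase rows are byte keys; cells are addressed as b"family:qualifier",
# reflecting its sparse, column-family-oriented storage model.
table.put(b"row-2026-01-15", {b"cf:requests": b"1024", b"cf:errors": b"3"})
print(table.row(b"row-2026-01-15"))  # -> {b'cf:requests': b'1024', b'cf:errors': b'3'}
connection.close()
```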

Professional NoSQL

A hands-on guide to leveraging NoSQL databases. NoSQL databases are an efficient and powerful tool for storing and manipulating vast quantities of data. Most NoSQL databases scale well as data grows. In addition, they are often malleable and flexible enough to accommodate semi-structured and sparse data sets. This comprehensive hands-on guide presents fundamental concepts and practical solutions for getting you ready to use NoSQL databases. Expert author Shashank Tiwari begins with a helpful introduction on the subject of NoSQL, explains its characteristics and typical uses, and looks at where it fits in the application stack. Unique insights help you choose which NoSQL solutions are best for solving your specific data storage needs.

Professional NoSQL:
- Demystifies the concepts that relate to NoSQL databases, including column-family oriented stores, key/value databases, and document databases.
- Delves into installing and configuring a number of NoSQL products and the Hadoop family of products.
- Explains ways of storing, accessing, and querying data in NoSQL databases through examples that use MongoDB, HBase, Cassandra, Redis, CouchDB, Google App Engine Datastore, and more.
- Looks at architecture and internals.
- Provides guidelines for optimal usage, performance tuning, and scalable configurations.
- Presents a number of tools and utilities relating to NoSQL, distributed platforms, and scalable processing, including Hive, Pig, RRDtool, Nagios, and more.

Hadoop in Action

Hadoop in Action introduces the subject and teaches you how to write programs in the MapReduce style. It starts with a few easy examples and then moves quickly to show Hadoop use in more complex data analysis tasks. Included are best practices and design patterns of MapReduce programming.

About the Technology: Big data can be difficult to handle using traditional databases. Apache Hadoop is a NoSQL applications framework that runs on distributed clusters. This lets it scale to huge datasets. If you need analytic information from your data, Hadoop's the way to go.

What's Inside:
- Introduction to MapReduce
- Examples illustrating ideas in practice
- Hadoop's Streaming API
- Other related tools, like Pig and Hive

About the Reader: This book requires basic Java skills. Knowing basic statistical concepts can help with the more advanced examples.

About the Author: Chuck Lam is a Senior Engineer at RockYou! He has a PhD in pattern recognition from Stanford University.

Quotes:
- "A guide for beginners, a source of insight for advanced users." - Philipp K. Janert, Principal Value, LLC
- "A nice mix of the what, why, and how of Hadoop." - Paul Stusiak, Falcon Technologies Corp.
- "Demystifies Hadoop. A great resource!" - Rick Wagner, Acxiom Corp.
- "Covers it all! Plus, gives you sweet extras no one else does." - John S. Griffin, Overstock.com
- "An excellent introduction to Hadoop and MapReduce." - Kenneth DeLong, BabyCenter, LLC

Hadoop: The Definitive Guide, 2nd Edition

Discover how Apache Hadoop can unleash the power of your data. This comprehensive resource shows you how to build and maintain reliable, scalable, distributed systems with the Hadoop framework, an open source implementation of MapReduce, the algorithm on which Google built its empire. Programmers will find details for analyzing datasets of any size, and administrators will learn how to set up and run Hadoop clusters. This revised edition covers recent changes to Hadoop, including new features such as Hive, Sqoop, and Avro. It also provides illuminating case studies that illustrate how Hadoop is used to solve specific problems. Looking to get the most out of your data? This is your book.
- Use the Hadoop Distributed File System (HDFS) for storing large datasets, then run distributed computations over those datasets with MapReduce
- Become familiar with Hadoop’s data and I/O building blocks for compression, data integrity, serialization, and persistence
- Discover common pitfalls and advanced features for writing real-world MapReduce programs
- Design, build, and administer a dedicated Hadoop cluster, or run Hadoop in the cloud
- Use Pig, a high-level query language for large-scale data processing
- Analyze datasets with Hive, Hadoop’s data warehousing system
- Take advantage of HBase, Hadoop’s database for structured and semi-structured data
- Learn ZooKeeper, a toolkit of coordination primitives for building distributed systems

"Now you have the opportunity to learn about Hadoop from a master, not only of the technology, but also of common sense and plain talk." - Doug Cutting, Cloudera

Lucene in Action, Second Edition

When Lucene first appeared, this superfast search engine was nothing short of amazing. Today, Lucene still delivers. Its high-performance, easy-to-use API, features like numeric fields, payloads, near-real-time search, and huge increases in indexing and searching speed make it the leading search tool. And with clear writing, reusable examples, and unmatched advice, Lucene in Action, Second Edition is still the definitive guide to effectively integrating search into your applications. This totally revised book shows you how to index your documents, including formats such as MS Word, PDF, HTML, and XML. It introduces you to searching, sorting, and filtering, and covers the numerous improvements to Lucene since the first edition. Source code is for Lucene 3.0.1.

What's Inside:
- Performing hot backups
- Using numeric fields
- Tuning for indexing or searching speed
- Boosting matches with payloads
- Creating reusable analyzers
- Adding concurrency with threads
- Four new case studies
- Much more!

About the Authors: Michael McCandless is a Lucene PMC member and committer with more than a decade of experience building search engines. Erik Hatcher and Otis Gospodnetić are the authors of the first edition of Lucene in Action and long-time contributors to Lucene, Solr, Mahout, and other Lucene-based projects.

Quotes:
- "... brings you up to speed." - Doug Cutting, Founder of Lucene, Nutch, and Hadoop
- "This new edition has it all." - Chad Davis, Blackdog Software, Author of Struts 2 in Action
- "Very readable, full of expert tips." - Rick Wagner, Acxiom Corp.
- "Elegant, and easy to read - just like Lucene itself." - Shai Erera, IBM Haifa Research Labs
- "For a Lucene developer, it's required reading." - Stuart Caborn, Thoughtworks

How to Evaluate the Job You’ve Been Offered

This Element is an excerpt from Rebound: A Proven Plan for Starting Over After Job Loss (ISBN: 9780137021147) by Martha I. Finney. Available in print and digital formats. Now that you’ve been offered the job, should you take it? Analyze prospective employers rationally and make decisions you won’t regret! Setting aside money for just a moment, so much more goes into deciding whether a potential employer is right for you. You need to know whether the company is a good fit and a reasonably logical step in your professional progression, not just an invitation to be unemployed again....

Trends Are an Investor’s Best Friend

This Element is an excerpt from The ETF Trend Following Playbook: Profiting from Trends in Bull or Bear Markets with Exchange Traded Funds (ISBN: 9780137029013) by Tom Lydon. Available in print and digital formats. Simple calculations that spot powerful market trends early, so there’s time to cash in on them! Of all the things you can teach yourself to become a better investor, the best is to learn how to identify trends. You probably do it now, to a degree. But by the time news of a trend spreads to the point where it’s cocktail-party fodder, the bulk of the profits have been made. Instead, you need to learn to spot trends as early as possible, to enjoy the longest ride possible.

Hadoop: The Definitive Guide

Hadoop: The Definitive Guide helps you harness the power of your data. Ideal for processing large datasets, the Apache Hadoop framework is an open source implementation of the MapReduce algorithm on which Google built its empire. This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems: programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoop clusters. Complete with case studies that illustrate how Hadoop solves specific problems, this book helps you:
- Use the Hadoop Distributed File System (HDFS) for storing large datasets, and run distributed computations over those datasets using MapReduce
- Become familiar with Hadoop's data and I/O building blocks for compression, data integrity, serialization, and persistence
- Discover common pitfalls and advanced features for writing real-world MapReduce programs
- Design, build, and administer a dedicated Hadoop cluster, or run Hadoop in the cloud
- Use Pig, a high-level query language for large-scale data processing
- Take advantage of HBase, Hadoop's database for structured and semi-structured data
- Learn ZooKeeper, a toolkit of coordination primitives for building distributed systems

If you have lots of data, whether it's gigabytes or petabytes, Hadoop is the perfect solution. Hadoop: The Definitive Guide is the most thorough book available on the subject.

"Now you have the opportunity to learn about Hadoop from a master, not only of the technology, but also of common sense and plain talk." - Doug Cutting, Hadoop Founder, Yahoo!

Big Data is Dead: Long Live Hot Data 🔥

Over the last decade, Big Data was everywhere. Let's set the record straight on what is and isn't Big Data. We have been consumed by a conversation about data volumes when we should focus more on the immediate task at hand: Simplifying our work.

Some of us may have Big Data, but our quest to derive insights from it is measured in small slices of work that fit on your laptop or in your hand. Easy data is here; let’s make the most of it.

📓 Resources
- Big Data is Dead: https://motherduck.com/blog/big-data-is-dead/
- Small Data Manifesto: https://motherduck.com/blog/small-data-manifesto/
- Small Data SF: https://www.smalldatasf.com/



Explore the "Small Data" movement, a counter-narrative to the prevailing big data conference hype. This talk challenges the assumption that data scale is the most important feature of every workload, defining big data as any dataset too large for a single machine. We'll unpack why this distinction is crucial for modern data engineering and analytics, setting the stage for a new perspective on data architecture.

Delve into the history of big data systems, starting with the non-linear hardware costs that plagued early data practitioners. Discover how Google's foundational papers on GFS, MapReduce, and Bigtable led to the creation of Hadoop, fundamentally changing how we scale data processing. We'll break down the "big data tax"—the inherent latency and system complexity overhead required for distributed systems to function, a critical concept for anyone evaluating data platforms.

Learn about the architectural cornerstone of the modern cloud data warehouse: the separation of storage and compute. This design, popularized by systems like Snowflake and Google BigQuery, allows storage to scale almost infinitely while compute resources are provisioned on-demand. Understand how this model paved the way for massive data lakes but also introduced new complexities and cost considerations that are often overlooked.
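
A small way to see this split in miniature: in the hedged sketch below, DuckDB supplies the local, in-process compute while the data stays in object storage. The bucket and path are hypothetical placeholders, and S3 read credentials are assumed to be configured already.

```python
# Sketch of storage/compute separation: an ephemeral local engine queries
# Parquet files that live in object storage. The s3 path is hypothetical.
import duckdb

con = duckdb.connect()     # in-memory engine: the "compute" half
con.sql("INSTALL httpfs")  # extension for reading remote objects over HTTP/S3
con.sql("LOAD httpfs")

# Only the columns and row groups the query touches are fetched over the
# network; the full dataset never has to live on this machine.
top_users = con.sql("""
    SELECT user_id, count(*) AS events
    FROM read_parquet('s3://example-bucket/events/*.parquet')
    GROUP BY user_id
    ORDER BY events DESC
    LIMIT 10
""").fetchall()
print(top_users)
```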

We examine the cracks appearing in the big data paradigm, especially for OLAP workloads. While systems like Snowflake are still dominant, the rise of powerful alternatives like DuckDB signals a shift. We reveal the hidden costs of big data analytics, exemplified by a petabyte-scale query costing nearly $6,000, and argue that for most use cases, it's too expensive to run computations over massive datasets.
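
The price tag is easy to reproduce as back-of-envelope arithmetic. The per-terabyte rate below is an assumed, illustrative on-demand scan price in the $5-6/TB range, not any vendor's current list price.

```python
# Back-of-envelope: scan-priced warehouses bill per byte read, so cost scales
# with data scanned, not with query runtime. Illustrative rate, not a quote.
PRICE_PER_TB_USD = 5.75
PETABYTE_IN_TB = 1_000

full_scan = PETABYTE_IN_TB * PRICE_PER_TB_USD
print(f"Full 1 PB scan: ~${full_scan:,.0f}")  # ~$5,750 -- "nearly $6,000"

# The hot-data counterpoint: the same pricing over a 100 GB working set.
working_set_tb = 0.1
print(f"100 GB working set: ~${working_set_tb * PRICE_PER_TB_USD:.2f}")  # ~$0.58
```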

The key to efficient data processing isn't your total data size, but the size of your "hot data" or working set. This talk argues that the revenge of the single node is here, as modern hardware can often handle the actual data queried without the overhead of the big data tax. This is a crucial optimization technique for reducing cost and improving performance in any data warehouse.
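
As a hedged sketch of the working-set idea, the snippet below queries only the hot partitions of a hypothetical Hive-style date-partitioned Parquet layout (data/dt=2026-01-15/*.parquet) with DuckDB on a single machine; the layout and cutoff date are made up.

```python
# Sketch: "query only the hot data" on one node. With hive_partitioning, the
# WHERE clause on the partition key (dt) prunes cold partitions, so only the
# most recent week of files is ever opened.
import duckdb

recent = duckdb.sql("""
    SELECT dt, count(*) AS rows
    FROM read_parquet('data/*/*.parquet', hive_partitioning = true)
    WHERE dt >= '2026-01-09'
    GROUP BY dt
    ORDER BY dt
""")
print(recent)
```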

Discover the core principles for designing systems in a post-big data world. We'll show that since only 1 in 500 users run true big data queries, prioritizing simplicity over premature scaling is key. For low latency, process data close to the user with tools like DuckDB and SQLite. This local-first approach offers a compelling alternative to cloud-centric models, enabling faster, more cost-effective, and innovative data architectures.
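
To make the local-first point concrete, here is a minimal sketch using Python's standard-library sqlite3; the schema and rows are made up for illustration.

```python
# Local-first sketch: the store is a single file next to the user -- no
# server, no network round trip. sqlite3 ships with Python itself.
import sqlite3

con = sqlite3.connect("local_events.db")
con.execute("CREATE TABLE IF NOT EXISTS events (ts TEXT, action TEXT)")
con.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("2026-01-15T10:00:00", "open"), ("2026-01-15T10:02:11", "search")],
)
con.commit()

# Queries execute in-process, typically in well under a millisecond.
for action, n in con.execute("SELECT action, count(*) FROM events GROUP BY action"):
    print(action, n)
con.close()
```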