talk-data.com talk-data.com

Event

O'Reilly Data Engineering Books

2001-10-19 – 2027-05-25 Oreilly Visit website ↗

Activities tracked

299

Collection of O'Reilly books on Data Engineering.

Filtering by: Big Data ×

Sessions & talks

Showing 276–299 of 299 · Newest first

Search within this event →
Disruptive Possibilities: How Big Data Changes Everything

Big data has more disruptive potential than any information technology developed in the past 40 years. As author Jeffrey Needham points out in this revealing book, big data can provide unprecedented visibility into the operational efficiency of enterprises and agencies. Disruptive Possibilities provides an historically-informed overview through a wide range of topics, from the evolution of commodity supercomputing and the simplicity of big data technology, to the ways conventional clouds differ from Hadoop analytics clouds. This relentlessly innovative form of computing will soon become standard practice for organizations of any size attempting to derive insight from the tsunami of data engulfing them. Replacing legacy silos—whether they’re infrastructure, organizational, or vendor silos—with a platform-centric perspective is just one of the big stories of big data. To reap maximum value from the myriad forms of data, organizations and vendors will have to adopt highly collaborative habits and methodologies.

Big Data Imperatives: Enterprise 'Big Data' Warehouse, 'BI' Implementations and Analytics

Big Data Imperatives, focuses on resolving the key questions on everyone's mind: Which data matters? Do you have enough data volume to justify the usage? How you want to process this amount of data? How long do you really need to keep it active for your analysis, marketing, and BI applications? Big data is emerging from the realm of one-off projects to mainstream business adoption; however, the real value of big data is not in the overwhelming size of it, but more in its effective use. This book addresses the following big data characteristics: Very large, distributed aggregations of loosely structured data - often incomplete and inaccessible Petabytes/Exabytes of data Millions/billions of people providing/contributing to the context behind the data Flat schema's with few complex interrelationships Involves time-stamped events Made up of incomplete data Includes connections between data elements that must be probabilistically inferred Big Data Imperatives explains 'what big data can do'. It can batch process millions and billions of records both unstructured and structured much faster and cheaper. Big data analytics provide a platform to merge all analysis which enables data analysis to be more accurate, well-rounded, reliable and focused on a specific business capability. Big Data Imperatives describes the complementary nature of traditional data warehouses and big-data analytics platforms and how they feed each other. This book aims to bring the big data and analytics realms together with a greater focus on architectures that leverage the scale and power of big data and the ability to integrate and apply analytics principles to data which earlier was not accessible. This book can also be used as a handbook for practitioners; helping them on methodology,technical architecture, analytics techniques and best practices. At the same time, this book intends to hold the interest of those new to big data and analytics by giving them a deep insight into the realm of big data. What you'll learn Understanding the technology, implementation of big data platforms and their usage for analytics Big data architectures Big data design patterns Implementation best practices Who this book is for This book is designed for IT professionals, data warehousing, business intelligence professionals, data analysis professionals, architects, developers and business users.

Pro Hibernate and MongoDB

Hibernate and MongoDB are a powerful combination of open source persistence and NoSQL technologies for today's Java-based enterprise and cloud application developers. Hibernate is the leading open source Java-based persistence, object relational management engine, recently repositioned as an object grid management engine. MongoDB is a growing, popular open source NoSQL framework, especially popular among cloud application and big data developers. With these two, enterprise and cloud developers have a "complete out of the box" solution. Pro Hibernate and MongoDB shows you how to use and integrate Hibernate and MongoDB. More specifically, this book guides you through the bootstrap; building transactions; handling queries and query entities; and mappings. Then, this book explores the principles and techniques for taking these application principles to the cloud, using the OpenShift Platform as a Service (PaaS) and more. In this book, you get two case studies: An enterprise application using Hibernate and MongoDB. then, A cloud application (OpenShip) migrated from the enterprise application case study After reading or using this book, you come away with the experience from two case studies that give you possible frameworks or templates that you can apply to your own specific application or cloud application building context. What you'll learn How to use and integrate Hibernate and MongoDB to be your "complete out of the box" solution for database driven enterprise and cloud applications How to bootstrap; run in supported environments; do transactions; handle queries and query entities; and mappings How to build an enterprise application case study using Hibernate and MongoDB What are the principles and techniques for taking applications to the Cloud, using the OpenShift Platform as a Service (PaaS) and more How to build a cloud-based app or application (OpenShip) Who this book is for This book is for experienced Java, enterprise Java programmers who may have some experience with Hibernate and/or MongoDB.

Real-Time Big Data Analytics: Emerging Architecture

Five or six years ago, analysts working with big datasets made queries and got the results back overnight. The data world was revolutionized a few years ago when Hadoop and other tools made it possible to getthe results from queries in minutes. But the revolution continues. Analysts now demand sub-second, near real-time query results. Fortunately, we have the tools to deliver them. This report examines tools and technologies that are driving real-time big data analytics.

Using R to Unlock the Value of Big Data: Big Data Analytics with Oracle R Enterprise and Oracle R Connector for Hadoop

The Oracle Press Guide to Big Data Analytics using R Cowritten by members of the Big Data team at Oracle, this Oracle Press book focuses on analyzing data with R while making it scalable using Oracle’s R technologies. Using R to Unlock the Value of Big Data provides an introduction to open source R and describes issues with traditional R and database interaction. The book then offers in-depth coverage of Oracle’s strategic R offerings: Oracle R Enterprise, Oracle R Distribution, ROracle, and Oracle R Connector for Hadoop. You can practice your new skills using the end-of-chapter exercises.

Implementing IBM InfoSphere BigInsights on IBM System x

As world activities become more integrated, the rate of data growth has been increasing exponentially. And as a result of this data explosion, current data management methods can become inadequate. People are using the term big data (sometimes referred to as Big Data) to describe this latest industry trend. IBM® is preparing the next generation of technology to meet these data management challenges. To provide the capability of incorporating big data sources and analytics of these sources, IBM developed a stream-computing product that is based on the open source computing framework Apache Hadoop. Each product in the framework provides unique capabilities to the data management environment, and further enhances the value of your data warehouse investment. In this IBM Redbooks® publication, we describe the need for big data in an organization. We then introduce IBM InfoSphere® BigInsights™ and explain how it differs from standard Hadoop. BigInsights provides a packaged Hadoop distribution, a greatly simplified installation of Hadoop and corresponding open source tools for application development, data movement, and cluster management. BigInsights also brings more options for data security, and as a component of the IBM big data platform, it provides potential integration points with the other components of the platform. A new chapter has been added to this edition. Chapter 11 describes IBM Platform Symphony®, which is a new scheduling product that works with IBM Insights, bringing low-latency scheduling and multi-tenancy to IBM InfoSphere BigInsights. The book is designed for clients, consultants, and other technical professionals.

Principles of Big Data

Principles of Big Data helps readers avoid the common mistakes that endanger all Big Data projects. By stressing simple, fundamental concepts, this book teaches readers how to organize large volumes of complex data, and how to achieve data permanence when the content of the data is constantly changing. General methods for data verification and validation, as specifically applied to Big Data resources, are stressed throughout the book. The book demonstrates how adept analysts can find relationships among data objects held in disparate Big Data resources, when the data objects are endowed with semantic support (i.e., organized in classes of uniquely identified data objects). Readers will learn how their data can be integrated with data from other resources, and how the data extracted from Big Data resources can be used for purposes beyond those imagined by the data creators. Learn general methods for specifying Big Data in a way that is understandable to humans and to computers Avoid the pitfalls in Big Data design and analysis Understand how to create and use Big Data safely and responsibly with a set of laws, regulations and ethical standards that apply to the acquisition, distribution and integration of Big Data resources

Data Warehousing in the Age of Big Data

Data Warehousing in the Age of the Big Data will help you and your organization make the most of unstructured data with your existing data warehouse. As Big Data continues to revolutionize how we use data, it doesn't have to create more confusion. Expert author Krish Krishnan helps you make sense of how Big Data fits into the world of data warehousing in clear and concise detail. The book is presented in three distinct parts. Part 1 discusses Big Data, its technologies and use cases from early adopters. Part 2 addresses data warehousing, its shortcomings, and new architecture options, workloads, and integration techniques for Big Data and the data warehouse. Part 3 deals with data governance, data visualization, information life-cycle management, data scientists, and implementing a Big Data–ready data warehouse. Extensive appendixes include case studies from vendor implementations and a special segment on how we can build a healthcare information factory. Ultimately, this book will help you navigate through the complex layers of Big Data and data warehousing while providing you information on how to effectively think about using all these technologies and the architectures to design the next-generation data warehouse. Learn how to leverage Big Data by effectively integrating it into your data warehouse. Includes real-world examples and use cases that clearly demonstrate Hadoop, NoSQL, HBASE, Hive, and other Big Data technologies Understand how to optimize and tune your current data warehouse infrastructure and integrate newer infrastructure matching data processing workloads and requirements

Addressing Data Volume, Velocity, and Variety with IBM InfoSphere Streams V3.0

There are multiple uses for big data in every industry—from analyzing larger volumes of data than was previously possible to driving more precise answers, to analyzing data at rest and data in motion to capture opportunities that were previously lost. A big data platform will enable your organization to tackle complex problems that previously could not be solved using traditional infrastructure. As the amount of data available to enterprises and other organizations dramatically increases, more and more companies are looking to turn this data into actionable information and intelligence in real time. Addressing these requirements requires applications that are able to analyze potentially enormous volumes and varieties of continuous data streams to provide decision makers with critical information almost instantaneously. This IBM Redbooks® publication is written for decision-makers, consultants, IT architects, and IT professionals who will be implementing a solution with IBM InfoSphere Streams.

MongoDB Applied Design Patterns

Whether you’re building a social media site or an internal-use enterprise application, this hands-on guide shows you the connection between MongoDB and the business problems it’s designed to solve. You’ll learn how to apply MongoDB design patterns to several challenging domains, such as ecommerce, content management, and online gaming. Using Python and JavaScript code examples, you’ll discover how MongoDB lets you scale your data model while simplifying the development process. Many businesses launch NoSQL databases without understanding the techniques for using their features most effectively. This book demonstrates the benefits of document embedding, polymorphic schemas, and other MongoDB patterns for tackling specific big data use cases, including: Operational intelligence: Perform real-time analytics of business data Ecommerce: Use MongoDB as a product catalog master or inventory management system Content management: Learn methods for storing content nodes, binary assets, and discussions Online advertising networks: Apply techniques for frequency capping ad impressions, and keyword targeting and bidding Social networking: Learn how to store a complex social graph, modeled after Google+ Online gaming: Provide concurrent access to character and world data for a multiplayer role-playing game

Hadoop Beginner's Guide

Hadoop Beginner's Guide introduces you to the essential concepts and practical applications of Apache Hadoop, one of the leading frameworks for big data processing. You will learn how to set up and use Hadoop to store, manage, and analyze vast amounts of data efficiently. With clear examples and step-by-step instructions, this book is the perfect starting point for beginners. What this Book will help me do Understand the trends leading to the adoption of Hadoop and determine when to use it effectively in your projects. Build and configure Hadoop clusters tailored to your specific needs, enabling efficient data processing. Develop and execute applications on Hadoop using Java and Ruby, with practical examples provided. Leverage Amazon AWS and Elastic MapReduce to deploy Hadoop on the cloud and manage hosted environments. Integrate Hadoop with relational databases using tools like Hive and Sqoop for effective data transfer and querying. Author(s) The author of Hadoop Beginner's Guide is an experienced data engineer with a focus on big data technologies. They have extensive experience deploying Hadoop in various industries and are passionate about making complex systems accessible to newcomers. Their approach combines technical depth with an understanding of the needs of learners, ensuring clarity and relevance throughout the book. Who is it for? This book is designed for professionals who are new to big data processing and want to learn Apache Hadoop from scratch. It is ideal for system administrators, data analysts, and developers with basic programming knowledge in Java or Ruby looking to get started with Hadoop. If you have an interest in leveraging Hadoop for scalable data management and analytics, this book is for you. By the end, you'll gain the confidence and skills to utilize Hadoop effectively in your projects.

MapReduce Design Patterns

Until now, design patterns for the MapReduce framework have been scattered among various research papers, blogs, and books. This handy guide brings together a unique collection of valuable MapReduce patterns that will save you time and effort regardless of the domain, language, or development framework you’re using. Each pattern is explained in context, with pitfalls and caveats clearly identified to help you avoid common design mistakes when modeling your big data architecture. This book also provides a complete overview of MapReduce that explains its origins and implementations, and why design patterns are so important. All code examples are written for Hadoop. Summarization patterns: get a top-level view by summarizing and grouping data Filtering patterns: view data subsets such as records generated from one user Data organization patterns: reorganize data to work with other systems, or to make MapReduce analysis easier Join patterns: analyze different datasets together to discover interesting relationships Metapatterns: piece together several patterns to solve multi-stage problems, or to perform several analytics in the same job Input and output patterns: customize the way you use Hadoop to load or store data "A clear exposition of MapReduce programs for common data processing patterns—this book is indespensible for anyone using Hadoop." --Tom White, author of Hadoop: The Definitive Guide

HBase in Action

HBase in Action has all the knowledge you need to design, build, and run applications using HBase. First, it introduces you to the fundamentals of distributed systems and large scale data handling. Then, you'll explore real-world applications and code samples with just enough theory to understand the practical techniques. You'll see how to build applications with HBase and take advantage of the MapReduce processing framework. And along the way you'll learn patterns and best practices. About the Technology HBase is a NoSQL storage system designed for fast, random access to large volumes of data. It runs on commodity hardware and scales smoothly from modest datasets to billions of rows and millions of columns. About the Book HBase in Action is an experience-driven guide that shows you how to design, build, and run applications using HBase. First, it introduces you to the fundamentals of handling big data. Then, you'll explore HBase with the help of real applications and code samples and with just enough theory to back up the practical techniques. You'll take advantage of the MapReduce processing framework and benefit from seeing HBase best practices in action. What's Inside When and how to use HBase Practical examples Design patterns for scalable data systems Deployment, integration, and design About the Reader Written for developers and architects familiar with data storage and processing. No prior knowledge of HBase, Hadoop, or MapReduce is required. About the Authors Nick Dimiduk is a Data Architect with experience in social media analytics, digital marketing, and GIS. Amandeep Khurana is a Solutions Architect focused on building HBase-driven solutions. Quotes Timely, practical ... explains in plain language how to use HBase. - From the Foreword by Michael Stack, Chair of the Apache HBase Project Management Committee A difficult topic lucidly explained. - John Griffin, coauthor of "Hibernate Search in Action" Amusing tongue-in-cheek style that doesn’t detract from the substance. - Charles Pyle, APS Healthcare Learn how to think the HBase way. - Gianluca Righetto, Menttis

Hadoop in Practice

Hadoop in Practice collects 85 Hadoop examples and presents them in a problem/solution format. Each technique addresses a specific task you'll face, like querying big data using Pig or writing a log file loader. You'll explore each problem step by step, learning both how to build and deploy that specific solution along with the thinking that went into its design. As you work through the tasks, you'll find yourself growing more comfortable with Hadoop and at home in the world of big data. About the Technology Hadoop is an open source MapReduce platform designed to query and analyze data distributed across large clusters. Especially effective for big data systems, Hadoop powers mission-critical software at Apple, eBay, LinkedIn, Yahoo, and Facebook. It offers developers handy ways to store, manage, and analyze data. About the Book Hadoop in Practice collects 85 battle-tested examples and presents them in a problem/solution format. It balances conceptual foundations with practical recipes for key problem areas like data ingress and egress, serialization, and LZO compression. You'll explore each technique step by step, learning how to build a specific solution along with the thinking that went into it. As a bonus, the book's examples create a well-structured and understandable codebase you can tweak to meet your own needs. What's Inside Conceptual overview of Hadoop and MapReduce 85 practical, tested techniques Real problems, real solutions How to integrate MapReduce and R About the Reader This book assumes you've already started exploring Hadoop and want concrete advice on how to use it in production. About the Author Alex Holmes is a senior software engineer with extensive expertise in solving big data problems using Hadoop. He has presented at JavaOne and Jazoon and is a technical lead at VeriSign. Quotes Interesting topics that tickle the creative brain. - Mark Kemna, Brillig Ties together the Hadoop ecosystem technologies. - Ayon Sinha, Britely Comprehensive … high-quality code samples. - Chris Nauroth, The Walt Disney Company Covers all of the variants of Hadoop, not just the Apache distribution. - Ted Dunning, MapR Technologies Charts a path to the future. - Alexey Gayduk, Grid Dynamics

Getting Started with Storm

Even as big data is turning the world upside down, the next phase of the revolution is already taking shape: real-time data analysis. This hands-on guide introduces you to Storm, a distributed, JVM-based system for processing streaming data. Through simple tutorials, sample Java code, and a complete real-world scenario, you’ll learn how to build fast, fault-tolerant solutions that process results as soon as the data arrives. Discover how easy it is to set up Storm clusters for solving various problems, including continuous data computation, distributed remote procedure calls, and data stream processing. Learn how to program Storm components: spouts for data input and bolts for data transformation Discover how data is exchanged between spouts and bolts in a Storm topology Make spouts fault-tolerant with several commonly used design strategies Explore bolts—their life cycle, strategies for design, and ways to implement them Scale your solution by defining each component’s level of parallelism Study a real-time web analytics system built with Node.js, a Redis server, and a Storm topology Write spouts and bolts with non-JVM languages such as Python, Ruby, and Javascript

Information Storage and Management: Storing, Managing, and Protecting Digital Information in Classic, Virtualized, and Cloud Environments, Second Edition

The new edition of a bestseller, now revised and update throughout! This new edition of the unparalleled bestseller serves as a full training course all in one and as the world's largest data storage company, EMC is the ideal author for such a critical resource. They cover the components of a storage system and the different storage system models while also offering essential new material that explores the advances in existing technologies and the emergence of the "Cloud" as well as updates and vital information on new technologies. Features a separate section on emerging area of cloud computing Covers new technologies such as: data de-duplication, unified storage, continuous data protection technology, virtual provisioning, FCoE, flash drives, storage tiering, big data, and more Details storage models such as Network Attached Storage (NAS), Storage Area Network (SAN), Object Based Storage along with virtualization at various infrastructure components Explores Business Continuity and Security in physical and virtualized environment Includes an enhanced Appendix for additional information This authoritative guide is essential for getting up to speed on the newest advances in information storage and management.

Seven Databases in Seven Weeks

Data is getting bigger and more complex by the day, and so are the choices in handling that data. As a modern application developer you need to understand the emerging field of data management, both RDBMS and NoSQL. Seven Databases in Seven Weeks takes you on a tour of some of the hottest open source databases today. In the tradition of Bruce A. Tate's Seven Languages in Seven Weeks, this book goes beyond your basic tutorial to explore the essential concepts at the core each technology. Redis, Neo4J, CouchDB, MongoDB, HBase, Riak and Postgres. With each database, you'll tackle a real-world data problem that highlights the concepts and features that make it shine. You'll explore the five data models employed by these databases-relational, key/value, columnar, document and graph-and which kinds of problems are best suited to each. You'll learn how MongoDB and CouchDB are strikingly different, and discover the Dynamo heritage at the heart of Riak. Make your applications faster with Redis and more connected with Neo4J. Use MapReduce to solve Big Data problems. Build clusters of servers using scalable services like Amazon's Elastic Compute Cloud (EC2). Discover the CAP theorem and its implications for your distributed data. Understand the tradeoffs between consistency and availability, and when you can use them to your advantage. Use multiple databases in concert to create a platform that's more than the sum of its parts, or find one that meets all your needs at once. Seven Databases in Seven Weeks will take you on a deep dive into each of the databases, their strengths and weaknesses, and how to choose the ones that fit your needs. What You Need: To get the most of of this book you'll have to follow along, and that means you'll need a *nix shell (Mac OSX or Linux preferred, Windows users will need Cygwin), and Java 6 (or greater) and Ruby 1.8.7 (or greater). Each chapter will list the downloads required for that database.

Planning for Big Data

In an age where everything is measurable, understanding big data is an essential. From creating new data-driven products through to increasing operational efficiency, big data has the potential to make your organization both more competitive and more innovative. As this emerging field transitions from the bleeding edge to enterprise infrastructure, it's vital to understand not only the technologies involved, but the organizational and cultural demands of being data-driven. Written by O'Reilly Radar's experts on big data, this anthology describes: The broad industry changes heralded by the big data era What big data is, what it means to your business, and how to start solving data problems The software that makes up the Hadoop big data stack, and the major enterprise vendors' Hadoop solutions The landscape of NoSQL databases and their relative merits How visualization plays an important part in data work

MongoDB in Action

NEWER EDITION AVAILABLE MongoDB in Action, Second Edition is now available. An eBook of this older edition is included at no additional cost when you buy the revised edition! A limited number of pBook copies of this edition are still available. Please contact Manning Support to inquire about purchasing previous edition copies. MongoDB in Action is a comprehensive guide to MongoDB for application developers. The book begins by explaining what makes MongoDB unique and describing its ideal use cases. A series of tutorials designed for MongoDB mastery then leads into detailed examples for leveraging MongoDB in e-commerce, social networking, analytics, and other common applications. About the Technology Big data can mean big headaches. MongoDB is a document-oriented database designed to be flexible, scalable, and very fast, even with big data loads. It's built for high availability, supports rich, dynamic schemas, and lets you easily distribute data across multiple servers. About the Book MongoDB in Action introduces you to MongoDB and the document-oriented database model. This perfectly paced book provides both the big picture you'll need as a developer and enough low-level detail to satisfy a system engineer. Numerous examples will help you develop confidence in the crucial area of data modeling. You'll also love the deep explanations of each feature, including replication, auto-sharding, and deployment. What's Inside Indexes, queries, and standard DB operations Map-reduce for custom aggregations and reporting Schema design patterns Deploying for scale and high availability About the Reader Written for developers. No MongoDB or NoSQL experience required. About the Author Kyle Banker is a software engineer at 10gen where he maintains the official MongoDB drivers for Ruby and C. Quotes Awesome! MongoDB in a nutshell. - Hardy Ferentschik, Red Hat Excellent. Many practical examples. - Curtis Miller, Flatterline Not only the how, but also the why. - Philip Hallstrom, PJKH, LLC Has a developer-centric flavor--an excellent reference. - Rick Wagner, Red Hat A must-read. - Daniel Bretoi, Advanced Energy

IBM InfoSphere Streams: Assembling Continuous Insight in the Information Revolution

In this IBM® Redbooks® publication, we discuss and describe the positioning, functions, capabilities, and advanced programming techniques for IBM InfoSphere™ Streams (V2), a new paradigm and key component of IBM Big Data platform. Data has traditionally been stored in files or databases, and then analyzed by queries and applications. With stream computing, analysis is performed moment by moment as the data is in motion. In fact, the data might never be stored (perhaps only the analytic results). The ability to analyze data in motion is called real-time analytic processing (RTAP). IBM InfoSphere Streams takes a fundamentally different approach to Big Data analytics and differentiates itself with its distributed runtime platform, programming model, and tools for developing and debugging analytic applications that have a high volume and variety of data types. Using in-memory techniques and analyzing record by record enables high velocity. Volume, variety and velocity are the key attributes of Big Data. The data streams that are consumable by IBM InfoSphere Streams can originate from sensors, cameras, news feeds, stock tickers, and a variety of other sources, including traditional databases. It provides an execution platform and services for applications that ingest, filter, analyze, and correlate potentially massive volumes of continuous data streams. This book is intended for professionals that require an understanding of how to process high volumes of streaming data or need information about how to implement systems to satisfy those requirements. See: http://www.redbooks.ibm.com/abstracts/sg247865.html for the IBM InfoSphere Streams (V1) release.

Privacy and Big Data

Much of what constitutes Big Data is information about us. Through our online activities, we leave an easy-to-follow trail of digital footprints that reveal who we are, what we buy, where we go, and much more. This eye-opening book explores the raging privacy debate over the use of personal data, with one undeniable conclusion: once data's been collected, we have absolutely no control over who uses it or how it is used. Personal data is the hottest commodity on the market today—truly more valuable than gold. We are the asset that every company, industry, non-profit, and government wants. Privacy and Big Data introduces you to the players in the personal data game, and explains the stark differences in how the U.S., Europe, and the rest of the world approach the privacy issue. You'll learn about: Collectors: social networking titans that collect, share, and sell user data Users: marketing organizations, government agencies, and many others Data markets: companies that aggregate and sell datasets to anyone Regulators: governments with one policy for commercial data use, and another for providing security

Big Data Glossary

To help you navigate the large number of new data tools available, this guide describes 60 of the most recent innovations, from NoSQL databases and MapReduce approaches to machine learning and visualization tools. Descriptions are based on first-hand experience with these tools in a production environment. This handy glossary also includes a chapter of key terms that help define many of these tool categories: NoSQL Databases—Document-oriented databases using a key/value interface rather than SQL MapReduce—Tools that support distributed computing on large datasets Storage—Technologies for storing data in a distributed way Servers—Ways to rent computing power on remote machines Processing—Tools for extracting valuable information from large datasets Natural Language Processing—Methods for extracting information from human-created text Machine Learning—Tools that automatically perform data analyses, based on results of a one-off analysis Visualization—Applications that present meaningful data graphically Acquisition—Techniques for cleaning up messy public data sources Serialization—Methods to convert data structure or object state into a storable format

Hadoop in Action

Hadoop in Action introduces the subject and teaches you how to write programs in the MapReduce style. It starts with a few easy examples and then moves quickly to show Hadoop use in more complex data analysis tasks. Included are best practices and design patterns of MapReduce programming. About the Technology Big data can be difficult to handle using traditional databases. Apache Hadoop is a NoSQL applications framework that runs on distributed clusters. This lets it scale to huge datasets. If you need analytic information from your data, Hadoop's the way to go. About the Book What's Inside Introduction to MapReduce Examples illustrating ideas in practice Hadoop's Streaming API Other related tools, like Pig and Hive About the Reader This book requires basic Java skills. Knowing basic statistical concepts can help with the more advanced examples. About the Author Chuck Lam is a Senior Engineer at RockYou! He has a PhD in pattern recognition from Stanford University. Quotes A guide for beginners, a source of insight for advanced users. - Philipp K. Janert, Principal Value, LLC A nice mix of the what, why, and how of Hadoop. - Paul Stusiak, Falcon Technologies Corp. Demystifies Hadoop. A great resource! - Rick Wagner, Acxiom Corp. Covers it all! Plus, gives you sweet extras no one else does. - John S. Griffin, Overstock.com An excellent introduction to Hadoop and MapReduce. - Kenneth DeLong, BabyCenter, LLC

Release 2.0: Issue 11

Big Data: when the size and performance requirements for data management become significant design and decision factors for implementing a data management and analysis system. For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration.