talk-data.com

Topic: Spark (Apache Spark)

Tags: big_data, distributed_computing, analytics

581 tagged activities

Activity Trend: peak of 71 activities per quarter, 2020-Q1 to 2026-Q1

Activities

581 activities · Newest first

Hadoop Real-World Solutions Cookbook - Second Edition

Master the full potential of big data processing using Hadoop with this comprehensive guide. Featuring over 90 practical recipes, this book helps you streamline data workflows and implement machine learning models with tools like Spark, Hive, and Pig. By the end, you'll confidently handle complex data problems and optimize big data solutions effectively.

What this Book will help me do
Install and manage a Hadoop 2.x cluster efficiently to suit your data processing needs.
Explore and utilize advanced tools like Hive, Pig, and Flume for seamless big data analysis.
Master data import/export processes with Sqoop and workflow automation using Oozie.
Implement machine learning and analytics tasks using Mahout and Apache Spark.
Store and process data flexibly across formats like Parquet, ORC, RC, and more.

Author(s)
Deshpande is an expert in big data processing and analytics with years of hands-on experience implementing Hadoop-based solutions for real-world problems. Known for a clear and pragmatic writing style, the author brings actionable wisdom and best practices to the forefront, helping readers excel in managing and utilizing big data systems.

Who is it for?
Designed for technical enthusiasts and professionals, this book is ideal for those familiar with basic big data concepts. If you are looking to expand your expertise in Hadoop's ecosystem and implement data-driven solutions, this book will guide you through essential skills and advanced techniques to efficiently manage complex big data projects.

Spark

Production-targeted Spark guidance with real-world use cases

Spark: Big Data Cluster Computing in Production goes beyond general Spark overviews to provide targeted guidance toward using lightning-fast big-data clustering in production. Written by an expert team well-known in the big data community, this book walks you through the challenges in moving from proof-of-concept or demo Spark applications to live Spark in production. Real use cases provide deep insight into common problems, limitations, challenges, and opportunities, while expert tips and tricks help you get the most out of Spark performance. Coverage includes Spark SQL, Tachyon, Kerberos, MLlib, YARN, and Mesos, with clear, actionable guidance on resource scheduling, database connectors, streaming, security, and much more.

Spark has become the tool of choice for many big data problems, with more active contributors than any other Apache Software Foundation project. General introductory books abound, but this book is the first to provide deep insight and real-world advice on using Spark in production. Specific guidance, expert tips, and invaluable foresight make this guide an incredibly useful resource for real production settings.

Review Spark hardware requirements and estimate cluster size
Gain insight from real-world production use cases
Tighten security, schedule resources, and fine-tune performance
Overcome common problems encountered using Spark in production

Spark works with other big data tools including MapReduce and Hadoop, and uses languages you already know like Java, Scala, Python, and R. Lightning speed makes Spark too good to pass up, but understanding limitations and challenges in advance goes a long way toward easing actual production implementation. Spark: Big Data Cluster Computing in Production tells you everything you need to know, with real-world production insight and expert guidance, tips, and tricks.
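The resource scheduling the book covers begins with submission-time configuration. As a minimal illustrative sketch (not taken from the book; the executor counts and sizes are placeholder values, not tuning advice), here is how a PySpark application might pin down its resource footprint on a YARN cluster:

```python
# A minimal sketch of sizing a Spark application for YARN.
# All numbers are illustrative placeholders, not recommendations.
from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("production-etl")
    .setMaster("yarn")                       # submit to a YARN cluster
    .set("spark.executor.instances", "10")   # fixed executor count
    .set("spark.executor.cores", "4")        # cores per executor
    .set("spark.executor.memory", "8g")      # heap per executor
    .set("spark.dynamicAllocation.enabled", "false")
)

sc = SparkContext(conf=conf)
print(sc.applicationId)  # handy for finding the job in the YARN UI
sc.stop()
```

The same properties can be passed on the spark-submit command line instead of being hard-coded, which is usually preferable in production.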

Real-Time Big Data Analytics

This book delves into the techniques and tools essential for designing, processing, and analyzing complex datasets in real time using advanced frameworks like Apache Spark, Storm, and Amazon Kinesis. By engaging with this thorough guide, you'll build proficiency in creating robust, efficient, and scalable real-time data processing architectures tailored to real-world scenarios.

What this Book will help me do
Learn the fundamentals of real-time data processing and how it differs from batch processing (see the sketch after this blurb).
Gain hands-on experience with Apache Storm for creating robust data-driven solutions.
Develop real-world applications using Amazon Kinesis for cloud-based analytics.
Perform complex data queries and transformations with Spark SQL and understand Spark RDDs.
Master the Lambda Architecture to combine batch and real-time analytics effectively.

Author(s)
Shilpi Saxena is a renowned expert in big data technologies, with extensive experience in real-time data analytics. With a career spanning years in the industry, Shilpi has provided innovative solutions for big data challenges in top-tier organizations. Her teaching approach emphasizes practical applicability, making her writings accessible and impactful for developers and architects alike.

Who is it for?
This book is for software professionals such as Big Data architects, developers, or programmers looking to enhance their skills in real-time big data analytics. If you are familiar with basic programming principles and seek to build solutions for processing large data streams in real-time environments, this book caters to your needs. It is also suitable for those seeking to familiarize themselves with state-of-the-art tools like Spark SQL, Apache Storm, and Amazon Kinesis. Whether you're extending current expertise or transitioning into this field, this resource helps you achieve your objectives.
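To make the batch-versus-streaming distinction concrete, here is a minimal sketch of a micro-batch word count using Spark Streaming's DStream API, the API current when this book was written; the host and port are placeholders for any line-oriented socket source:

```python
# A minimal Spark Streaming sketch: count words arriving on a socket
# in 10-second micro-batches. Host and port are placeholders.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-wordcount")
ssc = StreamingContext(sc, batchDuration=10)   # 10-second batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (
    lines.flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
)
counts.pprint()   # print each batch's counts to stdout

ssc.start()
ssc.awaitTermination()
```

Unlike a batch job, the computation here is defined once and then applied repeatedly to every new batch of data as it arrives.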

Fast Data Front Ends for Hadoop

Organizations striving to build applications for streaming data have a new possibility to ponder: the use of ingestion engines at the front end of their Hadoop systems. With this O’Reilly report, you’ll learn how these fast data front ends process data before it reaches the Hadoop Distributed File System (HDFS), and provide intelligence and context in real time. This helps you reduce response times from hours to minutes, or even minutes to seconds. Author and independent consultant Akmal Chaudhri looks at several popular ingestion engines, including Apache Spark, Apache Storm, and the VoltDB in-memory database. Among them, VoltDB stands out by providing full Atomicity, Consistency, Isolation, and Durability (ACID) support. VoltDB also lets you build a fast data front end that uses the familiar SQL language and standards. Learn the advantages of ingestion engines as well as the theoretical and practical problems that can come up in an implementation. You’ll discover how this option can handle streaming data, provide state, ensure durability, and support transactions and real-time decisions. Akmal B. Chaudhri is an independent consultant, specializing in big data, NoSQL, and NewSQL database technologies. He has previously held roles as a developer, consultant, product strategist, and technical trainer with several blue-chip companies and big data startups. Akmal regularly presents at international conferences and serves on program committees for several major conferences and workshops.

For years, the advertising industry has relied on so-called creative campaigns to boost GRPs and attribute marketing program effectiveness to end-of-funnel sales. Digital, and more specifically analytics, has brought about promises of transparency through numbers while remaining confined to the realm of measurability. Actors, battling for budgets, are all trying to technologically trace back and attribute the spark that made that very purchase happen: call it attribution, or direct, last-click, first-click, whatever... conversion. After years of experience in the digital sector, René has joined Neo@Ogilvy, Ogilvy & Mather’s global media agency and performance network, where he’s building an Analytics team from scratch. René will share what he’s building, moving beyond traditional site-centric digital analytics. His challenges encompass data integration: bringing together CRM data to fuel campaigns, and measuring the impact of the online channel on offline sales. It’s about helping clients transform the way they use technology and transform their business.

Scalable Big Data Architecture: A Practitioner’s Guide to Choosing Relevant Big Data Architecture

This book highlights the different types of data architecture and illustrates the many possibilities hidden behind the term "Big Data", from the usage of NoSQL databases to the deployment of stream analytics architecture, machine learning, and governance. Scalable Big Data Architecture covers real-world, concrete industry use cases that leverage complex distributed applications, which involve web applications, RESTful APIs, and a high throughput of large amounts of data stored in highly scalable NoSQL data stores such as Couchbase and Elasticsearch. This book demonstrates how data processing can be done at scale, from the usage of NoSQL datastores to the combination of Big Data distributions. When the data processing is too complex and involves different processing topologies like long-running jobs, stream processing, multiple data source correlation, and machine learning, it’s often necessary to delegate the load to Hadoop or Spark and use NoSQL to serve processed data in real time. This book shows you how to choose a relevant combination of big data technologies available within the Hadoop ecosystem. It focuses on processing long jobs, architecture, stream data patterns, log analysis, and real-time analytics. Every pattern is illustrated with practical examples, which use different open source projects such as Logstash, Spark, Kafka, and so on. Traditional data infrastructures are built for digesting and rendering data synthesis and analytics from large amounts of data. This book helps you understand why you should consider using machine learning algorithms early on in the project, before being overwhelmed by the constraints imposed by dealing with the high throughput of Big Data. Scalable Big Data Architecture is for developers, data architects, and data scientists looking for a better understanding of how to choose the most relevant pattern for a Big Data project and which tools to integrate into that pattern.

Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing

This book is a step-by-step guide for learning how to use Spark for different types of big-data analytics projects, including batch, interactive, graph, and stream data analysis as well as machine learning. It covers Spark Core and its add-on libraries, including Spark SQL, Spark Streaming, GraphX, MLlib, and Spark ML.

Big Data Analytics with Spark shows you how to use Spark and leverage its easy-to-use features to increase your productivity. You learn to perform fast data analysis using its in-memory caching and advanced execution engine, employ in-memory computing capabilities for building high-performance machine learning and low-latency interactive analytics applications, and much more. Moreover, the book shows you how to use Spark as a single integrated platform for a variety of data processing tasks, including ETL pipelines, BI, live data stream processing, graph analytics, and machine learning.

The book also includes a chapter on Scala, the hottest functional programming language, and the language that underlies Spark. You’ll learn the basics of functional programming in Scala, so that you can write Spark applications in it. What's more, Big Data Analytics with Spark provides an introduction to other big data technologies that are commonly used along with Spark, such as HDFS, Avro, Parquet, Kafka, Cassandra, HBase, Mesos, and so on. It also provides an introduction to machine learning and graph concepts. So the book is self-sufficient; all the technologies that you need to know to use Spark are covered. The only thing that you are expected to have is some programming knowledge in any language.
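The in-memory caching highlighted above is a one-line change in practice. A minimal sketch follows (the log file path and filter strings are made up for illustration):

```python
# A minimal sketch of Spark's in-memory caching: persist an RDD once,
# then reuse it across several actions without recomputation.
from pyspark import SparkContext

sc = SparkContext("local[*]", "caching-demo")

lines = sc.textFile("events.log")                       # placeholder path
errors = lines.filter(lambda l: "ERROR" in l).cache()   # keep in memory

print(errors.count())                                    # first action fills the cache
print(errors.filter(lambda l: "timeout" in l).count())   # served from memory
sc.stop()
```

The second action reuses the cached partitions instead of re-reading and re-filtering the input, which is where the low-latency interactive workloads described above get their speed.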

Apache Oozie Essentials

Apache Oozie Essentials serves as your guide to mastering Apache Oozie, a powerful workflow scheduler for Hadoop environments. Through lucid explanations and practical examples, you will learn how to create, schedule, and enhance workflows for data ingestion, processing, and machine learning tasks using Oozie.

What this Book will help me do
Install and configure Apache Oozie in your Hadoop environment to start managing workflows.
Develop seamless workflows that integrate tools like Hive, Pig, and Sqoop to automate data operations.
Set up coordinators to handle timed and dependent job executions efficiently.
Deploy Spark jobs within your workflows for machine learning on large datasets.
Harness Oozie security features to improve your system's reliability and trustworthiness.

Author(s)
The author, Singh, is a seasoned developer with a deep understanding of big data processing and Apache Oozie. Drawing on that practical experience, the book intersperses technical detail with real-world examples for an effective learning experience, with the goal of making Oozie accessible and useful to professionals.

Who is it for?
This book is ideal for data engineers and Hadoop professionals looking to streamline their workflow management using Apache Oozie. Whether you're a novice to Oozie or aiming to implement complex data and ML pipelines, the book offers comprehensive guidance tailored to your needs.

Data Munging with Hadoop

The Example-Rich, Hands-On Guide to Data Munging with Apache Hadoop™

Data scientists spend much of their time “munging” data: handling day-to-day tasks such as data cleansing, normalization, aggregation, sampling, and transformation. These tasks are both critical and surprisingly interesting. Most important, they deepen your understanding of your data’s structure and limitations: crucial insight for improving accuracy and mitigating risk in any analytical project. Now, two leading Hortonworks data scientists, Ofer Mendelevitch and Casey Stella, bring together powerful, practical insights for effective Hadoop-based data munging of large datasets. Drawing on extensive experience with advanced analytics, the authors offer realistic examples that address the common issues you’re most likely to face. They describe each task in detail, presenting example code based on widely used tools such as Pig, Hive, and Spark. This concise, hands-on eBook is valuable for every data scientist, data engineer, and architect who wants to master data munging: not just in theory, but in practice with the field’s #1 platform, Hadoop.

Coverage includes
A framework for understanding the various types of data quality checks, including cell-based rules, distribution validation, and outlier analysis
Assessing tradeoffs in common approaches to imputing missing values (a sketch follows this blurb)
Implementing quality checks with Pig or Hive UDFs
Transforming raw data into “feature matrix” format for machine learning algorithms
Choosing features and instances
Implementing text features via “bag-of-words” and NLP techniques
Handling time-series data via frequency- or time-domain methods
Manipulating feature values to prepare for modeling

Data Munging with Hadoop is part of a larger, forthcoming work entitled Data Science Using Hadoop. To be notified when the larger work is available, register your purchase of Data Munging with Hadoop at informit.com/register and check the box “I would like to hear from InformIT and its family of brands about products and special offers.”
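As a taste of the missing-value tradeoffs listed above, here is a minimal mean-imputation sketch in PySpark (the book's own examples use Pig, Hive, and Spark; the toy records below are invented):

```python
# A minimal sketch of mean imputation: replace missing (None) values
# with the mean of the observed values. Data is invented for illustration.
from pyspark import SparkContext

sc = SparkContext("local[*]", "impute-demo")

records = sc.parallelize([(1, 4.0), (2, None), (3, 6.0), (4, None)])

observed = records.filter(lambda kv: kv[1] is not None).map(lambda kv: kv[1])
mean = observed.sum() / observed.count()

imputed = records.map(
    lambda kv: (kv[0], kv[1] if kv[1] is not None else mean)
)
print(imputed.collect())   # [(1, 4.0), (2, 5.0), (3, 6.0), (4, 5.0)]
sc.stop()
```

Mean imputation is only one of the strategies the authors weigh: it preserves the column mean but shrinks its variance, which is exactly the kind of tradeoff the book asks you to assess.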

Learning Bayesian Models with R

Dive into the world of Bayesian machine learning with "Learning Bayesian Models with R." This comprehensive guide introduces the foundations of probability theory and Bayesian inference, teaches you how to implement these concepts with the R programming language, and progresses to practical techniques for supervised and unsupervised problems in data science.

What this Book will help me do
Understand and set up an R environment for Bayesian modeling.
Build Bayesian models including linear regression and classification for predictive analysis.
Learn to apply Bayesian inference to real-world machine learning problems.
Work with big data and high-performance computation frameworks like Hadoop and Spark.
Master advanced Bayesian techniques and apply them to deep learning and AI challenges.

Author(s)
Hari Manassery Koduvely is a proficient data scientist with extensive experience in leveraging Bayesian frameworks for real-world applications. His passion for Bayesian machine learning is evident in his approachable and detailed teaching methodology, aimed at making these complex topics accessible for practitioners.

Who is it for?
This book is best suited for data scientists, analysts, and statisticians familiar with R and basic probability theory who aim to enhance their expertise in Bayesian approaches. It's ideal for professionals tackling machine learning challenges in applied data contexts. If you're looking to incorporate advanced probabilistic methods into your projects, this guide will show you how.

Hadoop 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem

Get Started Fast with Apache Hadoop® 2, YARN, and Today’s Hadoop Ecosystem

With Hadoop 2.x and YARN, Hadoop moves beyond MapReduce to become practical for virtually any type of data processing. Hadoop 2.x and the Data Lake concept represent a radical shift away from conventional approaches to data usage and storage. Hadoop 2.x installations offer unmatched scalability and breakthrough extensibility that supports new and existing Big Data analytics processing methods and models.

Hadoop® 2 Quick-Start Guide is the first easy, accessible guide to Apache Hadoop 2.x, YARN, and the modern Hadoop ecosystem. Building on his unsurpassed experience teaching Hadoop and Big Data, author Douglas Eadline covers all the basics you need to know to install and use Hadoop 2 on personal computers or servers, and to navigate the powerful technologies that complement it. Eadline concisely introduces and explains every key Hadoop 2 concept, tool, and service, illustrating each with a simple “beginning-to-end” example and identifying trustworthy, up-to-date resources for learning more. This guide is ideal if you want to learn about Hadoop 2 without getting mired in technical details. Douglas Eadline will bring you up to speed quickly, whether you’re a user, admin, devops specialist, programmer, architect, analyst, or data scientist.

Coverage Includes
Understanding what Hadoop 2 and YARN do, and how they improve on Hadoop 1 with MapReduce
Understanding Hadoop-based Data Lakes versus RDBMS Data Warehouses
Installing Hadoop 2 and core services on Linux machines, virtualized sandboxes, or clusters
Exploring the Hadoop Distributed File System (HDFS)
Understanding the essentials of MapReduce and YARN application programming
Simplifying programming and data movement with Apache Pig, Hive, Sqoop, Flume, Oozie, and HBase
Observing application progress, controlling jobs, and managing workflows
Managing Hadoop efficiently with Apache Ambari, including recipes for the HDFS-to-NFSv3 gateway, HDFS snapshots, and YARN configuration
Learning basic Hadoop 2 troubleshooting, and installing Apache Hue and Apache Spark

Sams Teach Yourself: Big Data Analytics with Microsoft HDInsight in 24 Hours

In just 24 lessons of one hour or less, Sams Teach Yourself Big Data Analytics with Microsoft HDInsight in 24 Hours helps you leverage Hadoop’s power on a flexible, scalable cloud platform using Microsoft’s newest business intelligence, visualization, and productivity tools. This book’s straightforward, step-by-step approach shows you how to provision, configure, monitor, and troubleshoot HDInsight and use Hadoop cloud services to solve real analytics problems. You’ll gain more of Hadoop’s benefits, with less complexity, even if you’re completely new to Big Data analytics. Every lesson builds on what you’ve already learned, giving you a rock-solid foundation for real-world success.

Practical, hands-on examples show you how to apply what you learn
Quizzes and exercises help you test your knowledge and stretch your skills
Notes and tips point out shortcuts and solutions

Learn how to…
Master core Big Data and NoSQL concepts, value propositions, and use cases
Work with key Hadoop features, such as HDFS2 and YARN
Quickly install, configure, and monitor Hadoop (HDInsight) clusters in the cloud
Automate provisioning, customize clusters, install additional Hadoop projects, and administer clusters
Integrate, analyze, and report with Microsoft BI and Power BI
Automate workflows for data transformation, integration, and other tasks
Use Apache HBase on HDInsight
Use Sqoop or SSIS to move data to or from HDInsight
Perform R-based statistical computing on HDInsight datasets
Accelerate analytics with Apache Spark
Run real-time analytics on high-velocity data streams
Write MapReduce, Hive, and Pig programs

Register your book at informit.com/register for convenient access to downloads, updates, and corrections as they become available.

Hadoop with Python

Hadoop is mostly written in Java, but that doesn't exclude the use of other programming languages with this distributed storage and processing framework, particularly Python. With this concise book, you’ll learn how to use Python with the Hadoop Distributed File System (HDFS), MapReduce, the Apache Pig platform and Pig Latin script, and the Apache Spark cluster-computing framework. Authors Zachary Radtka and Donald Miner from the data science firm Miner & Kasch take you through the basic concepts behind Hadoop, MapReduce, Pig, and Spark. Then, through multiple examples and use cases, you'll learn how to work with these technologies by applying various Python tools.

Use the Python library Snakebite to access HDFS programmatically from within Python applications
Write MapReduce jobs in Python with mrjob, the Python MapReduce library (sketched below)
Extend Pig Latin with user-defined functions (UDFs) in Python
Use the Spark Python API (PySpark) to write Spark programs with Python
Learn how to use the Luigi Python workflow scheduler to manage MapReduce jobs and Pig scripts

Zachary Radtka, a platform engineer at Miner & Kasch, has extensive experience creating custom analytics that run on petabyte-scale data sets.
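For a feel of the mrjob style mentioned in the list above, here is the canonical word-count job; it is a generic example, not code from the book, and can be run locally with `python wordcount.py input.txt`:

```python
# A minimal mrjob word count: the mapper emits (word, 1) pairs and the
# reducer sums the counts for each word.
from mrjob.job import MRJob


class MRWordCount(MRJob):
    def mapper(self, _, line):
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        yield word, sum(counts)


if __name__ == "__main__":
    MRWordCount.run()
```

The same script runs unchanged on a local machine or, with the `-r hadoop` runner flag, on a Hadoop cluster, which is the portability that makes mrjob attractive.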

Apache Spark Graph Processing

Dive into the world of large-scale graph data processing with Apache Spark's GraphX API. This book introduces you to the core concepts of graph analytics and teaches you how to leverage Spark for handling and analyzing massive graphs. From building to analyzing, you'll acquire a comprehensive skillset for working with graph data efficiently.

What this Book will help me do
Learn to utilize the Apache Spark GraphX API to process and analyze graph data.
Master transforming raw datasets into sophisticated graph structures.
Explore visualization and analysis techniques for understanding graphs.
Understand and build custom graph operations tailored to your needs.
Implement advanced graph algorithms like clustering and iterative processing.

Author(s)
Rindra Ramamonjison is a seasoned data engineer with vast experience in big data technologies and graph processing. With a passion for explaining complex concepts in simple terms, Rindra builds on his professional expertise to guide readers in mastering cutting-edge Spark tools.

Who is it for?
This book is tailored for data scientists and software developers looking to delve into graph data processing at scale. Ideal for those with basic knowledge of Scala and Apache Spark, it equips readers with the tools and techniques to derive insights from complex network datasets. Whether you're diving deeper into big data or exploring graph-specific analytics, this book is your guide.

Learning YARN

"Learning YARN" is your comprehensive guide to master YARN, the resource management layer in the Hadoop ecosystem. Through the book, you'll leverage YARN's capabilities for big data processing, learning to deploy, manage, and scale Hadoop-YARN clusters. What this Book will help me do Understand the main features and benefits of the YARN framework. Gain experience managing Hadoop clusters of varying sizes. Learn to integrate YARN with domain-specific big data tools like Spark. Become skilled at administration and configuration of YARN. Develop and run your own YARN-based applications for distributed computing. Author(s) Akhil Arora and Shrey Mehrotra bring with them years of experience working in big data frameworks and technologies. With expertise in YARN specifically, they aim to bridge the gap for developers and administrators to learn and implement scalable big data solutions. Their extensive knowledge in cluster management and distributed data processing shines through in how this book is structured and detailed. Who is it for? This book is ideal for software developers, big data engineers, and system administrators interested in advancing their knowledge in resource management in Hadoop systems. If you have basic familiarity with Hadoop and need a deeper understanding or feature knowledge of YARN for professional growth, this book is tailored for you. It is also suitable for learners seeking to integrate big data platforms like Spark into YARN clusters.

Spark Cookbook

Spark Cookbook is your practical guide to mastering Apache Spark, encompassing a comprehensive set of patterns and examples. Through its over 60 recipes, you will gain actionable insights into using Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX effectively for your big data needs.

What this Book will help me do
Understand how to install and configure Apache Spark in various environments.
Build data pipelines and perform real-time analytics with Spark Streaming.
Utilize Spark SQL for interactive data querying and reporting (see the sketch after this blurb).
Apply machine learning workflows using MLlib, including supervised and unsupervised models.
Develop optimized big data solutions and integrate them into enterprise platforms.

Author(s)
Yadav, the author of Spark Cookbook, is an experienced data engineer and technical expert with deep insights into big data processing frameworks. Yadav has spent years working with Spark and its ecosystem, providing practical guidance to developers and data scientists alike. This book reflects their commitment to sharing actionable knowledge.

Who is it for?
This book is designed for data engineers, developers, and data scientists who work with big data systems and wish to utilize Apache Spark effectively. Whether you're looking to optimize existing Spark applications or explore its libraries for new use cases, this book will provide the guidance you need. A basic familiarity with big data concepts and programming in languages like Java or Python is recommended to make the most of this book.
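In the spirit of the book's recipes, here is a minimal Spark SQL sketch: register a DataFrame as a temporary view and query it interactively. It is written against the modern SparkSession entry point rather than the SQLContext of early Spark releases, and the data is invented:

```python
# A minimal Spark SQL recipe: create a DataFrame, expose it as a view,
# and query it with plain SQL. Data is invented for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-recipe").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)
df.createOrReplaceTempView("people")

spark.sql("SELECT name FROM people WHERE age > 30").show()
spark.stop()
```

Once registered, the same view can be queried from any SQL-speaking tool attached to the session, which is what makes Spark SQL useful for interactive reporting.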

IBM Software Defined Infrastructure for Big Data Analytics Workloads

This IBM® Redbooks® publication documents how IBM Platform Computing, with its IBM Platform Symphony® MapReduce framework, IBM Spectrum Scale (based on IBM GPFS™), IBM Platform LSF®, and the Advanced Service Controller for Platform Symphony, works together as an infrastructure to manage not just Hadoop-related offerings, but many popular industry offerings, such as Apache Spark, Storm, MongoDB, Cassandra, and so on. It describes the different ways to run Hadoop in a big data environment, and demonstrates how IBM Platform Computing solutions, such as Platform Symphony and Platform LSF with its MapReduce Accelerator, can improve the performance and agility of running Hadoop on the distributed workload managers offered by IBM. This information is for technical professionals (consultants, technical support staff, IT architects, and IT specialists) who are responsible for delivering cost-effective cloud services and big data solutions on IBM Power Systems™ to help uncover insights in their clients’ data so they can optimize product development and business results.

Hadoop Essentials

In 'Hadoop Essentials,' you'll embark on an engaging journey to master the Hadoop ecosystem. This book covers fundamental to advanced topics, from HDFS and MapReduce to real-time analytics with Spark, empowering you to handle modern data challenges efficiently.

What this Book will help me do
Understand the core components of Hadoop, including HDFS, YARN, and MapReduce, for foundational knowledge.
Learn to optimize Big Data architectures and improve application performance.
Utilize tools like Hive and Pig for efficient data querying and processing.
Master data ingestion technologies like Sqoop and Flume for seamless data management.
Achieve fluency in real-time data analytics using modern tools like Apache Spark and Apache Storm.

Author(s)
Achari is a seasoned expert in Big Data and distributed systems with in-depth knowledge of the Hadoop ecosystem. With years of experience in both development and teaching, they craft content that bridges practical know-how with theoretical insights in a highly accessible style.

Who is it for?
This book is perfect for system and application developers aiming to learn practical applications of Hadoop. It suits professionals seeking solutions to real-world Big Data challenges, as well as those familiar with distributed systems basics and looking to deepen their expertise in advanced data analysis.

Advanced Analytics with Spark

In this practical book, four Cloudera data scientists present a set of self-contained patterns for performing large-scale data analysis with Spark. The authors bring Spark, statistical methods, and real-world data sets together to teach you how to approach analytics problems by example. You’ll start with an introduction to Spark and its ecosystem, and then dive into patterns that apply common techniques—classification, collaborative filtering, and anomaly detection among others—to fields such as genomics, security, and finance. If you have an entry-level understanding of machine learning and statistics, and you program in Java, Python, or Scala, you’ll find these patterns useful for working on your own data applications.
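As one example of the patterns described, collaborative filtering in MLlib comes down to a few lines around ALS. This is a generic sketch with invented ratings, not an excerpt from the book:

```python
# A minimal collaborative-filtering sketch with MLlib's ALS:
# train on (user, product, rating) triples, then predict an unseen pair.
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext("local[*]", "als-demo")

ratings = sc.parallelize([
    Rating(user=1, product=10, rating=5.0),
    Rating(user=1, product=20, rating=1.0),
    Rating(user=2, product=10, rating=4.0),
    Rating(user=2, product=30, rating=5.0),
])

model = ALS.train(ratings, rank=5, iterations=10)
print(model.predict(2, 20))   # predicted rating for user 2, product 20
sc.stop()
```

Most of the real work in such a pattern lies in assembling and cleaning the ratings, which is precisely why the book teaches through full, self-contained examples.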

Real-World Hadoop

If you’re a business team leader, CIO, business analyst, or developer interested in how Apache Hadoop and Apache HBase-related technologies can address problems involving large-scale data in cost-effective ways, this book is for you. Using real-world stories and situations, authors Ted Dunning and Ellen Friedman show Hadoop newcomers and seasoned users alike how NoSQL databases and Hadoop can solve a variety of business and research issues. You’ll learn about early decisions and pre-planning that can make the process easier and more productive. If you’re already using these technologies, you’ll discover ways to gain the full range of benefits possible with Hadoop. While you don’t need a deep technical background to get started, this book does provide expert guidance to help managers, architects, and practitioners succeed with their Hadoop projects.

Examine a day in the life of big data: India’s ambitious Aadhaar project
Review tools in the Hadoop ecosystem, such as Apache Spark, Storm, and Drill, to learn how they can help you
Pick up a collection of technical and strategic tips that have helped others succeed with Hadoop
Learn from several prototypical Hadoop use cases, based on how organizations have actually applied the technology
Explore real-world stories that reveal how MapR customers combine use cases when putting Hadoop and NoSQL to work, including in production

Ted Dunning is Chief Applications Architect at MapR Technologies, and a committer and PMC member of the Apache Drill, Storm, Mahout, and ZooKeeper projects. He is also a mentor for the Apache DataFu, Kylin, Zeppelin, Calcite, and Samoa projects. Ellen Friedman is a solutions consultant, speaker, and author, writing mainly about big data topics. She is a committer for the Apache Mahout project and a contributor to the Apache Drill project.