talk-data.com

Topic

Spark

Apache Spark

big_data distributed_computing analytics

143

tagged


Activities

Showing filtered results

Filtering by: O'Reilly Data Engineering Books

Real-Time Big Data Analytics

This book delves into the techniques and tools essential for designing, processing, and analyzing complex datasets in real time using advanced frameworks like Apache Spark, Storm, and Amazon Kinesis. By engaging with this thorough guide, you'll build proficiency in creating robust, efficient, and scalable real-time data processing architectures tailored to real-world scenarios.

What this Book will help me do

Learn the fundamentals of real-time data processing and how it differs from batch processing.
Gain hands-on experience with Apache Storm for creating robust data-driven solutions.
Develop real-world applications using Amazon Kinesis for cloud-based analytics.
Perform complex data queries and transformations with Spark SQL and understand Spark RDDs.
Master the Lambda Architecture to combine batch and real-time analytics effectively.

Author(s)

Shilpi Saxena is a renowned expert in big data technologies, holding extensive experience in real-time data analytics. With a career spanning years in the industry, Shilpi has provided innovative solutions for big data challenges in top-tier organizations. Her teaching approach emphasizes practical applicability, making her writings accessible and impactful for developers and architects alike.

Who is it for?

This book is for software professionals such as Big Data architects, developers, or programmers looking to enhance their skills in real-time big data analytics. If you are familiar with basic programming principles and seek to build solutions for processing large data streams in real-time environments, this book caters to your needs. It is also suitable for those seeking to familiarize themselves with using state-of-the-art tools like Spark SQL, Apache Storm, and Amazon Kinesis. Whether you're extending current expertise or transitioning into this field, this resource helps you achieve your objectives.

Fast Data Front Ends for Hadoop

Organizations striving to build applications for streaming data have a new possibility to ponder: the use of ingestion engines at the front end of their Hadoop systems. With this O’Reilly report, you’ll learn how these fast data front ends process data before it reaches the Hadoop Distributed File System (HDFS), and provide intelligence and context in real time. This helps you reduce response times from hours to minutes, or even minutes to seconds. Author and independent consultant Akmal Chaudhri looks at several popular ingestion engines, including Apache Spark, Apache Storm, and the VoltDB in-memory database. Among them, VoltDB stands out by providing full Atomicity, Consistency, Isolation, and Durability (ACID) support. VoltDB also lets you build a fast data front end that uses the familiar SQL language and standards. Learn the advantages of ingestion engines as well as the theoretical and practical problems that can come up in an implementation. You’ll discover how this option can handle streaming data, provide state, ensure durability, and support transactions and real-time decisions.

Akmal B. Chaudhri is an Independent Consultant, specializing in big data, NoSQL, and NewSQL database technologies. He has previously held roles as a developer, consultant, product strategist, and technical trainer with several blue-chip companies and big data startups. Akmal regularly presents at international conferences and serves on program committees for several major conferences and workshops.

Scalable Big Data Architecture: A Practitioner’s Guide to Choosing Relevant Big Data Architecture

This book highlights the different types of data architecture and illustrates the many possibilities hidden behind the term "Big Data", from the usage of NoSQL databases to the deployment of stream analytics architecture, machine learning, and governance. Scalable Big Data Architecture covers real-world, concrete industry use cases that leverage complex distributed applications, which involve web applications, RESTful APIs, and high throughput of large amounts of data stored in highly scalable NoSQL data stores such as Couchbase and Elasticsearch. This book demonstrates how data processing can be done at scale, from the usage of NoSQL data stores to the combination of Big Data distributions. When the data processing is too complex and involves different processing topologies such as long-running jobs, stream processing, correlation of multiple data sources, and machine learning, it’s often necessary to delegate the load to Hadoop or Spark and use the NoSQL store to serve processed data in real time. This book shows you how to choose a relevant combination of big data technologies available within the Hadoop ecosystem. It focuses on processing long jobs, architecture, stream data patterns, log analysis, and real-time analytics. Every pattern is illustrated with practical examples, which use different open source projects such as Logstash, Spark, Kafka, and so on. Traditional data infrastructures are built for digesting and rendering data synthesis and analytics from large amounts of data. This book helps you to understand why you should consider using machine learning algorithms early on in the project, before being overwhelmed by constraints imposed by dealing with the high throughput of Big Data. Scalable Big Data Architecture is for developers, data architects, and data scientists looking for a better understanding of how to choose the most relevant pattern for a Big Data project and which tools to integrate into that pattern.

Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing

This book is a step-by-step guide for learning how to use Spark for different types of big-data analytics projects, including batch, interactive, graph, and stream data analysis as well as machine learning. It covers Spark core and its add-on libraries, including Spark SQL, Spark Streaming, GraphX, MLlib, and Spark ML. Big Data Analytics with Spark shows you how to use Spark and leverage its easy-to-use features to increase your productivity. You learn to perform fast data analysis using its in-memory caching and advanced execution engine, employ in-memory computing capabilities for building high-performance machine learning and low-latency interactive analytics applications, and much more. Moreover, the book shows you how to use Spark as a single integrated platform for a variety of data processing tasks, including ETL pipelines, BI, live data stream processing, graph analytics, and machine learning. The book also includes a chapter on Scala, the hottest functional programming language, and the language that underlies Spark. You’ll learn the basics of functional programming in Scala, so that you can write Spark applications in it. What's more, Big Data Analytics with Spark provides an introduction to other big data technologies that are commonly used along with Spark, such as HDFS, Avro, Parquet, Kafka, Cassandra, HBase, Mesos, and so on. It also provides an introduction to machine learning and graph concepts. So the book is self-sufficient; all the technologies that you need to know to use Spark are covered. The only thing that you are expected to have is some programming knowledge in any language.

Apache Oozie Essentials

Apache Oozie Essentials serves as your guide to mastering Apache Oozie, a powerful workflow scheduler for Hadoop environments. Through lucid explanations and practical examples, you will learn how to create, schedule, and enhance workflows for data ingestion, processing, and machine learning tasks using Oozie.

What this Book will help me do

Install and configure Apache Oozie in your Hadoop environment to start managing workflows.
Develop seamless workflows that integrate tools like Hive, Pig, and Sqoop to automate data operations.
Set up coordinators to handle timed and dependent job executions efficiently.
Deploy Spark jobs within your workflows for machine learning on large datasets.
Harness Oozie security features to improve your system's reliability and trustworthiness.

Author(s)

Singh is a seasoned developer with a deep understanding of big data processing and Apache Oozie. Drawing on the author's practical experience, the book intersperses technical detail with real-world examples for an effective learning experience. The author's goal is to make Oozie accessible and useful to professionals.

Who is it for?

This book is ideal for data engineers and Hadoop professionals looking to streamline their workflow management using Apache Oozie. Whether you're a novice to Oozie or aiming to implement complex data and ML pipelines, the book offers comprehensive guidance tailored to your needs.

Data Munging with Hadoop

The Example-Rich, Hands-On Guide to Data Munging with Apache Hadoop™

Data scientists spend much of their time “munging” data: handling day-to-day tasks such as data cleansing, normalization, aggregation, sampling, and transformation. These tasks are both critical and surprisingly interesting. Most important, they deepen your understanding of your data’s structure and limitations: crucial insight for improving accuracy and mitigating risk in any analytical project. Now, two leading Hortonworks data scientists, Ofer Mendelevitch and Casey Stella, bring together powerful, practical insights for effective Hadoop-based data munging of large datasets. Drawing on extensive experience with advanced analytics, the authors offer realistic examples that address the common issues you’re most likely to face. They describe each task in detail, presenting example code based on widely used tools such as Pig, Hive, and Spark. This concise, hands-on eBook is valuable for every data scientist, data engineer, and architect who wants to master data munging: not just in theory, but in practice with the field’s #1 platform–Hadoop.

Coverage includes

A framework for understanding the various types of data quality checks, including cell-based rules, distribution validation, and outlier analysis
Assessing tradeoffs in common approaches to imputing missing values
Implementing quality checks with Pig or Hive UDFs
Transforming raw data into “feature matrix” format for machine learning algorithms
Choosing features and instances
Implementing text features via “bag-of-words” and NLP techniques
Handling time-series data via frequency- or time-domain methods
Manipulating feature values to prepare for modeling

Data Munging with Hadoop is part of a larger, forthcoming work entitled Data Science Using Hadoop. To be notified when the larger work is available, register your purchase of Data Munging with Hadoop at informit.com/register and check the box “I would like to hear from InformIT and its family of brands about products and special offers.”

Hadoop 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem

Get Started Fast with Apache Hadoop® 2, YARN, and Today’s Hadoop Ecosystem

With Hadoop 2.x and YARN, Hadoop moves beyond MapReduce to become practical for virtually any type of data processing. Hadoop 2.x and the Data Lake concept represent a radical shift away from conventional approaches to data usage and storage. Hadoop 2.x installations offer unmatched scalability and breakthrough extensibility that supports new and existing Big Data analytics processing methods and models. Hadoop® 2 Quick-Start Guide is the first easy, accessible guide to Apache Hadoop 2.x, YARN, and the modern Hadoop ecosystem. Building on his unsurpassed experience teaching Hadoop and Big Data, author Douglas Eadline covers all the basics you need to know to install and use Hadoop 2 on personal computers or servers, and to navigate the powerful technologies that complement it. Eadline concisely introduces and explains every key Hadoop 2 concept, tool, and service, illustrating each with a simple “beginning-to-end” example and identifying trustworthy, up-to-date resources for learning more. This guide is ideal if you want to learn about Hadoop 2 without getting mired in technical details. Douglas Eadline will bring you up to speed quickly, whether you’re a user, admin, devops specialist, programmer, architect, analyst, or data scientist.

Coverage Includes

Understanding what Hadoop 2 and YARN do, and how they improve on Hadoop 1 with MapReduce
Understanding Hadoop-based Data Lakes versus RDBMS Data Warehouses
Installing Hadoop 2 and core services on Linux machines, virtualized sandboxes, or clusters
Exploring the Hadoop Distributed File System (HDFS)
Understanding the essentials of MapReduce and YARN application programming
Simplifying programming and data movement with Apache Pig, Hive, Sqoop, Flume, Oozie, and HBase
Observing application progress, controlling jobs, and managing workflows
Managing Hadoop efficiently with Apache Ambari–including recipes for HDFS to NFSv3 gateway, HDFS snapshots, and YARN configuration
Learning basic Hadoop 2 troubleshooting, and installing Apache Hue and Apache Spark

Sams Teach Yourself: Big Data Analytics with Microsoft HDInsight in 24 Hours

In just 24 lessons of one hour or less, Sams Teach Yourself Big Data Analytics with Microsoft HDInsight in 24 Hours helps you leverage Hadoop’s power on a flexible, scalable cloud platform using Microsoft’s newest business intelligence, visualization, and productivity tools. This book’s straightforward, step-by-step approach shows you how to provision, configure, monitor, and troubleshoot HDInsight and use Hadoop cloud services to solve real analytics problems. You’ll gain more of Hadoop’s benefits, with less complexity–even if you’re completely new to Big Data analytics. Every lesson builds on what you’ve already learned, giving you a rock-solid foundation for real-world success.

Practical, hands-on examples show you how to apply what you learn
Quizzes and exercises help you test your knowledge and stretch your skills
Notes and tips point out shortcuts and solutions

Learn how to…

Master core Big Data and NoSQL concepts, value propositions, and use cases
Work with key Hadoop features, such as HDFS2 and YARN
Quickly install, configure, and monitor Hadoop (HDInsight) clusters in the cloud
Automate provisioning, customize clusters, install additional Hadoop projects, and administer clusters
Integrate, analyze, and report with Microsoft BI and Power BI
Automate workflows for data transformation, integration, and other tasks
Use Apache HBase on HDInsight
Use Sqoop or SSIS to move data to or from HDInsight
Perform R-based statistical computing on HDInsight datasets
Accelerate analytics with Apache Spark
Run real-time analytics on high-velocity data streams
Write MapReduce, Hive, and Pig programs

Register your book at informit.com/register for convenient access to downloads, updates, and corrections as they become available.

Hadoop with Python

Hadoop is mostly written in Java, but that doesn't exclude the use of other programming languages with this distributed storage and processing framework, particularly Python. With this concise book, you’ll learn how to use Python with the Hadoop Distributed File System (HDFS), MapReduce, the Apache Pig platform and Pig Latin script, and the Apache Spark cluster-computing framework. Authors Zachary Radtka and Donald Miner from the data science firm Miner & Kasch take you through the basic concepts behind Hadoop, MapReduce, Pig, and Spark. Then, through multiple examples and use cases, you'll learn how to work with these technologies by applying various Python tools.

Use the Python library Snakebite to access HDFS programmatically from within Python applications
Write MapReduce jobs in Python with mrjob, the Python MapReduce library
Extend Pig Latin with user-defined functions (UDFs) in Python
Use the Spark Python API (PySpark) to write Spark programs with Python
Learn how to use the Luigi Python workflow scheduler to manage MapReduce jobs and Pig scripts

Zachary Radtka, a platform engineer at Miner & Kasch, has extensive experience creating custom analytics that run on petabyte-scale data sets.

Apache Spark Graph Processing

Dive into the world of large-scale graph data processing with Apache Spark's GraphX API. This book introduces you to the core concepts of graph analytics and teaches you how to leverage Spark for handling and analyzing massive graphs. From building to analyzing, you'll acquire a comprehensive skillset to work with graph data efficiently.

What this Book will help me do

Learn to utilize Apache Spark GraphX API to process and analyze graph data.
Master transforming raw datasets into sophisticated graph structures.
Explore visualization and analysis techniques for understanding graphs.
Understand and build custom graph operations tailored to your needs.
Implement advanced graph algorithms like clustering and iterative processing.

Author(s)

Rindra Ramamonjison is a seasoned data engineer with vast experience in big data technologies and graph processing. With a passion for explaining complex concepts in simple terms, Rindra builds on his professional expertise to guide readers in mastering cutting-edge Spark tools.

Who is it for?

This book is tailored for data scientists and software developers looking to delve into graph data processing at scale. Ideal for those with basic knowledge of Scala and Apache Spark, it equips readers with the tools and techniques to derive insights from complex network datasets. Whether you're diving deeper into big data or exploring graph-specific analytics, this book is your guide.

Learning YARN

"Learning YARN" is your comprehensive guide to master YARN, the resource management layer in the Hadoop ecosystem. Through the book, you'll leverage YARN's capabilities for big data processing, learning to deploy, manage, and scale Hadoop-YARN clusters. What this Book will help me do Understand the main features and benefits of the YARN framework. Gain experience managing Hadoop clusters of varying sizes. Learn to integrate YARN with domain-specific big data tools like Spark. Become skilled at administration and configuration of YARN. Develop and run your own YARN-based applications for distributed computing. Author(s) Akhil Arora and Shrey Mehrotra bring with them years of experience working in big data frameworks and technologies. With expertise in YARN specifically, they aim to bridge the gap for developers and administrators to learn and implement scalable big data solutions. Their extensive knowledge in cluster management and distributed data processing shines through in how this book is structured and detailed. Who is it for? This book is ideal for software developers, big data engineers, and system administrators interested in advancing their knowledge in resource management in Hadoop systems. If you have basic familiarity with Hadoop and need a deeper understanding or feature knowledge of YARN for professional growth, this book is tailored for you. It is also suitable for learners seeking to integrate big data platforms like Spark into YARN clusters.

Spark Cookbook

Spark Cookbook is your practical guide to mastering Apache Spark, encompassing a comprehensive set of patterns and examples. Through more than 60 recipes, you will gain actionable insights into using Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX effectively for your big data needs.

What this Book will help me do

Understand how to install and configure Apache Spark in various environments.
Build data pipelines and perform real-time analytics with Spark Streaming.
Utilize Spark SQL for interactive data querying and reporting.
Apply machine learning workflows using MLlib, including supervised and unsupervised models.
Develop optimized big data solutions and integrate them into enterprise platforms.

Author(s)

Yadav, the author of Spark Cookbook, is an experienced data engineer and technical expert with deep insights into big data processing frameworks. Yadav has spent years working with Spark and its ecosystem, providing practical guidance to developers and data scientists alike. This book reflects their commitment to sharing actionable knowledge.

Who is it for?

This book is designed for data engineers, developers, and data scientists who work with big data systems and wish to utilize Apache Spark effectively. Whether you're looking to optimize existing Spark applications or explore its libraries for new use cases, this book will provide the guidance you need. A basic familiarity with big data concepts and programming in languages like Java or Python is recommended to make the most out of this book.

IBM Software Defined Infrastructure for Big Data Analytics Workloads

This IBM® Redbooks® publication documents how IBM Platform Computing, with its IBM Platform Symphony® MapReduce framework, IBM Spectrum Scale (based on IBM GPFS™), IBM Platform LSF®, and the Advanced Service Controller for Platform Symphony work together as an infrastructure to manage not just Hadoop-related offerings, but many popular industry offerings such as Apache Spark, Storm, MongoDB, Cassandra, and so on. It describes the different ways to run Hadoop in a big data environment, and demonstrates how IBM Platform Computing solutions, such as Platform Symphony and Platform LSF with its MapReduce Accelerator, can improve performance and agility when running Hadoop on distributed workload managers offered by IBM. This information is for technical professionals (consultants, technical support staff, IT architects, and IT specialists) who are responsible for delivering cost-effective cloud services and big data solutions on IBM Power Systems™ to help uncover insights in clients’ data so they can optimize product development and business results.

Hadoop Essentials

In 'Hadoop Essentials,' you'll embark on an engaging journey to master the Hadoop ecosystem. This book covers fundamental to advanced topics, from HDFS and MapReduce to real-time analytics with Spark, empowering you to handle modern data challenges efficiently.

What this Book will help me do

Understand the core components of Hadoop, including HDFS, YARN, and MapReduce, for foundational knowledge.
Learn to optimize Big Data architectures and improve application performance.
Utilize tools like Hive and Pig for efficient data querying and processing.
Master data ingestion technologies like Sqoop and Flume for seamless data management.
Achieve fluency in real-time data analytics using modern tools like Apache Spark and Apache Storm.

Author(s)

Achari is a seasoned expert in Big Data and distributed systems with in-depth knowledge of the Hadoop ecosystem. With years of experience in both development and teaching, they craft content that bridges practical know-how with theoretical insights in a highly accessible style.

Who is it for?

This book is perfect for system and application developers aiming to learn practical applications of Hadoop. It suits professionals seeking solutions to real-world Big Data challenges as well as those familiar with distributed systems basics and looking to deepen their expertise in advanced data analysis.

Advanced Analytics with Spark

In this practical book, four Cloudera data scientists present a set of self-contained patterns for performing large-scale data analysis with Spark. The authors bring Spark, statistical methods, and real-world data sets together to teach you how to approach analytics problems by example. You’ll start with an introduction to Spark and its ecosystem, and then dive into patterns that apply common techniques—classification, collaborative filtering, and anomaly detection among others—to fields such as genomics, security, and finance. If you have an entry-level understanding of machine learning and statistics, and you program in Java, Python, or Scala, you’ll find these patterns useful for working on your own data applications.

Real-World Hadoop

If you’re a business team leader, CIO, business analyst, or developer interested in how Apache Hadoop and Apache HBase-related technologies can address problems involving large-scale data in cost-effective ways, this book is for you. Using real-world stories and situations, authors Ted Dunning and Ellen Friedman show Hadoop newcomers and seasoned users alike how NoSQL databases and Hadoop can solve a variety of business and research issues. You’ll learn about early decisions and pre-planning that can make the process easier and more productive. If you’re already using these technologies, you’ll discover ways to gain the full range of benefits possible with Hadoop. While you don’t need a deep technical background to get started, this book does provide expert guidance to help managers, architects, and practitioners succeed with their Hadoop projects.

Examine a day in the life of big data: India’s ambitious Aadhaar project
Review tools in the Hadoop ecosystem such as Apache’s Spark, Storm, and Drill to learn how they can help you
Pick up a collection of technical and strategic tips that have helped others succeed with Hadoop
Learn from several prototypical Hadoop use cases, based on how organizations have actually applied the technology
Explore real-world stories that reveal how MapR customers combine use cases when putting Hadoop and NoSQL to work, including in production

Ted Dunning is Chief Applications Architect at MapR Technologies, and a committer and PMC member of Apache’s Drill, Storm, Mahout, and ZooKeeper projects. He is also a mentor for Apache’s Datafu, Kylin, Zeppelin, Calcite, and Samoa projects.

Ellen Friedman is a solutions consultant, speaker, and author, writing mainly about big data topics. She is a committer for the Apache Mahout project and a contributor to the Apache Drill project.

Hadoop: The Definitive Guide, 4th Edition

Get ready to unlock the power of your data. With the fourth edition of this comprehensive guide, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters. Using Hadoop 2 exclusively, author Tom White presents new chapters on YARN and several Hadoop-related projects such as Parquet, Flume, Crunch, and Spark. You’ll learn about recent changes to Hadoop, and explore new case studies on Hadoop’s role in healthcare systems and genomics data processing.

Learn fundamental components such as MapReduce, HDFS, and YARN
Explore MapReduce in depth, including steps for developing applications with it
Set up and maintain a Hadoop cluster running HDFS and MapReduce on YARN
Learn two data formats: Avro for data serialization and Parquet for nested data
Use data ingestion tools such as Flume (for streaming data) and Sqoop (for bulk data transfer)
Understand how high-level data processing tools like Pig, Hive, Crunch, and Spark work with Hadoop
Learn the HBase distributed database and the ZooKeeper distributed configuration service

Field Guide to Hadoop

If your organization is about to enter the world of big data, you not only need to decide whether Apache Hadoop is the right platform to use, but also which of its many components are best suited to your task. This field guide makes the exercise manageable by breaking down the Hadoop ecosystem into short, digestible sections. You’ll quickly understand how Hadoop’s projects, subprojects, and related technologies work together. Each chapter introduces a different topic, such as core technologies or data transfer, and explains why certain components may or may not be useful for particular needs. When it comes to data, Hadoop is a whole new ballgame, but with this handy reference, you’ll have a good grasp of the playing field.

Topics include:

Core technologies: Hadoop Distributed File System (HDFS), MapReduce, YARN, and Spark
Database and data management: Cassandra, HBase, MongoDB, and Hive
Serialization: Avro, JSON, and Parquet
Management and monitoring: Puppet, Chef, ZooKeeper, and Oozie
Analytic helpers: Pig, Mahout, and MLlib
Data transfer: Sqoop, Flume, distcp, and Storm
Security, access control, auditing: Sentry, Kerberos, and Knox
Cloud computing and virtualization: Serengeti, Docker, and Whirr

Learning Spark

Data in all domains is getting bigger. How can you work with it efficiently? Recently updated for Spark 1.3, this book introduces Apache Spark, the open source cluster computing system that makes data analytics fast to write and fast to run. With Spark, you can tackle big datasets quickly through simple APIs in Python, Java, and Scala. This edition includes new information on Spark SQL, Spark Streaming, setup, and Maven coordinates.

Learning Hadoop 2

Delve into the world of big data with 'Learning Hadoop 2', a comprehensive guide to leveraging the capabilities of Hadoop 2 for data processing and analysis. In this book, you will explore the tools and frameworks that integrate with Hadoop, discovering the best ways to design and deploy effective workflows for managing and analyzing large datasets.

What this Book will help me do

Understand the fundamentals of the MapReduce framework and its applications.
Utilize advanced tools such as Samza and Spark for real-time and iterative data processing.
Manage large datasets with data mining techniques tailored for Hadoop environments.
Deploy Hadoop applications across various infrastructures, including local clusters and cloud services.
Create and orchestrate sophisticated data workflows and pipelines with Apache Pig and Oozie.

Author(s)

Gabriele Modena is an experienced developer and trained data specialist with a keen focus on distributed data processing frameworks. Having worked extensively with big data platforms, Gabriele brings practical insights and a hands-on perspective to technical subjects. His writing is concise and engaging, aiming to render complex concepts accessible.

Who is it for?

This book is ideal for system and application developers eager to learn practical implementations of the Hadoop framework. Readers should be familiar with the Unix/Linux command-line interface and Java programming. Prior experience with Hadoop will be advantageous, but not necessary.