talk-data.com

Topic

Spark

Apache Spark

big_data distributed_computing analytics

581 tagged

Activity Trend: 71 peak/qtr, 2020-Q1 to 2026-Q1

Activities

581 activities · Newest first

Practical Predictive Analytics

Dive into the world of predictive analytics with 'Practical Predictive Analytics.' This comprehensive guide walks you through analyzing current and historical data to predict future outcomes. Using tools like R and Spark, you will master practical skills, solve real-world challenges, and apply predictive analytics across domains like marketing, healthcare, and retail.

What this Book will help me do

Learn the six steps for successfully implementing predictive analytics projects.
Acquire practical skills in data cleaning, input, and model deployment using tools like R and Spark.
Understand core predictive analytics algorithms and their applications in various industries.
Apply data analytics techniques to solve problems in fields such as healthcare and marketing.
Master methods for handling big data analytics using Databricks and Spark for effective prediction.

Author(s)

The author, Winters, is an experienced data scientist and technical educator. With an extensive background in predictive analytics, Winters specializes in applying statistical methods and techniques to real-world consultation scenarios. Winters brings a practical and accessible approach to this text, ensuring that learners can follow along and apply their newfound expertise effectively.

Who is it for?

This book is ideal for statisticians and analysts with some programming background in languages like R, who want to master predictive analytics skills. It caters to intermediate learners who aim to enhance their ability to solve complex analytical problems. Whether you're looking to advance your career or improve your proficiency in data science, this book will serve as a valuable resource for learning and growth.

Agile Data Science 2.0

Data science teams looking to turn research into useful analytics applications require not only the right tools, but also the right approach if they’re to succeed. With the revised second edition of this hands-on guide, up-and-coming data scientists will learn how to use the Agile Data Science development methodology to build data applications with Python, Apache Spark, Kafka, and other tools. Author Russell Jurney demonstrates how to compose a data platform for building, deploying, and refining analytics applications with Apache Kafka, MongoDB, ElasticSearch, d3.js, scikit-learn, and Apache Airflow. You’ll learn an iterative approach that lets you quickly change the kind of analysis you’re doing, depending on what the data is telling you. Publish data science work as a web application, and effect meaningful change in your organization.

Build value from your data in a series of agile sprints, using the data-value pyramid
Extract features for statistical models from a single dataset
Visualize data with charts, and expose different aspects through interactive reports
Use historical data to predict the future via classification and regression
Translate predictions into actions
Get feedback from users after each sprint to keep your project on track

Advanced Analytics with Spark, 2nd Edition

In the second edition of this practical book, four Cloudera data scientists present a set of self-contained patterns for performing large-scale data analysis with Spark. The authors bring Spark, statistical methods, and real-world data sets together to teach you how to approach analytics problems by example. Updated for Spark 2.1, this edition acts as an introduction to these techniques and other best practices in Spark programming. You’ll start with an introduction to Spark and its ecosystem, and then dive into patterns that apply common techniques (including classification, clustering, collaborative filtering, and anomaly detection) to fields such as genomics, security, and finance. If you have an entry-level understanding of machine learning and statistics, and you program in Java, Python, or Scala, you’ll find the book’s patterns useful for working on your own data applications.

With this book, you will:

Familiarize yourself with the Spark programming model
Become comfortable within the Spark ecosystem
Learn general approaches in data science
Examine complete implementations that analyze large public data sets
Discover which machine learning tools make sense for particular problems
Acquire code that can be adapted to many uses

Apache Spark 2.x Cookbook

Discover how to harness the power of Apache Spark 2.x for your Big Data processing projects. In this book, you will explore over 70 cloud-ready recipes that will guide you to perform distributed data analytics, structured streaming, machine learning, and much more.

What this Book will help me do

Effectively install and configure Apache Spark with various cluster managers and platforms.
Set up and utilize development environments tailored for Spark applications.
Operate on schema-aware data using RDDs, DataFrames, and Datasets.
Perform real-time streaming analytics with sources such as Apache Kafka.
Leverage MLlib for supervised learning, unsupervised learning, and recommendation systems.

Author(s)

Yadav is a seasoned data engineer with a deep understanding of Big Data tools and technologies, particularly Apache Spark. With years of experience in the field of distributed computing and data analysis, Yadav brings practical insights and techniques to enrich the learning experience of readers.

Who is it for?

This book is ideal for data engineers, data scientists, and Big Data professionals who are keen to enhance their Apache Spark 2.x skills. If you're working with distributed processing and want to solve complex data challenges, this book addresses practical problems. Note that a basic understanding of Scala is recommended to get the most out of this resource.

Data Lake for Enterprises

"Data Lake for Enterprises" is a comprehensive guide to building data lakes using the Lambda Architecture. It introduces big data technologies like Hadoop, Spark, and Flume, showing how to use them effectively to manage and leverage enterprise-scale data. You'll gain the skills to design and implement data systems that handle complex data challenges.

What this Book will help me do

Master the use of Lambda Architecture to create scalable and effective data management systems.
Understand and implement technologies like Hadoop, Spark, Kafka, and Flume in an enterprise data lake.
Integrate batch and stream processing techniques using big data tools for comprehensive data analysis.
Optimize data lakes for performance and reliability with practical insights and techniques.
Implement real-world use cases of data lakes and machine learning for predictive data insights.

Author(s)

Mishra, John, and Pankaj Misra are recognized experts in big data systems with a strong background in designing and deploying data solutions. With a clear and methodical teaching style, they bring years of experience to this book, providing readers with the tools and knowledge required to excel in enterprise big data initiatives.

Who is it for?

This book is ideal for software developers, data architects, and IT professionals looking to integrate a data lake strategy into their enterprises. It caters to readers with a foundational understanding of Java and big data concepts, aiming to advance their practical knowledge of building scalable data systems. If you're eager to delve into cutting-edge technologies and transform enterprise data management, this book is for you.

High Performance Spark

Apache Spark is amazing when everything clicks. But if you haven’t seen the performance improvements you expected, or still don’t feel confident enough to use Spark in production, this practical book is for you. Authors Holden Karau and Rachel Warren demonstrate performance optimizations to help your Spark queries run faster and handle larger data sizes, while using fewer resources. Ideal for software engineers, data engineers, developers, and system administrators working with large-scale data applications, this book describes techniques that can reduce data infrastructure costs and developer hours. Not only will you gain a more comprehensive understanding of Spark, you’ll also learn how to make it sing.

With this book, you’ll explore:

How Spark SQL’s new interfaces improve performance over SQL’s RDD data structure
The choice between data joins in Core Spark and Spark SQL
Techniques for getting the most out of standard RDD transformations
How to work around performance issues in Spark’s key/value pair paradigm
Writing high-performance Spark code without Scala or the JVM
How to test for functionality and performance when applying suggested improvements
Using Spark MLlib and Spark ML machine learning libraries
Spark’s Streaming components and external community packages

Machine Learning with Spark - Second Edition

Dive into the world of distributed machine learning with Apache Spark, a powerful framework for handling, processing, and analyzing big data. This book will take you through implementing popular machine learning algorithms using Spark ML, covering end-to-end workflows such as data preparation, model building, predictive analysis, and text processing.

What this Book will help me do

Learn to implement scalable machine learning solutions using Spark ML.
Develop the skills to set up and configure Apache Spark environments.
Master the application of machine learning techniques like clustering, classification, and regression with Spark.
Efficiently handle and process large-scale datasets using Spark tools.
Put Spark's capabilities to work in building real-world distributed data processing solutions.

Author(s)

Dua and Ghotra bring a wealth of experience in big data and machine learning to this book. They have been involved in building scalable data systems and implementing machine learning solutions in various industry scenarios. Their approach is hands-on and focused on teaching practical, actionable knowledge.

Who is it for?

This book is perfect for data enthusiasts, data engineers, and machine learning practitioners who are familiar with Python and Scala and eager to apply machine learning concepts in distributed environments. It's aimed at professionals looking to develop their skills in building scalable data systems and implementing advanced machine learning workflows in Spark.

Sams Teach Yourself Hadoop in 24 Hours

Apache Hadoop is the technology at the heart of the Big Data revolution, and Hadoop skills are in enormous demand. Now, in just 24 lessons of one hour or less, you can learn all the skills and techniques you'll need to deploy each key component of a Hadoop platform in your local environment or in the cloud, building a fully functional Hadoop cluster and using it with real programs and datasets. Each short, easy lesson builds on all that's come before, helping you master all of Hadoop's essentials, and extend it to meet your unique challenges.

Apache Hadoop in 24 Hours, Sams Teach Yourself covers all this, and much more:

Understanding Hadoop and the Hadoop Distributed File System (HDFS)
Importing data into Hadoop, and processing it there
Mastering basic MapReduce Java programming, and using advanced MapReduce API concepts
Making the most of Apache Pig and Apache Hive
Implementing and administering YARN
Taking advantage of the full Hadoop ecosystem
Managing Hadoop clusters with Apache Ambari
Working with the Hadoop User Environment (HUE)
Scaling, securing, and troubleshooting Hadoop environments
Integrating Hadoop into the enterprise
Deploying Hadoop in the cloud
Getting started with Apache Spark

Step-by-step instructions walk you through common questions, issues, and tasks; Q-and-As, Quizzes, and Exercises build and test your knowledge; "Did You Know?" tips offer insider advice and shortcuts; and "Watch Out!" alerts help you avoid pitfalls. By the time you're finished, you'll be comfortable using Apache Hadoop to solve a wide spectrum of Big Data problems.

Mastering Spark for Data Science

Master the techniques and sophisticated analytics used to construct Spark-based solutions that scale to deliver production-grade data science products

About This Book

Develop and apply advanced analytical techniques with Spark
Learn how to tell a compelling story with data science using Spark’s ecosystem
Explore data at scale and work with cutting edge data science methods

Who This Book Is For

This book is for those who have beginner-level familiarity with the Spark architecture and data science applications, especially those who are looking for a challenge and want to learn cutting edge techniques. This book assumes working knowledge of data science, common machine learning methods, and popular data science tools, and assumes you have previously run proof of concept studies and built prototypes.

What You Will Learn

Learn the design patterns that integrate Spark into industrialized data science pipelines
See how commercial data scientists design scalable code and reusable code for data science services
Explore cutting edge data science methods so that you can study trends and causality
Discover advanced programming techniques using RDD and the DataFrame and Dataset APIs
Find out how Spark can be used as a universal ingestion engine tool and as a web scraper
Practice the implementation of advanced topics in graph processing, such as community detection and contact chaining
Get to know the best practices when performing Extended Exploratory Data Analysis, commonly used in commercial data science teams
Study advanced Spark concepts, solution design patterns, and integration architectures
Demonstrate powerful data science pipelines

In Detail

Data science seeks to transform the world using data, and this is typically achieved through disrupting and changing real processes in real industries. In order to operate at this level you need to build data science solutions of substance: solutions that solve real problems.

Spark has emerged as the big data platform of choice for data scientists due to its speed, scalability, and easy-to-use APIs. This book deep dives into using Spark to deliver production-grade data science solutions. This process is demonstrated by exploring the construction of a sophisticated global news analysis service that uses Spark to generate continuous geopolitical and current affairs insights. You will learn all about the core Spark APIs and take a comprehensive tour of advanced libraries, including Spark SQL, Spark Streaming, MLlib, and more. You will be introduced to advanced techniques and methods that will help you to construct commercial-grade data products. Focusing on a sequence of tutorials that deliver a working news intelligence service, you will learn about advanced Spark architectures, how to work with geographic data in Spark, and how to tune Spark algorithms so they scale linearly.

Style and approach

This is an advanced guide for those with beginner-level familiarity with the Spark architecture and working with Data Science applications. Mastering Spark for Data Science is a practical tutorial that uses core Spark APIs and takes a deep dive into advanced libraries including: Spark SQL, visual streaming, and MLlib. This book expands on titles like: Machine Learning with Spark and Learning Spark. It is the next learning curve for those comfortable with Spark and looking to improve their skills.

Learning Apache Spark 2

Dive into the world of Big Data with "Learning Apache Spark 2". This book introduces you to the powerful Apache Spark framework, tailored for real-time data analytics and machine learning. Through practical examples and real-world use-cases, you'll gain hands-on experience in leveraging Spark's capabilities for your data processing needs.

What this Book will help me do

Master the fundamentals of Apache Spark 2 and its new features.
Effectively use Spark SQL, MLlib, RDDs, GraphX, and Spark Streaming to tackle real-world challenges.
Gain skills in data processing, transformation, and analysis with Spark.
Deploy and operate your Spark applications in clustered environments.
Develop your own recommendation engines and predictive analytics models with Spark.

Author(s)

Abbasi brings a wealth of expertise in Big Data technologies with a keen focus on simplifying complex concepts for learners. With substantial experience working in data processing frameworks, their approach to teaching creates an engaging and practical learning experience. With "Learning Apache Spark 2", Abbasi empowers readers to confidently tackle challenges in Big Data processing and analytics.

Who is it for?

This book is ideal for aspiring Big Data professionals seeking an accessible introduction to Apache Spark. Beginners in Spark will find step-by-step guidance, while those familiar with earlier versions will appreciate the insights into Spark 2's new features. Familiarity with Big Data concepts and Scala programming is recommended for optimal understanding.

Data Science For Dummies, 2nd Edition

Your ticket to breaking into the field of data science! Jobs in data science are projected to outpace the number of people with data science skills, making those with the knowledge to fill a data science position a hot commodity in the coming years. Data Science For Dummies is the perfect starting point for IT professionals and students interested in making sense of an organization's massive data sets and applying their findings to real-world business scenarios. From uncovering rich data sources to managing large amounts of data within hardware and software limitations, ensuring consistency in reporting, merging various data sources, and beyond, you'll develop the know-how you need to effectively interpret data and tell a story that can be understood by anyone in your organization.

Provides a background in data science fundamentals and preparing your data for analysis
Details different data visualization techniques that can be used to showcase and summarize your data
Explains both supervised and unsupervised machine learning, including regression, model validation, and clustering techniques
Includes coverage of big data processing tools like MapReduce, Hadoop, Dremel, Storm, and Spark

It's a big, big data world out there; let Data Science For Dummies help you harness its power and gain a competitive edge for your organization.

Learning PySpark

"Learning PySpark" guides you through mastering the integration of Python with Apache Spark to build scalable and efficient data applications. You'll delve into Spark 2.0's architecture, efficiently process data, and explore PySpark's capabilities ranging from machine learning to structured streaming. By the end, you'll be equipped to craft and deploy robust data pipelines and applications.

What this Book will help me do

Master the Spark 2.0 architecture and its Python integration with PySpark.
Leverage PySpark DataFrames and RDDs for effective data manipulation and analysis.
Develop scalable machine learning models using PySpark's ML and MLlib libraries.
Understand advanced PySpark features such as GraphFrames for graph processing and TensorFrames for deep learning models.
Gain expertise in deploying PySpark applications locally and on the cloud for production-ready solutions.

Author(s)

Authors Drabas and Lee bring extensive experience in data engineering and Python programming. They combine a practical, example-driven approach with deep insights into Apache Spark's ecosystem. Their expertise and clarity in writing make this book accessible for individuals aiming to excel in big data technologies with Python.

Who is it for?

This book is best suited for Python developers who want to integrate Apache Spark 2.0 into their workflow to process large-scale data. Ideal readers will have foundational knowledge of Python and seek to build scalable data-intensive applications using Spark, regardless of prior experience with Spark itself.

Scala: Guide for Data Science Professionals

Scala will be a valuable tool to have on hand during your data science journey for everything from data cleaning to cutting-edge machine learning

About This Book

Build data science and data engineering solutions with ease
An in-depth look at each stage of the data analysis process, from reading and collecting data to distributed analytics
Explore a broad variety of data processing, machine learning, and genetic algorithms through diagrams, mathematical formulations, and source code

Who This Book Is For

This learning path is perfect for those who are comfortable with Scala programming and now want to enter the field of data science. Some knowledge of statistics is expected.

What You Will Learn

Transfer and filter tabular data to extract features for machine learning
Read, clean, transform, and write data to both SQL and NoSQL databases
Create Scala web applications that couple with JavaScript libraries such as D3 to create compelling interactive visualizations
Load data from HDFS and HIVE with ease
Run streaming and graph analytics in Spark for exploratory analysis
Bundle and scale up Spark jobs by deploying them into a variety of cluster managers
Build dynamic workflows for scientific computing
Leverage open source libraries to extract patterns from time series
Master probabilistic models for sequential data

In Detail

Scala is especially good for analyzing large sets of data as the scale of the task doesn’t have any significant impact on performance. Scala’s powerful functional libraries can interact with databases and build scalable frameworks, resulting in the creation of robust data pipelines. The first module introduces you to Scala libraries to ingest, store, manipulate, process, and visualize data. Using real world examples, you will learn how to design scalable architecture to process and model data, starting from simple concurrency constructs and progressing to actor systems and Apache Spark.

After this, you will also learn how to build interactive visualizations with web frameworks. Once you have become familiar with all the tasks involved in data science, you will explore data analytics with Scala in the second module. You’ll see how Scala can be used to make sense of data through easy to follow recipes. You will learn about Bokeh bindings for exploratory data analysis and quintessential machine learning algorithms with the Spark ML library. You’ll get a sufficient understanding of Spark Streaming, machine learning for streaming data, and Spark GraphX. Armed with a firm understanding of data analysis, you will be ready to explore the most cutting-edge aspect of data science: machine learning. The final module teaches you the A to Z of machine learning with Scala. You’ll explore Scala for dependency injections and implicits, which are used to write machine learning algorithms. You’ll also explore machine learning topics such as clustering, dimensionality reduction, Naïve Bayes, regression models, SVMs, neural networks, and more.

This learning path combines some of the best that Packt has to offer into one complete, curated package. It includes content from the following Packt products:

Scala for Data Science, Pascal Bugnion
Scala Data Analysis Cookbook, Arun Manivannan
Scala for Machine Learning, Patrick R. Nicolas

Style and approach

A complete package with all the information necessary to start building useful data engineering and data science solutions straight away. It contains a diverse set of recipes that cover the full spectrum of interesting data analysis tasks and will help you revolutionize your data analysis skills using Scala.

Downloading the example code for this book. You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the code file.

Summary

There is a vast constellation of tools and platforms for processing and analyzing your data. In this episode Matthew Rocklin talks about how Dask fills the gap between a task oriented workflow tool and an in memory processing framework, and how it brings the power of Python to bear on the problem of big data.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers.
Your host is Tobias Macey and today I’m interviewing Matthew Rocklin about Dask and the Blaze ecosystem.

Interview with Matthew Rocklin

Introduction
How did you get involved in the area of data engineering?
Dask began its life as part of the Blaze project. Can you start by describing what Dask is and how it originated?
There are a vast number of tools in the field of data analytics. What are some of the specific use cases that Dask was built for that weren’t able to be solved by the existing options?
One of the compelling features of Dask is the fact that it is a Python library that allows for distributed computation at a scale that has largely been the exclusive domain of tools in the Hadoop ecosystem. Why do you think that the JVM has been the reigning platform in the data analytics space for so long?
Do you consider Dask, along with the larger Blaze ecosystem, to be a competitor to the Hadoop ecosystem, either now or in the future?
Are you seeing many Hadoop or Spark solutions being migrated to Dask? If so, what are the common reasons?
There is a strong focus for using Dask as a tool for interactive exploration of data. How does it compare to something like Apache Drill?
For anyone looking to integrate Dask into an existing code base that is already using NumPy or Pandas, what does that process look like?
How do the task graph capabilities compare to something like Airflow or Luigi?
Looking through the documentation for the graph specification in Dask, it appears that there is the potential to introduce cycles or other bugs into a large or complex task chain. Is there any built-in tooling to check for that before submitting the graph for execution?
What are some of the most interesting or unexpected projects that you have seen Dask used for?
What do you perceive as being the most relevant aspects of Dask for data engineering/data infrastructure practitioners, as compared to the end users of the systems that they support?
What are some of the most significant problems that you have been faced with, and which still need to be overcome in the Dask project?
I know that the work on Dask is largely performed under the umbrella of PyData and sponsored by Continuum Analytics. What are your thoughts on the financial landscape for open source data analytics and distributed computation frameworks as compared to the broader world of open source projects?

Keep in touch

@mrocklin on Twitter
mrocklin on GitHub

Links

http://matthewrocklin.com/blog/work/2016/09/22/cluster-deployments?utm_source=rss&utm_medium=rss
https://opendatascience.com/blog/dask-for-institutions/?utm_source=rss&utm_medium=rss
Continuum Analytics
2sigma
X-Array
Tornado (Website, Podcast Interview)
Airflow
Luigi
Mesos
Kubernetes
Spark
Dryad
Yarn
Read The Docs
XData

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Pro Apache Phoenix: An SQL Driver for HBase, First Edition

Leverage Phoenix as an ANSI SQL engine built on top of the highly distributed and scalable NoSQL framework HBase. Learn the basics and best practices that are being adopted in Phoenix to enable a high write and read throughput in a big data space. This book includes real-world cases such as Internet of Things devices that send continuous streams to Phoenix, and the book explains how key features such as joins, indexes, transactions, and functions help you understand the simple, flexible, and powerful API that Phoenix provides. Examples are provided using real-time data and data-driven businesses that show you how to collect, analyze, and act in seconds. Pro Apache Phoenix covers the nuances of setting up a distributed HBase cluster with Phoenix libraries, running performance benchmarks, configuring parameters for production scenarios, and viewing the results. The book also shows how Phoenix plays well with other key frameworks in the Hadoop ecosystem such as Apache Spark, Pig, Flume, and Sqoop.

You will learn how to:

Handle a petabyte data store by applying familiar SQL techniques
Store, analyze, and manipulate data in a NoSQL Hadoop ecosystem with HBase
Apply best practices while working with a scalable data store on Hadoop and HBase
Integrate popular frameworks (Apache Spark, Pig, Flume) to simplify big data analysis
Demonstrate real-time use cases and big data modeling techniques

Who This Book Is For

Data engineers, Big Data administrators, and architects

Apache Spark for Data Science Cookbook

In "Apache Spark for Data Science Cookbook," you'll delve into solving real-world analytical challenges using the robust Apache Spark framework. This book features hands-on recipes that cover data analysis, distributed machine learning, and real-time data processing. You'll gain practical skills to process, visualize, and extract insights from large datasets efficiently.

What this Book will help me do

Master using Apache Spark for processing and analyzing large-scale datasets effectively.
Harness Spark's MLlib for implementing machine learning algorithms like classification and clustering.
Utilize libraries such as NumPy, SciPy, and Pandas in conjunction with Spark for numerical computations.
Apply techniques like Natural Language Processing and text mining using Spark-integrated tools.
Perform end-to-end data science workflows, including data exploration, modeling, and visualization.

Author(s)

Nagamallikarjuna Inelu and Chitturi bring their extensive experience working with data science and distributed computing frameworks like Apache Spark. Nagamallikarjuna specializes in applying machine learning algorithms to big data problems, while Chitturi has contributed to various big data system implementations. Together, they focus on providing practitioners with practical and efficient solutions.

Who is it for?

This book is primarily intended for novice and intermediate data scientists and analysts who are curious about using Apache Spark to tackle data science problems. Readers are expected to have some familiarity with basic data science tasks. If you want to learn practical applications of Spark in data analysis and enhance your big data analytics skills, this resource is for you.

Fast Data Processing Systems with SMACK Stack

Fast Data Processing Systems with SMACK Stack introduces you to the SMACK stack, a combination of Spark, Mesos, Akka, Cassandra, and Kafka. You will learn to integrate these technologies to build scalable, efficient, and real-time data processing platforms tailored for solving critical business challenges.

What this Book will help me do

Understand the concepts of fast data pipelines and design scalable architectures using the SMACK stack
Gain expertise in functional programming with Scala and leverage its power in data processing tasks
Build and optimize distributed databases using Apache Cassandra for scaling extensively
Deploy and manage real-time data streams using Apache Kafka to handle massive messaging workloads
Implement cost-effective cluster infrastructures with Apache Mesos for efficient resource utilization

Author(s)

Estrada is an expert in distributed systems and big data technologies. With years of experience implementing SMACK-based solutions across industries, Estrada offers a practical viewpoint to designing scalable systems. Their blend of theoretical knowledge and applied practices ensures readers receive actionable guidance.

Who is it for?

This book is perfect for software developers, data engineers, or data scientists looking to deepen their understanding of real-time data processing systems. If you have a foundational knowledge of the technologies in the SMACK stack or wish to learn how to combine these cutting-edge tools to solve complex problems, this is for you. Readers with an interest in building efficient big data solutions will find tremendous value here.

Practical Data Science with Hadoop® and Spark: Designing and Building Effective Analytics at Scale

The Complete Guide to Data Science with Hadoop: For Technical Professionals, Businesspeople, and Students. Demand is soaring for professionals who can solve real data science problems with Hadoop and Spark. Practical Data Science with Hadoop® and Spark is your complete guide to doing just that. Drawing on immense experience with Hadoop and big data, three leading experts bring together everything you need: high-level concepts, deep-dive techniques, real-world use cases, practical applications, and hands-on tutorials. The authors introduce the essentials of data science and the modern Hadoop ecosystem, explaining how Hadoop and Spark have evolved into an effective platform for solving data science problems at scale. In addition to comprehensive application coverage, the authors also provide useful guidance on the important steps of data ingestion, data munging, and visualization. Once the groundwork is in place, the authors focus on specific applications, including machine learning, predictive modeling for sentiment analysis, clustering for document analysis, anomaly detection, and natural language processing (NLP). This guide provides a strong technical foundation for those who want to do practical data science, and also presents business-driven guidance on how to apply Hadoop and Spark to optimize ROI of data science initiatives. Learn What data science is, how it has evolved, and how to plan a data science career How data volume, variety, and velocity shape data science use cases Hadoop and its ecosystem, including HDFS, MapReduce, YARN, and Spark Data importation with Hive and Spark Data quality, preprocessing, preparation, and modeling Visualization: surfacing insights from huge data sets Machine learning: classification, regression, clustering, and anomaly detection Algorithms and Hadoop tools for predictive modeling Cluster analysis and similarity functions Large-scale anomaly detection NLP: applying data science to human language

Expert Hadoop® Administration

The Comprehensive, Up-to-Date Apache Hadoop Administration Handbook and Reference. “Sam Alapati has worked with production Hadoop clusters for six years. His unique depth of experience has enabled him to write the go-to resource for all administrators looking to spec, size, expand, and secure production Hadoop clusters of any size.” –Paul Dix, Series Editor In Expert Hadoop® Administration, leading Hadoop administrator Sam R. Alapati brings together authoritative knowledge for creating, configuring, securing, managing, and optimizing production Hadoop clusters in any environment. Drawing on his experience with large-scale Hadoop administration, Alapati integrates action-oriented advice with carefully researched explanations of both problems and solutions. He covers an unmatched range of topics and offers an unparalleled collection of realistic examples. Alapati demystifies complex Hadoop environments, helping you understand exactly what happens behind the scenes when you administer your cluster. You’ll gain unprecedented insight as you walk through building clusters from scratch and configuring high availability, performance, security, encryption, and other key attributes. The high-value administration skills you learn here will be indispensable no matter what Hadoop distribution you use or what Hadoop applications you run. Understand Hadoop’s architecture from an administrator’s standpoint Create simple and fully distributed clusters Run MapReduce and Spark applications in a Hadoop cluster Manage and protect Hadoop data and high availability Work with HDFS commands, file permissions, and storage management Move data, and use YARN to allocate resources and schedule jobs Manage job workflows with Oozie and Hue Secure, monitor, log, and optimize Hadoop Benchmark and troubleshoot Hadoop

Spark in Action

Spark in Action teaches you the theory and skills you need to effectively handle batch and streaming data using Spark. Fully updated for Spark 2.0. About the Technology Big data systems distribute datasets across clusters of machines, making it a challenge to efficiently query, stream, and interpret them. Spark can help. It is a processing system designed specifically for distributed data. It provides easy-to-use interfaces, along with the performance you need for production-quality analytics and machine learning. Spark 2 also adds improved programming APIs, better performance, and countless other upgrades. About the Book Spark in Action teaches you the theory and skills you need to effectively handle batch and streaming data using Spark. You'll get comfortable with the Spark CLI as you work through a few introductory examples. Then, you'll start programming Spark using its core APIs. Along the way, you'll work with structured data using Spark SQL, process near-real-time streaming data, apply machine learning algorithms, and munge graph data using Spark GraphX. For a zero-effort startup, you can download the preconfigured virtual machine ready for you to try the book's code. What's Inside Updated for Spark 2.0 Real-life case studies Spark DevOps with Docker Examples in Scala, and online in Java and Python About the Reader Written for experienced programmers with some background in big data or machine learning. About the Authors Petar Zečević and Marko Bonaći are seasoned developers heavily involved in the Spark community. Quotes Dig in and get your hands dirty with one of the hottest data processing engines today. A great guide. - Jonathan Sharley, Pandora Media Must-have! Speed up your learning of Spark as a distributed computing framework. - Robert Ormandi, Yahoo! An easy-to-follow, step-by-step guide. - Gaurav Bhardwaj, 3Pillar Global An ambitiously comprehensive overview of Spark and its diverse ecosystem. - Jonathan Miller, Optensity