O'Reilly Data Engineering Books

Data Analytics with Spark Using Python, First edition

2018-06-04 O'Reilly Amazon

book

Jeffrey Aven

data data-engineering apache-spark AI/ML Analytics Cloud Computing

Spark for Data Professionals introduces and solidifies the concepts behind Spark 2.x, teaching working developers, architects, and data professionals exactly how to build practical Spark solutions. Jeffrey Aven covers all aspects of Spark development, including basic programming to SparkSQL, SparkR, Spark Streaming, Messaging, NoSQL and Hadoop integration. Each chapter presents practical exercises deploying Spark to your local or cloud environment, plus programming exercises for building real applications. Unlike other Spark guides, Spark for Data Professionals explains crucial concepts step-by-step, assuming no extensive background as an open source developer. It provides a complete foundation for quickly progressing to more advanced data science and machine learning topics. This guide will help you: Understand Spark basics that will make you a better programmer and cluster “citizen” Master Spark programming techniques that maximize your productivity Choose the right approach for each problem Make the most of built-in platform constructs, including broadcast variables, accumulators, effective partitioning, caching, and checkpointing Leverage powerful tools for managing streaming, structured, semi-structured, and unstructured data

Hands-On Data Warehousing with Azure Data Factory

2018-05-31 O'Reilly Amazon

book

Christian Cote , Giuseppe Ciaburro , Michelle Gutzait

data data-engineering storage-repositories data-warehouse AI/ML Analytics

Dive into the world of ETL (Extract, Transform, Load) with 'Hands-On Data Warehousing with Azure Data Factory'. This book guides readers through the essential techniques for working with Azure Data Factory and SQL Server Integration Services to design, implement, and optimize ETL solutions for both on-premises and cloud data environments. What this Book will help me do Understand and utilize Azure Data Factory and SQL Server Integration Services to build ETL solutions. Design scalable and high-performance ETL architectures tailored to modern data problems. Integrate various Azure services, such as Azure Data Lake Analytics, Machine Learning, and Databricks Spark, into your workflows. Troubleshoot and optimize ETL pipelines and address common challenges in data processing. Create insightful Power BI dashboards to visualize and interact with data from your ETL workflows. Author(s) Christian Cote, Michelle Gutzait, and Giuseppe Ciaburro bring a wealth of experience in data engineering and cloud technologies to this practical guide. Combining expertise in the Azure ecosystem and hands-on data warehousing, they deliver actionable insights for working professionals. Who is it for? This book is crafted for software professionals working in data engineering, especially those specializing in ETL processes. Readers with a foundational knowledge of SQL Server and cloud infrastructures will benefit most. If you aspire to implement state-of-the-art ETL pipelines or enhance existing workflows with ADF and SSIS, this book is an ideal resource.

Data Science Fundamentals for Python and MongoDB

2018-05-10 O'Reilly Amazon

book

David Paper

data data-engineering nosql-databases MongoDB AI/ML Data Science

Build the foundational data science skills necessary to work with and better understand complex data science algorithms. This example-driven book provides complete Python coding examples to complement and clarify data science concepts, and enrich the learning experience. Coding examples include visualizations whenever appropriate. The book is a necessary precursor to applying and implementing machine learning algorithms. The book is self-contained. All of the math, statistics, stochastic, and programming skills required to master the content are covered. In-depth knowledge of object-oriented programming isn’t required because complete examples are provided and explained. Data Science Fundamentals with Python and MongoDB is an excellent starting point for those interested in pursuing a career in data science. Like any science, the fundamentals of data science are a prerequisite to competency. Without proficiency in mathematics, statistics, data manipulation, and coding, the path to success is “rocky” at best. The coding examples in this book are concise, accurate, and complete, and perfectly complement the data science concepts introduced. What You'll Learn Prepare for a career in data science Work with complex data structures in Python Simulate with Monte Carlo and Stochastic algorithms Apply linear algebra using vectors and matrices Utilize complex algorithms such as gradient descent and principal component analysis Wrangle, cleanse, visualize, and problem solve with data Use MongoDB and JSON to work with data Who This Book Is For The novice yearning to break into the data science world, and the enthusiast looking to enrich, deepen, and develop data science skills through mastering the underlying fundamentals that are sometimes skipped over in the rush to be productive. Some knowledge of object-oriented programming will make learning easier.

Networking Design for HPC and AI on IBM Power Systems

2018-04-26 O'Reilly Amazon

book

Scott Vetter , Rico Franke , Yanil Zeledón Miranda , Tobias Elpelt

data data-engineering IBM AI/ML

This publication provides information about networking design for IBM® High Performance Computing (HPC) and AI for Power Systems™. This paper will help you understand the basic requirements when designing a solution, the components in an infrastructure for HPC and AI systems, the design of interconnect and data networks with use cases based on real-life scenarios, and the administration and out-of-band management networks. We cover all the necessary requirements, provide a good understanding of the technology, and include examples for small, medium, and large cluster environments. This paper is intended for IT architects, system designers, data center planners, and system administrators who must design or provide a solution for the infrastructure of an HPC cluster.

Enhancing the IBM Power Systems Platform with IBM Watson Services

2018-04-12 O'Reilly Amazon

book

Soheel Chughtai , Ahmed Azraq , Ahmed Mashhour , Duy V Nguyen , Reginaldo Marcelo Dos Santos

data data-engineering IBM ibm-power-systems Agile/Scrum AI/ML

This IBM® Redbooks® publication provides an introduction to the IBM POWER® processor architecture. It describes the IBM POWER processor and IBM Power Systems™ servers, highlighting the advantages and benefits of IBM Power Systems servers, IBM AIX®, IBM i, and Linux on Power. This publication showcases typical business scenarios that are powered by Power Systems servers. It provides an introduction to the artificial intelligence (AI) capabilities that IBM Watson® services enable, and how these AI capabilities can be augmented in existing applications by using an agile approach to embed intelligence into every operational process. For each use case, the business benefits of adding Watson services are detailed. This publication gives an overview about each Watson service, and how each one is commonly used in real business scenarios. It gives an introduction to the Watson API explorer, which you can use to try the application programming interfaces (APIs) and their capabilities. The Watson services are positioned against the machine learning capabilities of IBM PowerAI. In this publication, you have a guide about how to set up a development environment on Power Systems servers, a sample code implementation of one of the business cases, and a description of preferred practices to move any application that you develop into production. This publication is intended for technical professionals who are interested in learning about or implementing IBM Watson services on AIX, IBM i, and Linux.

IBM Power System AC922 Introduction and Technical Overview

2018-03-26 O'Reilly Amazon

book

Scott Vetter , Alexandre Bicas Caldeira

data data-engineering IBM AI/ML Analytics Marketing

This IBM® Redpaper™ publication is a comprehensive guide that covers the IBM Power System AC922 server (8335-GTG and 8335-GTW models). The Power AC922 server is the next generation of the IBM Power processor-based systems, which are designed for deep learning and artificial intelligence (AI), high-performance analytics, and high-performance computing (HPC). This paper introduces the major innovative Power AC922 server features and their relevant functions: Powerful IBM POWER9™ processors that offer 16 cores at 2.6 GHz with 3.09 GHz turbo performance or 20 cores at 2.0 GHz with 2.87 GHz turbo for the 8335-GTG Eighteen cores at 2.98 GHz with 3.26 GHz turbo performance or 22 cores at 2.78 GHz with 3.07 GHz turbo for the 8335-GTW IBM Coherent Accelerator Processor Interface (CAPI) 2.0, IBM OpenCAPI™, and second-generation NVIDIA NVLink technology for exceptional processor-to-accelerator intercommunication Up to six dedicated NVIDIA Tesla V100 GPUs This publication is for professionals who want to acquire a better understanding of IBM Power Systems™ products and is intended for the following audiences: Clients Sales and marketing professionals Technical support professionals IBM Business Partners Independent software vendors (ISVs) This paper expands the set of IBM Power Systems documentation by providing a desktop reference that offers a detailed technical description of the Power AC922 server. This paper does not replace the current marketing materials and configuration tools. It is intended as an extra source of information that, together with existing sources, can be used to enhance your knowledge of IBM server solutions.

SQL Server 2017 Developer's Guide

2018-03-16 O'Reilly Amazon

book

Miloš Radivojević , William Durkin , Dejan Sarka

data data-engineering SQL AI/ML Analytics BI

"SQL Server 2017 Developer's Guide" provides a comprehensive approach to learning and utilizing the new features introduced in SQL Server 2017. From advanced Transact-SQL to integrating R and Python into your database projects, this book equips you with the knowledge to design and develop efficient database applications tailored to modern requirements. What this Book will help me do Master new features in SQL Server 2017 to enhance database application development. Implement In-Memory OLTP and columnstore indexes for optimal performance. Utilize JSON support in SQL Server to integrate modern data formats. Leverage R and Python integration to apply advanced data analytics and machine learning. Learn Linux and container deployment options to expand SQL Server usage scenarios. Author(s) The authors of "SQL Server 2017 Developer's Guide" are industry veterans with extensive experience in database design, business intelligence, and advanced analytics. They bring a practical, hands-on writing style that helps developers apply theoretical concepts effectively. Their commitment to teaching is evident in the clear and detailed guidance provided throughout the book. Who is it for? This book is ideal for database developers and solution architects aiming to build robust database applications with SQL Server 2017. It's a valuable resource for business intelligence developers or analysts seeking to harness SQL Server 2017's advanced features. Some familiarity with SQL Server and T-SQL is recommended to fully leverage the insights provided by this book.

IBM PowerAI: Deep Learning Unleashed on IBM Power Systems Servers

2018-03-07 O'Reilly Amazon

book

Alfonso Jara , Dino Quintero , Shota Tsukamoto , Richard Wale , Bruno C. Faria , Bing He , Chris Parsons

data data-engineering IBM AI/ML TensorFlow

This IBM® Redbooks® publication is a guide about the IBM PowerAI Deep Learning solution. This book provides an introduction to artificial intelligence (AI) and deep learning (DL), IBM PowerAI, and components of IBM PowerAI, deploying IBM PowerAI, guidelines for working with data and creating models, an introduction to IBM Spectrum™ Conductor Deep Learning Impact (DLI), and case scenarios. IBM PowerAI started as a package of software distributions of many of the major DL software frameworks for model training, such as TensorFlow, Caffe, Torch, Theano, and the associated libraries, such as CUDA Deep Neural Network (cuDNN). The IBM PowerAI software is optimized for performance by using the IBM Power Systems™ servers that are integrated with NVLink. The AI stack foundation starts with servers with accelerators. Graphical processing unit (GPU) accelerators are well-suited for the compute-intensive nature of DL training, and servers with the highest CPU to GPU bandwidth, such as IBM Power Systems servers, enable the high-performance data transfer that is required for larger and more complex DL models. This publication targets technical readers, including developers, IT specialists, systems architects, brand specialists, sales teams, and anyone looking for a guide about how to understand the IBM PowerAI Deep Learning architecture, framework configuration, application and workload configuration, and user infrastructure.

Spark: The Definitive Guide

2018-02-26 O'Reilly Amazon

book

Matei Zaharia , Bill Chambers

data data-engineering apache-spark AI/ML API Big Data

Learn how to use, deploy, and maintain Apache Spark with this comprehensive guide, written by the creators of the open-source cluster-computing framework. With an emphasis on improvements and new features in Spark 2.0, authors Bill Chambers and Matei Zaharia break down Spark topics into distinct sections, each with unique goals. You’ll explore the basic operations and common functions of Spark’s structured APIs, as well as Structured Streaming, a new high-level API for building end-to-end streaming applications. Developers and system administrators will learn the fundamentals of monitoring, tuning, and debugging Spark, and explore machine learning techniques and scenarios for employing MLlib, Spark’s scalable machine-learning library. Get a gentle overview of big data and Spark Learn about DataFrames, SQL, and Datasets, Spark’s core APIs, through worked examples Dive into Spark’s low-level APIs, RDDs, and execution of SQL and DataFrames Understand how Spark runs on a cluster Debug, monitor, and tune Spark clusters and applications Learn the power of Structured Streaming, Spark’s stream-processing engine Learn how you can apply MLlib to a variety of problems, including classification or recommendation

Data Warehousing in the Age of Artificial Intelligence

2017-10-15 O'Reilly Amazon

book

Eric Boutin , Mike Boyarski , Gary Orenstein , Conor Doherty

data data-engineering storage-repositories data-warehouse AI/ML Analytics

Nearly 7,000 new mobile applications appear every day, and a constant stream of data gives them life. Many organizations rely on a predictive analytics model to turn data into useful business information and ensure the predictions remain accurate as data changes. It can be a complex, time-consuming process. This book shows how to automate and accelerate that process using machine learning (ML) on a modern data warehouse that runs on any cloud. Product specialists from MemSQL explain how today’s modern data warehouses provide the foundations to implement ML algorithms that run efficiently. Through several real-time use cases, you’ll learn how to quickly identify the right metrics to make actionable business decisions. This book explores foundational ML and artificial intelligence concepts to help you understand: How data warehouses accelerate deployment and simplify manageability How companies make a choice between cloud and on-premises deployments for building data processing applications Ways to build analytics and visualizations for business intelligence on historical data The technologies and architecture for building and deploying real-time data pipelines This book demonstrates specific models and examples for building supervised and unsupervised real-time ML applications, and gives practical advice on how to make the choice between building an ML pipeline or buying an existing solution. If you need to use data accurately and efficiently, a real-time data warehouse is a critical business tool.

Introduction to GPUs for Data Analytics

2017-10-15 O'Reilly Amazon

book

Eric Mizell , Roger Biery

data data-engineering AI/ML Analytics Big Data Cloud Computing

Moore’s law has finally run out of steam for CPUs. The number of x86 cores that can be placed cost-effectively on a single chip has reached a practical limit, making higher densities prohibitively expensive for most applications. Fortunately, for big data analytics, machine learning, and database applications, a more capable and cost-effective alternative for scaling compute performance is already available: the graphics processing unit, or GPU. In this report, executives at Kinetica and Sierra Communications explain how incorporating GPUs is ideal for keeping pace with the relentless growth in streaming, complex, and large data confronting organizations today. Technology professionals, business analysts, and data scientists will learn how their organizations can begin implementing GPU-accelerated solutions either on premise or in the cloud. This report explores: How GPUs supplement CPUs to enable continued price/performance gains The many database and data analytics applications that can benefit from GPU acceleration Why GPU databases with user-defined functions (UDFs) can simplify and unify the machine learning/deep learning pipeline How GPU-accelerated databases can process streaming data from the Internet of Things and other sources in real time The performance advantage of GPU databases in demanding geospatial analytics applications How cognitive computing—the most compute-intensive application currently imaginable—is now within reach, using GPUs

Apache Spark 2.x Machine Learning Cookbook

2017-09-22 O'Reilly Amazon

book

Mohammed Guller , Meenakshi Rajendran , Shuen Mei , Broderick Hall , Siamak Amirghodsi

data data-engineering apache-spark AI/ML Analytics Big Data

This book is your gateway to mastering machine learning with Apache Spark 2.x. Through detailed hands-on recipes, you'll delve into building scalable ML models, optimizing big data processes, and enhancing project efficiency. Gain practical knowledge and explore real-world applications of recommendations, clustering, analytics, and more with Spark's powerful capabilities. What this Book will help me do Understand how to integrate Scala and Spark for effective machine learning development. Learn to create scalable recommendation engines using Spark. Master the development of clustering systems to organize unlabelled data at scale. Explore Spark libraries to implement efficient text analytics and search engines. Optimize large-scale data operations, tackling high-dimensional issues with Spark. Author(s) The team of authors brings expertise in machine learning, data science, and Spark technologies. Their combined industry experience and academic knowledge ensure the book is grounded in practical applications while offering theoretical insights. With clear explanations and a step-by-step approach, they aim to simplify complex concepts for developers and data scientists. Who is it for? This book is crafted for Scala developers familiar with machine learning concepts but seeking practical applications with Spark. If you have been implementing models but want to scale them and leverage Spark's robust ecosystem, this guide will serve you well. It is ideal for professionals seeking to deepen their skills in Spark and data science.

Apache Spark 2.x for Java Developers

2017-07-26 O'Reilly Amazon

book

Sourav Gulati , Sumit Kumar

data data-engineering apache-spark AI/ML Analytics API

Delve into mastering big data processing with 'Apache Spark 2.x for Java Developers.' This book provides a practical guide to implementing Apache Spark using the Java APIs, offering a unique opportunity for Java developers to leverage Spark's powerful framework without transitioning to Scala. What this Book will help me do Learn how to process data from formats like XML, JSON, CSV using Spark Core. Implement real-time analytics using Spark Streaming and third-party tools like Kafka. Understand data querying with Spark SQL and master SQL schema processing. Apply machine learning techniques with Spark MLlib to real-world scenarios. Explore graph processing and analytics using Spark GraphX. Author(s) Sumit Kumar and Sourav Gulati, experienced professionals in Java development and big data, bring their wealth of practical experience and passion for teaching to this book. With a clear and concise writing style, they aim to simplify Spark for Java developers, making big data approachable. Who is it for? This book is perfect for Java developers who are eager to expand their skillset into big data processing with Apache Spark. Whether you are a seasoned Spark user or first diving into big data concepts, this book meets you at your level. With practical examples and straightforward explanations, you can unlock the potential of Spark in real-world scenarios.

Mastering Apache Spark 2.x - Second Edition

2017-07-26 O'Reilly Amazon

book

Romeo Kienzler

data data-engineering apache-spark AI/ML Analytics Big Data

Mastering Apache Spark 2.x is the essential guide to harnessing the power of big data processing. Dive into real-time data analytics, machine learning, and cluster computing using Apache Spark's advanced features and modules like Spark SQL and MLlib. What this Book will help me do Gain proficiency in Spark's batch and real-time data processing with SparkSQL. Master techniques for machine learning and deep learning using SparkML and SystemML. Understand the principles of Spark's graph processing with GraphX and GraphFrames. Learn to deploy Apache Spark efficiently on platforms like Kubernetes and IBM Cloud. Optimize Spark cluster performance by configuring parameters effectively. Author(s) Romeo Kienzler is a seasoned professional in big data and machine learning technologies. With years of experience in cloud-based distributed systems, Romeo brings practical insights into leveraging Apache Spark. He combines his deep technical expertise with a clear and engaging writing style. Who is it for? This book is tailored for intermediate Apache Spark users eager to deepen their knowledge in Spark 2.x's advanced features. Ideal for data engineers and big data professionals seeking to enhance their analytics pipelines with Spark. A basic understanding of Spark and Scala is necessary. If you're aiming to optimize Spark for real-world applications, this book is crafted for you.

Frank Kane's Taming Big Data with Apache Spark and Python

2017-06-30 O'Reilly Amazon

book

Frank Kane

data data-engineering apache-spark AI/ML AWS Amazon EMR

This book introduces you to the world of Big Data processing using Apache Spark and Python. You will learn to set up and run Spark on different systems, process massive datasets, and create solutions to real-world Big Data challenges with over 15 hands-on examples included. What this Book will help me do Understand the basics of Apache Spark and its ecosystem. Learn how to process large datasets with Spark RDDs using Python. Implement machine learning models with Spark's MLlib library. Master real-time data processing with Spark Streaming modules. Deploy and run Spark jobs on cloud clusters using AWS EMR. Author(s) Frank Kane spent 9 years working at Amazon and IMDb, handling and solving real-world machine learning and Big Data problems. Today, as an instructional designer and educator, he brings his wealth of experience to learners around the globe by creating accessible, practical learning resources. His teaching is clear, engaging, and designed to prepare students for real-world applications. Who is it for? This book is ideal for data scientists or data analysts seeking to delve into Big Data processing with Apache Spark. Readers who have foundational knowledge of Python, as well as some understanding of data processing principles, will find this book useful to sharpen their skills further. It is designed for those eager to learn the practical applications of Big Data tools in today's industry environments. By the end of this book, you should feel confident tackling Big Data challenges using Spark and Python.

Advanced Analytics with Spark, 2nd Edition

2017-06-12 O'Reilly Amazon

book

Josh Wills , Sandy Ryza , Sean Owen , Uri Laserson

data data-engineering apache-spark AI/ML Analytics Data Science

In the second edition of this practical book, four Cloudera data scientists present a set of self-contained patterns for performing large-scale data analysis with Spark. The authors bring Spark, statistical methods, and real-world data sets together to teach you how to approach analytics problems by example. Updated for Spark 2.1, this edition acts as an introduction to these techniques and other best practices in Spark programming. You’ll start with an introduction to Spark and its ecosystem, and then dive into patterns that apply common techniques—including classification, clustering, collaborative filtering, and anomaly detection—to fields such as genomics, security, and finance. If you have an entry-level understanding of machine learning and statistics, and you program in Java, Python, or Scala, you’ll find the book’s patterns useful for working on your own data applications. With this book, you will: Familiarize yourself with the Spark programming model Become comfortable within the Spark ecosystem Learn general approaches in data science Examine complete implementations that analyze large public data sets Discover which machine learning tools make sense for particular problems Acquire code that can be adapted to many uses

Apache Spark 2.x Cookbook

2017-05-31 O'Reilly Amazon

book

Rishi Yadav

data data-engineering apache-spark AI/ML Analytics Big Data

Discover how to harness the power of Apache Spark 2.x for your Big Data processing projects. In this book, you will explore over 70 cloud-ready recipes that will guide you to perform distributed data analytics, structured streaming, machine learning, and much more. What this Book will help me do Effectively install and configure Apache Spark with various cluster managers and platforms. Set up and utilize development environments tailored for Spark applications. Operate on schema-aware data using RDDs, DataFrames, and Datasets. Perform real-time streaming analytics with sources such as Apache Kafka. Leverage MLlib for supervised learning, unsupervised learning, and recommendation systems. Author(s) Rishi Yadav is a seasoned data engineer with a deep understanding of Big Data tools and technologies, particularly Apache Spark. With years of experience in the field of distributed computing and data analysis, Yadav brings practical insights and techniques to enrich the learning experience of readers. Who is it for? This book is ideal for data engineers, data scientists, and Big Data professionals who are keen to enhance their Apache Spark 2.x skills. If you're working with distributed processing and want to solve complex data challenges, this book addresses practical problems. Note that a basic understanding of Scala is recommended to get the most out of this resource.

Data Lake for Enterprises

2017-05-31 O'Reilly Amazon

book

Pankaj Misra , Tomcy John , Vivek Mishra

data data-engineering storage-repositories data-lake AI/ML AWS Lambda

"Data Lake for Enterprises" is a comprehensive guide to building data lakes using the Lambda Architecture. It introduces big data technologies like Hadoop, Spark, and Flume, showing how to use them effectively to manage and leverage enterprise-scale data. You'll gain the skills to design and implement data systems that handle complex data challenges. What this Book will help me do Master the use of Lambda Architecture to create scalable and effective data management systems. Understand and implement technologies like Hadoop, Spark, Kafka, and Flume in an enterprise data lake. Integrate batch and stream processing techniques using big data tools for comprehensive data analysis. Optimize data lakes for performance and reliability with practical insights and techniques. Implement real-world use cases of data lakes and machine learning for predictive data insights. Author(s) Vivek Mishra, Tomcy John, and Pankaj Misra are recognized experts in big data systems with a strong background in designing and deploying data solutions. With a clear and methodical teaching style, they bring years of experience to this book, providing readers with the tools and knowledge required to excel in enterprise big data initiatives. Who is it for? This book is ideal for software developers, data architects, and IT professionals looking to integrate a data lake strategy into their enterprises. It caters to readers with a foundational understanding of Java and big data concepts, aiming to advance their practical knowledge of building scalable data systems. If you're eager to delve into cutting-edge technologies and transform enterprise data management, this book is for you.

High Performance Spark

2017-05-25 O'Reilly Amazon

book

Rachel Warren , Holden Karau

data data-engineering apache-spark AI/ML Scala Spark

Apache Spark is amazing when everything clicks. But if you haven’t seen the performance improvements you expected, or still don’t feel confident enough to use Spark in production, this practical book is for you. Authors Holden Karau and Rachel Warren demonstrate performance optimizations to help your Spark queries run faster and handle larger data sizes, while using fewer resources. Ideal for software engineers, data engineers, developers, and system administrators working with large-scale data applications, this book describes techniques that can reduce data infrastructure costs and developer hours. Not only will you gain a more comprehensive understanding of Spark, you’ll also learn how to make it sing. With this book, you’ll explore: How Spark SQL’s new interfaces improve performance over SQL’s RDD data structure The choice between data joins in Core Spark and Spark SQL Techniques for getting the most out of standard RDD transformations How to work around performance issues in Spark’s key/value pair paradigm Writing high-performance Spark code without Scala or the JVM How to test for functionality and performance when applying suggested improvements Using Spark MLlib and Spark ML machine learning libraries Spark’s Streaming components and external community packages

Machine Learning with Spark - Second Edition

2017-04-28 O'Reilly Amazon

book

Rajdeep Dua , Brian O'Neill , Manpreet Singh Ghotra , Stephen Boesch , Nick Pentreath

data data-engineering apache-spark AI/ML Big Data Python

Dive into the world of distributed machine learning with Apache Spark, a powerful framework for handling, processing, and analyzing big data. This book will take you through implementing popular machine learning algorithms using Spark ML, covering end-to-end workflows such as data preparation, model building, predictive analysis, and text processing. What this Book will help me do Learn to implement scalable machine learning solutions using Spark ML. Develop the skills to set up and configure Apache Spark environments. Master the application of machine learning techniques like clustering, classification, and regression with Spark. Efficiently handle and process large-scale datasets using Spark tools. Put Spark's capabilities to work in building real-world distributed data processing solutions. Author(s) Rajdeep Dua and Manpreet Singh Ghotra bring a wealth of experience in big data and machine learning to this book. They have been involved in building scalable data systems and implementing machine learning solutions in various industry scenarios. Their approach is hands-on and focused on teaching practical, actionable knowledge. Who is it for? This book is perfect for data enthusiasts, data engineers, and machine learning practitioners who are familiar with Python and Scala, eager to apply machine learning concepts in distributed environments. It's aimed at professionals looking to develop their skills in building scalable data systems and implementing advanced machine learning workflows in Spark.

Mastering Spark for Data Science

2017-03-29 O'Reilly Amazon

book

Matthew Hallett , David George , Antoine Amend , Andrew Morgan

data data-engineering apache-spark AI/ML Analytics API

Master the techniques and sophisticated analytics used to construct Spark-based solutions that scale to deliver production-grade data science products About This Book Develop and apply advanced analytical techniques with Spark Learn how to tell a compelling story with data science using Spark’s ecosystem Explore data at scale and work with cutting edge data science methods Who This Book Is For This book is for those who have beginner-level familiarity with the Spark architecture and data science applications, especially those who are looking for a challenge and want to learn cutting edge techniques. This book assumes working knowledge of data science, common machine learning methods, and popular data science tools, and assumes you have previously run proof of concept studies and built prototypes. What You Will Learn Learn the design patterns that integrate Spark into industrialized data science pipelines See how commercial data scientists design scalable code and reusable code for data science services Explore cutting edge data science methods so that you can study trends and causality Discover advanced programming techniques using RDD and the DataFrame and Dataset APIs Find out how Spark can be used as a universal ingestion engine tool and as a web scraper Practice the implementation of advanced topics in graph processing, such as community detection and contact chaining Get to know the best practices when performing Extended Exploratory Data Analysis, commonly used in commercial data science teams Study advanced Spark concepts, solution design patterns, and integration architectures Demonstrate powerful data science pipelines In Detail Data science seeks to transform the world using data, and this is typically achieved through disrupting and changing real processes in real industries. In order to operate at this level you need to build data science solutions of substance – solutions that solve real problems.
Spark has emerged as the big data platform of choice for data scientists due to its speed, scalability, and easy-to-use APIs. This book deep dives into using Spark to deliver production-grade data science solutions. This process is demonstrated by exploring the construction of a sophisticated global news analysis service that uses Spark to generate continuous geopolitical and current affairs insights. You will learn all about the core Spark APIs and take a comprehensive tour of advanced libraries, including Spark SQL, Spark Streaming, MLlib, and more. You will be introduced to advanced techniques and methods that will help you to construct commercial-grade data products. Focusing on a sequence of tutorials that deliver a working news intelligence service, you will learn about advanced Spark architectures, how to work with geographic data in Spark, and how to tune Spark algorithms so they scale linearly. Style and approach This is an advanced guide for those with beginner-level familiarity with the Spark architecture and working with Data Science applications. Mastering Spark for Data Science is a practical tutorial that uses core Spark APIs and takes a deep dive into advanced libraries including: Spark SQL, visual streaming, and MLlib. This book expands on titles like: Machine Learning with Spark and Learning Spark. It is the next learning curve for those comfortable with Spark and looking to improve their skills.

Learning Apache Spark 2

2017-03-28 O'Reilly Amazon

book

Muhammad Asif Abbasi

data data-engineering apache-spark AI/ML Analytics Big Data

Dive into the world of Big Data with "Learning Apache Spark 2". This book introduces you to the powerful Apache Spark framework, tailored for real-time data analytics and machine learning. Through practical examples and real-world use-cases, you'll gain hands-on experience in leveraging Spark's capabilities for your data processing needs. What this Book will help me do Master the fundamentals of Apache Spark 2 and its new features. Effectively use Spark SQL, MLlib, RDDs, GraphX, and Spark Streaming to tackle real-world challenges. Gain skills in data processing, transformation, and analysis with Spark. Deploy and operate your Spark applications in clustered environments. Develop your own recommendation engines and predictive analytics models with Spark. Author(s) Muhammad Asif Abbasi brings a wealth of expertise in Big Data technologies with a keen focus on simplifying complex concepts for learners. With substantial experience working in data processing frameworks, his approach to teaching creates an engaging and practical learning experience. With "Learning Apache Spark 2", Abbasi empowers readers to confidently tackle challenges in Big Data processing and analytics. Who is it for? This book is ideal for aspiring Big Data professionals seeking an accessible introduction to Apache Spark. Beginners in Spark will find step-by-step guidance, while those familiar with earlier versions will appreciate the insights into Spark 2's new features. Familiarity with Big Data concepts and Scala programming is recommended for optimal understanding.

Learning PySpark

2017-02-27 O'Reilly Amazon

book

Denny Lee , Tomasz Drabas

data data-engineering apache-spark PySpark AI/ML Big Data

"Learning PySpark" guides you through mastering the integration of Python with Apache Spark to build scalable and efficient data applications. You'll delve into Spark 2.0's architecture, efficiently process data, and explore PySpark's capabilities ranging from machine learning to structured streaming. By the end, you'll be equipped to craft and deploy robust data pipelines and applications. What this Book will help me do Master the Spark 2.0 architecture and its Python integration with PySpark. Leverage PySpark DataFrames and RDDs for effective data manipulation and analysis. Develop scalable machine learning models using PySpark's ML and MLlib libraries. Understand advanced PySpark features such as GraphFrames for graph processing and TensorFrames for deep learning models. Gain expertise in deploying PySpark applications locally and on the cloud for production-ready solutions. Author(s) Authors None Drabas and None Lee bring extensive experience in data engineering and Python programming. They combine a practical, example-driven approach with deep insights into Apache Spark's ecosystem. Their expertise and clarity in writing make this book accessible for individuals aiming to excel in big data technologies with Python. Who is it for? This book is best suited for Python developers who want to integrate Apache Spark 2.0 into their workflow to process large-scale data. Ideal readers will have foundational knowledge of Python and seek to build scalable data-intensive applications using Spark, regardless of prior experience with Spark itself.

Big Data Now: 2016 Edition

2017-02-15 O'Reilly Amazon

book

O'Reilly Media, Inc.

data data-engineering AI/ML Big Data Cloud Computing

Now in its sixth edition, O’Reilly’s annual Big Data Now report recaps the trends, tools, applications, and forecasts we’ve examined throughout 2016. This collection of blog posts, authored by leading thinkers and experts in the field, reflects a unique set of themes we’ve identified as gaining significant attention and traction. Our list of topics for 2016 includes: Careers in data Tools and architecture for big data Intelligent real-time applications Cloud infrastructure Machine learning: models and training Deep learning and artificial intelligence

Apache Spark for Data Science Cookbook

2016-12-22 O'Reilly Amazon

book

Padma Priya Chitturi

data data-engineering apache-spark AI/ML Analytics Big Data

In "Apache Spark for Data Science Cookbook," you'll delve into solving real-world analytical challenges using the robust Apache Spark framework. This book features hands-on recipes that cover data analysis, distributed machine learning, and real-time data processing. You'll gain practical skills to process, visualize, and extract insights from large datasets efficiently. What this Book will help me do Master using Apache Spark for processing and analyzing large-scale datasets effectively. Harness Spark's MLLib for implementing machine learning algorithms like classification and clustering. Utilize libraries such as NumPy, SciPy, and Pandas in conjunction with Spark for numerical computations. Apply techniques like Natural Language Processing and text mining using Spark-integrated tools. Perform end-to-end data science workflows, including data exploration, modeling, and visualization. Author(s) Nagamallikarjuna Inelu and None Chitturi bring their extensive experience working with data science and distributed computing frameworks like Apache Spark. Nagamallikarjuna specializes in applying machine learning algorithms to big data problems, while None has contributed to various big data system implementations. Together, they focus on providing practitioners with practical and efficient solutions. Who is it for? This book is primarily intended for novice and intermediate data scientists and analysts who are curious about using Apache Spark to tackle data science problems. Readers are expected to have some familiarity with basic data science tasks. If you want to learn practical applications of Spark in data analysis and enhance your big data analytics skills, this resource is for you.
