talk-data.com

Event

O'Reilly Data Engineering Books

2001-10-19 – 2027-05-25 · O'Reilly

Activities tracked

299

Collection of O'Reilly books on Data Engineering.

Filtering by: Big Data

Sessions & talks

Showing 126–150 of 299 · Newest first

Usage-Driven Database Design: From Logical Data Modeling through Physical Schema Definition

Design great databases, from logical data modeling through physical schema definition. You will learn a framework that finally cracks the problem of merging data and process models into a meaningful and unified design that accounts for how data is actually used in production systems. Key to the framework is a method for merging the logical data model, a static view of the data's definition, with the process models that describe how the data will be used once a given system is implemented. The approach resolves the disconnect between the static definition of data in the logical data model and the dynamic flow of the data in the logical process models. The framework can be used to create operational databases for transaction processing systems, or data warehouses in support of decision support systems. The information manager can be a flat file, Oracle Database, IMS, NoSQL, Cassandra, Hadoop, or any other DBMS.

Usage-Driven Database Design emphasizes practical aspects of design and speaks to what works, what doesn't work, and what to avoid at all costs. Included are lessons learned by the author over his 30+ years in the corporate trenches. Everything in the book is grounded in good theory, yet demonstrates a professional and pragmatic approach to design that can come only from decades of experience. The book presents an end-to-end framework from logical data modeling through physical schema definition; includes lessons learned, techniques, and tricks that can turn a database disaster into a success; and applies to all types of database management systems, including NoSQL such as Cassandra and Hadoop, and mainstream SQL databases such as Oracle and SQL Server.

What You'll Learn
- Create logical data models that accurately reflect the real world of the user
- Create usage scenarios reflecting how applications will use a new database
- Merge static data models with dynamic process models to create resilient yet flexible database designs
- Support application requirements by creating responsive database schemas in any database architecture
- Cope with big data and unstructured data for transaction processing and decision support systems
- Recognize when relational approaches won't work, and when to turn toward NoSQL solutions such as Cassandra or Hadoop

Who This Book Is For
System developers, including business analysts, database designers, database administrators, and application designers and developers who must design or interact with database systems
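The book's central idea, letting observed usage drive the physical schema, can be given a small concrete flavor with Python's built-in sqlite3 module. This is only an illustrative sketch, not the author's framework; the table, column, and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Logical data model: a static definition of the Customer entity.
cur.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        email       TEXT NOT NULL,
        region      TEXT
    )
""")

# Usage scenario: suppose the order-entry screen looks customers up by email
# on every call. That dynamic usage, not the static model, is what justifies
# adding this index to the physical schema.
cur.execute("CREATE UNIQUE INDEX idx_customer_email ON customer(email)")

cur.executemany(
    "INSERT INTO customer (email, region) VALUES (?, ?)",
    [("ada@example.com", "EU"), ("grace@example.com", "US")],
)

# The hot query now resolves through the index instead of a full table scan.
plan = cur.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM customer WHERE email = ?",
    ("ada@example.com",),
).fetchall()
print(plan)
```

The same logical model with a different usage profile (say, nightly region-level reporting) would call for a different physical design, which is the book's point.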

Mastering Spark for Data Science

Master the techniques and sophisticated analytics used to construct Spark-based solutions that scale to deliver production-grade data science products.

About This Book
- Develop and apply advanced analytical techniques with Spark
- Learn how to tell a compelling story with data science using Spark's ecosystem
- Explore data at scale and work with cutting-edge data science methods

Who This Book Is For
This book is for those who have beginner-level familiarity with the Spark architecture and data science applications, especially those who are looking for a challenge and want to learn cutting-edge techniques. It assumes working knowledge of data science, common machine learning methods, and popular data science tools, and assumes you have previously run proof-of-concept studies and built prototypes.

What You Will Learn
- Learn the design patterns that integrate Spark into industrialized data science pipelines
- See how commercial data scientists design scalable and reusable code for data science services
- Explore cutting-edge data science methods so that you can study trends and causality
- Discover advanced programming techniques using RDD and the DataFrame and Dataset APIs
- Find out how Spark can be used as a universal ingestion engine and as a web scraper
- Practice the implementation of advanced topics in graph processing, such as community detection and contact chaining
- Get to know best practices for extended exploratory data analysis, commonly used in commercial data science teams
- Study advanced Spark concepts, solution design patterns, and integration architectures
- Demonstrate powerful data science pipelines

In Detail
Data science seeks to transform the world using data, and this is typically achieved through disrupting and changing real processes in real industries. To operate at this level you need to build data science solutions of substance: solutions that solve real problems. Spark has emerged as the big data platform of choice for data scientists due to its speed, scalability, and easy-to-use APIs. This book deep-dives into using Spark to deliver production-grade data science solutions, demonstrated by exploring the construction of a sophisticated global news analysis service that uses Spark to generate continuous geopolitical and current affairs insights. You will learn all about the core Spark APIs and take a comprehensive tour of advanced libraries, including Spark SQL, Spark Streaming, MLlib, and more. Focusing on a sequence of tutorials that deliver a working news intelligence service, you will learn about advanced Spark architectures, how to work with geographic data in Spark, and how to tune Spark algorithms so they scale linearly.

Style and approach
This is an advanced guide for those with beginner-level familiarity with the Spark architecture and data science applications. It is a practical tutorial that uses core Spark APIs and takes a deep dive into advanced libraries including Spark SQL, Spark Streaming, and MLlib. The book expands on titles such as Machine Learning with Spark and Learning Spark, and is the next step for those comfortable with Spark and looking to improve their skills.

Learning Apache Spark 2

Dive into the world of Big Data with "Learning Apache Spark 2". This book introduces you to the powerful Apache Spark framework, tailored for real-time data analytics and machine learning. Through practical examples and real-world use cases, you'll gain hands-on experience in leveraging Spark's capabilities for your data processing needs.

What this Book will help me do
- Master the fundamentals of Apache Spark 2 and its new features
- Effectively use Spark SQL, MLlib, RDDs, GraphX, and Spark Streaming to tackle real-world challenges
- Gain skills in data processing, transformation, and analysis with Spark
- Deploy and operate your Spark applications in clustered environments
- Develop your own recommendation engines and predictive analytics models with Spark

Author(s)
Abbasi brings a wealth of expertise in Big Data technologies with a keen focus on simplifying complex concepts for learners. With substantial experience working in data processing frameworks, their approach to teaching creates an engaging and practical learning experience. With "Learning Apache Spark 2", Abbasi empowers readers to confidently tackle challenges in Big Data processing and analytics.

Who is it for?
This book is ideal for aspiring Big Data professionals seeking an accessible introduction to Apache Spark. Beginners in Spark will find step-by-step guidance, while those familiar with earlier versions will appreciate the insights into Spark 2's new features. Familiarity with Big Data concepts and Scala programming is recommended for optimal understanding.

Understanding Metadata

One viable option for organizations looking to harness massive amounts of data is the data lake, a single repository for storing all the raw data, both structured and unstructured, that floods into the company. But that isn't the end of the story. The key to making a data lake work is data governance, using metadata to provide valuable context through tagging and cataloging. This practical report examines why metadata is essential for managing, migrating, accessing, and deploying any big data solution. Authors Federico Castanedo and Scott Gidley dive into the specifics of analyzing metadata for keeping track of your data—where it comes from, where it's located, and how it's being used—so you can provide safeguards and reduce risk. In the process, you'll learn about methods for automating metadata capture. This report also explains the main features of a data lake architecture, and discusses the pros and cons of several data lake management solutions that support metadata. These solutions include:
- Traditional data integration/management vendors such as the IBM Research Accelerated Discovery Lab
- Tooling from open source projects, including Teradata Kylo and Informatica
- Startups such as Trifacta and Zaloni that provide best-of-breed technology
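The tagging-and-cataloging idea the report describes can be pictured with a toy, in-memory catalog: each dataset landing in the lake is registered with metadata about where it came from, where it lives, and how it is tagged for governance. This sketch is not from the report; all names and paths are hypothetical.

```python
# A toy metadata catalog: dataset name -> provenance, location, governance tags.
catalog = {}

def register(name, source, location, tags):
    """Record a dataset's provenance, storage location, and governance tags."""
    catalog[name] = {"source": source, "location": location, "tags": set(tags)}

def find_by_tag(tag):
    """Return the names of all cataloged datasets carrying a given tag."""
    return sorted(n for n, meta in catalog.items() if tag in meta["tags"])

register("clickstream_raw", "web servers", "/lake/raw/clickstream", ["pii", "raw"])
register("sales_daily", "ERP export", "/lake/curated/sales", ["curated", "finance"])

# A governance question the report motivates: which datasets contain PII?
print(find_by_tag("pii"))
```

Real data lake tooling adds automated capture, lineage, and access control on top, but the lookup pattern is the same: questions about data are answered from metadata, not by scanning the data itself.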

Learning PySpark

"Learning PySpark" guides you through mastering the integration of Python with Apache Spark to build scalable and efficient data applications. You'll delve into Spark 2.0's architecture, efficiently process data, and explore PySpark's capabilities ranging from machine learning to structured streaming. By the end, you'll be equipped to craft and deploy robust data pipelines and applications.

What this Book will help me do
- Master the Spark 2.0 architecture and its Python integration with PySpark
- Leverage PySpark DataFrames and RDDs for effective data manipulation and analysis
- Develop scalable machine learning models using PySpark's ML and MLlib libraries
- Understand advanced PySpark features such as GraphFrames for graph processing and TensorFrames for deep learning models
- Gain expertise in deploying PySpark applications locally and on the cloud for production-ready solutions

Author(s)
Drabas and Lee bring extensive experience in data engineering and Python programming. They combine a practical, example-driven approach with deep insights into Apache Spark's ecosystem. Their expertise and clarity in writing make this book accessible for individuals aiming to excel in big data technologies with Python.

Who is it for?
This book is best suited for Python developers who want to integrate Apache Spark 2.0 into their workflow to process large-scale data. Ideal readers will have foundational knowledge of Python and seek to build scalable data-intensive applications using Spark, regardless of prior experience with Spark itself.
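Running PySpark requires a Spark runtime, so as a self-contained taste of the RDD-style chained transformations the book teaches, here is a plain-Python stand-in. This class only mimics the flavor of the API on local lists; it is not the PySpark API and involves no cluster.

```python
from functools import reduce

class LocalRDD:
    """A tiny local imitation of RDD-style chained transformations."""

    def __init__(self, data):
        self.data = list(data)

    def map(self, f):
        return LocalRDD(f(x) for x in self.data)

    def flat_map(self, f):
        return LocalRDD(y for x in self.data for y in f(x))

    def filter(self, pred):
        return LocalRDD(x for x in self.data if pred(x))

    def reduce(self, f):
        return reduce(f, self.data)

    def collect(self):
        return self.data

lines = ["spark makes big data simple", "python meets spark"]

# Count words longer than three characters, Spark-style: split, filter, sum.
total = (LocalRDD(lines)
         .flat_map(str.split)
         .filter(lambda w: len(w) > 3)
         .map(lambda w: 1)
         .reduce(lambda a, b: a + b))
print(total)
```

On a real cluster the same chain is written against `SparkContext.parallelize(...)`, with each transformation distributed across partitions instead of running in one process.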

Mastering Elasticsearch 5.x - Third Edition

This comprehensive guide dives deep into the functionalities of Elasticsearch 5, the widely used search and analytics engine. Leveraging the power of Apache Lucene, this book will help you understand advanced concepts like querying, indexing, and cluster management to build efficient and scalable search solutions.

What this Book will help me do
- Master advanced features of Elasticsearch such as text scoring, sharding, and aggregation
- Understand how to handle big data efficiently using Elasticsearch's architecture
- Learn practical implementation techniques for Elasticsearch features through hands-on examples
- Develop custom plugins for Elasticsearch to tailor its functionalities to specific needs
- Scale and optimize Elasticsearch clusters for high performance in production environments

Author(s)
Bharvi Dixit is an experienced software engineer and a recognized expert in implementing Elasticsearch solutions. With a strong background in distributed systems and database management, Bharvi's writing is informed by real-world experience and a focus on practical applications.

Who is it for?
This book is ideal for developers and data engineers with existing experience in Elasticsearch who wish to deepen their knowledge. It serves as a valuable resource for professionals tasked with creating scalable search applications. A working understanding of Elasticsearch basics and query DSL is recommended to fully benefit from this guide.

Big Data Now: 2016 Edition

Now in its sixth edition, O'Reilly's annual Big Data Now report recaps the trends, tools, applications, and forecasts we've examined throughout 2016. This collection of blog posts, authored by leading thinkers and experts in the field, reflects a unique set of themes we've identified as gaining significant attention and traction. Our list of topics for 2016 includes:
- Careers in data
- Tools and architecture for big data
- Intelligent real-time applications
- Cloud infrastructure
- Machine learning: models and training
- Deep learning and artificial intelligence

Geospatial Data and Analysis

Geospatial data, or data with location information, is generated in huge volumes every day by billions of mobile phones, IoT sensors, drones, nanosatellites, and many other sources in an unending stream. This practical ebook introduces you to the landscape of tools and methods for making sense of all that data, and shows you how to apply geospatial analytics to a variety of issues, large and small. Authors Aurelia Moser, Jon Bruner, and Bill Day provide a complete picture of the geospatial analysis options available, including low-scale commercial desktop GIS tools, medium-scale options such as PostGIS and Lucene-based searching, and true big data solutions built on technologies such as Hadoop. You'll learn when it makes sense to move from one type of solution to the next, taking increased costs and complexity into account.
- Explore the structure of basic webmaps, and the challenges and constraints involved when working with geo data
- Dive into low- to medium-scale mapping tools for use in backend and frontend web development
- Focus on tools for robust medium-scale geospatial projects that don't quite justify a big data solution
- Learn about innovative platforms and software packages for solving issues of processing and storage of large-scale data
- Examine geodata analysis use cases, including disaster relief, urban planning, and agriculture and environmental monitoring
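For a small, self-contained taste of what geospatial analytics computes at every scale, here is the haversine formula for great-circle distance between two latitude/longitude points. This is standard geodesy, not code from the ebook.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; adequate for rough distances

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points in decimal degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Roughly the London -> Paris distance.
d = haversine_km(51.5074, -0.1278, 48.8566, 2.3522)
print(round(d), "km")
```

Desktop GIS tools, PostGIS, and Hadoop-scale platforms all wrap distance and containment primitives like this one; the difference is how many billions of points they apply them to.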

Elasticsearch 5.x Cookbook - Third Edition

Elasticsearch 5.x Cookbook is a comprehensive guide that teaches you how to leverage the full power of Elasticsearch for high-performance search and analytics. Through step-by-step recipes, you'll explore deployment, query building, plugin integration, and advanced analytics, ensuring you can manage and scale Elasticsearch like a pro.

What this Book will help me do
- Understand and deploy complex Elasticsearch cluster topologies for optimal performance
- Create tailored mappings to gain finer control over data indexing and retrieval
- Design and execute advanced queries and analytics using Elasticsearch capabilities
- Integrate Elasticsearch with popular programming languages and big data platforms
- Monitor and improve Elasticsearch cluster health using best practices and tools

Author(s)
Alberto Paro is a seasoned software engineer and data scientist with extensive experience in distributed systems and search technologies. Having worked on numerous search-related projects, he brings practical, real-world insights to his writing. Alberto is passionate about teaching and simplifying complex concepts, making this book both approachable and expertly detailed.

Who is it for?
This book is ideal for developers or data engineers seeking to utilize Elasticsearch for advanced search and analytics tasks. If you have some prior knowledge of JSON and programming concepts, particularly Java, you will benefit most from this material. Whether you're looking to integrate Elasticsearch into your systems or to optimize its usage, this book caters to your needs.
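The query-building recipes revolve around Elasticsearch's JSON query DSL, and a query body can be composed as a plain Python dict before being sent to a cluster. No cluster or client library is assumed here; the dict below is only serialized, not executed, and the field names are hypothetical.

```python
import json

# A bool query combining a scored full-text clause with a non-scoring filter,
# a very common shape in the Elasticsearch query DSL.
query = {
    "query": {
        "bool": {
            "must": [{"match": {"title": "spark"}}],       # scored match
            "filter": [{"range": {"year": {"gte": 2016}}}],  # cached filter
        }
    },
    "size": 10,
}

body = json.dumps(query)
print(body)
```

Against a real cluster, this body would be POSTed to an index's `_search` endpoint (via any HTTP client or an official client library); composing it as a dict keeps the query testable without one.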

HBase High Performance Cookbook

"HBase High Performance Cookbook" is your guide to mastering the optimization, scaling, and tuning of HBase systems. Covering everything from configuring HBase clusters to designing scalable table structures and performance tuning, this comprehensive book provides practical advice and strategies for leveraging HBase's full potential. By following this book's recipes, you'll supercharge your HBase expertise.

What this Book will help me do
- Understand how to configure HBase for optimal performance, improving your data system's efficiency
- Learn to design table structures to maximize scalability and functionality in HBase
- Gain skills in performing CRUD operations and using advanced features like MapReduce within HBase
- Discover practices for integrating HBase with other technologies such as Elasticsearch
- Master the steps involved in setting up and optimizing HBase in cloud environments for enhanced performance

Author(s)
Ruchir Choudhry is a seasoned data management professional with extensive experience in distributed database systems. He possesses deep expertise in HBase, Hadoop, and other big data technologies. His practical and engaging writing style aims to demystify complex technical topics, making them accessible to developers and architects alike.

Who is it for?
This book is tailored for developers and system architects looking to deepen their understanding of HBase. Whether you are experienced with other NoSQL databases or are new to HBase, this book provides extensive practical knowledge. Ideal for professionals working in big data applications or those eager to optimize and scale their database systems effectively.

Pro Apache Phoenix: An SQL Driver for HBase, First Edition

Leverage Phoenix as an ANSI SQL engine built on top of the highly distributed and scalable NoSQL framework HBase. Learn the basics and best practices that are being adopted in Phoenix to enable high write and read throughput in a big data space. This book includes real-world cases such as Internet of Things devices that send continuous streams to Phoenix, and it explains how key features such as joins, indexes, transactions, and functions help you understand the simple, flexible, and powerful API that Phoenix provides. Examples are provided using real-time data and data-driven businesses that show you how to collect, analyze, and act in seconds. Pro Apache Phoenix covers the nuances of setting up a distributed HBase cluster with Phoenix libraries, running performance benchmarks, configuring parameters for production scenarios, and viewing the results. The book also shows how Phoenix plays well with other key frameworks in the Hadoop ecosystem such as Apache Spark, Pig, Flume, and Sqoop.

You will learn how to:
- Handle a petabyte data store by applying familiar SQL techniques
- Store, analyze, and manipulate data in a NoSQL Hadoop ecosystem with HBase
- Apply best practices while working with a scalable data store on Hadoop and HBase
- Integrate popular frameworks (Apache Spark, Pig, Flume) to simplify big data analysis
- Demonstrate real-time use cases and big data modeling techniques

Who This Book Is For
Data engineers, Big Data administrators, and architects

Apache Spark for Data Science Cookbook

In "Apache Spark for Data Science Cookbook," you'll delve into solving real-world analytical challenges using the robust Apache Spark framework. This book features hands-on recipes that cover data analysis, distributed machine learning, and real-time data processing. You'll gain practical skills to process, visualize, and extract insights from large datasets efficiently.

What this Book will help me do
- Master using Apache Spark for processing and analyzing large-scale datasets effectively
- Harness Spark's MLlib for implementing machine learning algorithms like classification and clustering
- Utilize libraries such as NumPy, SciPy, and Pandas in conjunction with Spark for numerical computations
- Apply techniques like natural language processing and text mining using Spark-integrated tools
- Perform end-to-end data science workflows, including data exploration, modeling, and visualization

Author(s)
Nagamallikarjuna Inelu and Chitturi bring their extensive experience working with data science and distributed computing frameworks like Apache Spark. Nagamallikarjuna specializes in applying machine learning algorithms to big data problems, while Chitturi has contributed to various big data system implementations. Together, they focus on providing practitioners with practical and efficient solutions.

Who is it for?
This book is primarily intended for novice and intermediate data scientists and analysts who are curious about using Apache Spark to tackle data science problems. Readers are expected to have some familiarity with basic data science tasks. If you want to learn practical applications of Spark in data analysis and enhance your big data analytics skills, this resource is for you.

Fast Data Processing Systems with SMACK Stack

Fast Data Processing Systems with SMACK Stack introduces you to the SMACK stack: a combination of Spark, Mesos, Akka, Cassandra, and Kafka. You will learn to integrate these technologies to build scalable, efficient, and real-time data processing platforms tailored for solving critical business challenges.

What this Book will help me do
- Understand the concepts of fast data pipelines and design scalable architectures using the SMACK stack
- Gain expertise in functional programming with Scala and leverage its power in data processing tasks
- Build and optimize distributed databases using Apache Cassandra for scaling extensively
- Deploy and manage real-time data streams using Apache Kafka to handle massive messaging workloads
- Implement cost-effective cluster infrastructures with Apache Mesos for efficient resource utilization

Author(s)
Estrada is an expert in distributed systems and big data technologies. With years of experience implementing SMACK-based solutions across industries, Estrada offers a practical viewpoint on designing scalable systems. Their blend of theoretical knowledge and applied practices ensures readers receive actionable guidance.

Who is it for?
This book is perfect for software developers, data engineers, or data scientists looking to deepen their understanding of real-time data processing systems. If you have a foundational knowledge of the technologies in the SMACK stack or wish to learn how to combine these cutting-edge tools to solve complex problems, this is for you. Readers with an interest in building efficient big data solutions will find tremendous value here.

Practical Data Science with Hadoop® and Spark: Designing and Building Effective Analytics at Scale

The Complete Guide to Data Science with Hadoop—For Technical Professionals, Businesspeople, and Students

Demand is soaring for professionals who can solve real data science problems with Hadoop and Spark. Practical Data Science with Hadoop® and Spark is your complete guide to doing just that. Drawing on immense experience with Hadoop and big data, three leading experts bring together everything you need: high-level concepts, deep-dive techniques, real-world use cases, practical applications, and hands-on tutorials.

The authors introduce the essentials of data science and the modern Hadoop ecosystem, explaining how Hadoop and Spark have evolved into an effective platform for solving data science problems at scale. In addition to comprehensive application coverage, the authors also provide useful guidance on the important steps of data ingestion, data munging, and visualization. Once the groundwork is in place, the authors focus on specific applications, including machine learning, predictive modeling for sentiment analysis, clustering for document analysis, anomaly detection, and natural language processing (NLP). This guide provides a strong technical foundation for those who want to do practical data science, and also presents business-driven guidance on how to apply Hadoop and Spark to optimize the ROI of data science initiatives.

Learn
- What data science is, how it has evolved, and how to plan a data science career
- How data volume, variety, and velocity shape data science use cases
- Hadoop and its ecosystem, including HDFS, MapReduce, YARN, and Spark
- Data importation with Hive and Spark
- Data quality, preprocessing, preparation, and modeling
- Visualization: surfacing insights from huge data sets
- Machine learning: classification, regression, clustering, and anomaly detection
- Algorithms and Hadoop tools for predictive modeling
- Cluster analysis and similarity functions
- Large-scale anomaly detection
- NLP: applying data science to human language
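Anomaly detection, one of the applications the book covers, can be illustrated at its very simplest without Hadoop or Spark: flag values that sit far from the mean in standard-deviation units. This z-score sketch is a deliberately tiny stand-in for the book's cluster-scale pipelines, and the sensor readings are made up.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

readings = [10.1, 9.8, 10.0, 10.2, 9.9, 10.1, 42.0]

# On a tiny sample, one extreme point inflates the standard deviation itself,
# so a looser threshold is used here; robust statistics (median/MAD) handle
# this masking effect better at scale.
print(zscore_outliers(readings, threshold=2.0))
```

A Hadoop or Spark job applies the same idea distributively: compute the mean and deviation across partitions, then filter the full dataset against the threshold.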

The Big Data Transformation

Business executives today are well aware of the power of data, especially for gaining actionable insight into products and services. But how do you jump into the big data analytics game without spending millions on data warehouse solutions you don't need? This 40-page report focuses on massively parallel processing (MPP) analytical databases that enable you to run queries and dashboards on a variety of business metrics at extreme speed and exabyte scale. Because they leverage the full computational power of a cluster, MPP analytical databases can analyze massive volumes of data—both structured and semi-structured—at unprecedented speeds. This report presents five real-world case studies from Etsy, Cerner Corporation, Criteo, and other global enterprises to focus on one big data analytics platform in particular, HPE Vertica. You'll discover:
- How one prominent data storage company convinced both business and tech stakeholders to adopt an MPP analytical database
- Why performance marketing technology company Criteo used a Center of Excellence (CoE) model to ensure the success of its big data analytics endeavors
- How YPSM uses Vertica to speed up its Hadoop-based data processing environment
- Why Cerner adopted an analytical database to scale its highly successful health information technology platform
- How Etsy drives success with the company's big data initiative by avoiding common technical and organizational mistakes

Beginning Hibernate: For Hibernate 5

Get started with the Hibernate 5 persistence layer and gain a clear introduction to the current standard for object-relational persistence in Java. This updated edition includes the new Hibernate 5.0 framework as well as coverage of NoSQL, MongoDB, and other related technologies, ranging from applications to big data. Beginning Hibernate is ideal if you're experienced in Java with databases (the traditional, or connected, approach), but new to open-source, lightweight Hibernate. The book keeps its focus on Hibernate without wasting time on nonessential third-party tools, so you'll be able to immediately start building transaction-based engines and applications. Experienced authors Joseph Ottinger with Dave Minter and Jeff Linwood provide more in-depth examples than any other book for Hibernate beginners. They present their material in a lively, example-based manner—not a dry, theoretical, hard-to-read fashion.

What You'll Learn
- Build enterprise Java-based transaction-type applications that access complex data with Hibernate
- Work with Hibernate 5 using a present-day build process
- Use Java 8 features with Hibernate
- Integrate into the persistence life cycle
- Map using Java's annotations
- Search and query with the new version of Hibernate
- Integrate with MongoDB using NoSQL
- Keep track of versioned data with Hibernate Envers

Who This Book Is For
Experienced Java developers interested in learning how to use and apply object-relational persistence in Java and who are new to the Hibernate persistence framework.

Oracle R Enterprise: Harnessing the Power of R in Oracle Database

Master the Big Data Capabilities of Oracle R Enterprise

Effectively manage your enterprise's big data and keep complex processes running smoothly using the hands-on information contained in this Oracle Press guide. Oracle R Enterprise: Harnessing the Power of R in Oracle Database shows, step by step, how to create and execute large-scale predictive analytics and maintain superior performance. Discover how to explore and prepare your data, accurately model business processes, generate sophisticated graphics, and write and deploy powerful scripts. You will also find out how to effectively incorporate Oracle R Enterprise features in APEX applications, OBIEE dashboards, and Apache Hadoop systems.

Learn to:
- Install, configure, and administer Oracle R Enterprise
- Establish connections and move data to the database
- Create Oracle R Enterprise packages and functions
- Use the R language to work with data in Oracle Database
- Build models using ODM, ORE, and other algorithms
- Develop and deploy R scripts and use the R script repository
- Execute embedded R scripts and employ ORE SQL API functions
- Map and manipulate data using Oracle R Advanced Analytics for Hadoop
- Use ORE in Oracle Data Miner, OBIEE, and other applications

Spark in Action

Spark in Action teaches you the theory and skills you need to effectively handle batch and streaming data using Spark. Fully updated for Spark 2.0.

About the Technology
Big data systems distribute datasets across clusters of machines, making it a challenge to efficiently query, stream, and interpret them. Spark can help. It is a processing system designed specifically for distributed data. It provides easy-to-use interfaces, along with the performance you need for production-quality analytics and machine learning. Spark 2 also adds improved programming APIs, better performance, and countless other upgrades.

About the Book
Spark in Action teaches you the theory and skills you need to effectively handle batch and streaming data using Spark. You'll get comfortable with the Spark CLI as you work through a few introductory examples. Then, you'll start programming Spark using its core APIs. Along the way, you'll work with structured data using Spark SQL, process near-real-time streaming data, apply machine learning algorithms, and munge graph data using Spark GraphX. For a zero-effort startup, you can download the preconfigured virtual machine ready for you to try the book's code.

What's Inside
- Updated for Spark 2.0
- Real-life case studies
- Spark DevOps with Docker
- Examples in Scala, and online in Java and Python

About the Reader
Written for experienced programmers with some background in big data or machine learning.

About the Authors
Petar Zečević and Marko Bonaći are seasoned developers heavily involved in the Spark community.

Quotes
"Dig in and get your hands dirty with one of the hottest data processing engines today. A great guide." - Jonathan Sharley, Pandora Media
"Must-have! Speed up your learning of Spark as a distributed computing framework." - Robert Ormandi, Yahoo!
"An easy-to-follow, step-by-step guide." - Gaurav Bhardwaj, 3Pillar Global
"An ambitiously comprehensive overview of Spark and its diverse ecosystem." - Jonathan Miller, Optensity

Fast Data Processing with Spark 2 - Third Edition

Fast Data Processing with Spark 2 takes you through the essentials of leveraging Spark for big data analysis. You will learn how to install and set up Spark, handle data using its APIs, and apply advanced functionality like machine learning and graph processing. By the end of the book, you will be well equipped to use Spark in real-world data processing tasks.

What this Book will help me do
- Install and configure Apache Spark for optimal performance
- Interact with distributed datasets using the resilient distributed dataset (RDD) API
- Leverage the flexibility of the DataFrame API for efficient big data analytics
- Apply machine learning models using Spark MLlib to solve complex problems
- Perform graph analysis using GraphX to uncover structural insights in data

Author(s)
Krishna Sankar is an experienced data scientist and thought leader in big data technologies. With a deep understanding of machine learning, distributed systems, and Apache Spark, Krishna has guided numerous projects in data engineering and big data processing. Matei Zaharia, the co-author, is also widely recognized in the field of distributed systems and cloud computing, contributing to Apache Spark development.

Who is it for?
This book is catered to software developers and data engineers with a foundational understanding of Scala or Java programming. A beginner to medium-level understanding of big data processing concepts is recommended for readers. If you aspire to solve big data problems using scalable distributed computing frameworks, this book is perfect for you. By the end, you will be confident in building Spark-powered applications and analyzing data efficiently.

Fast Data Architectures for Streaming Applications

Why have stream-oriented data systems become so popular, when batch-oriented systems have served big data needs for many years? In this report, author Dean Wampler examines the rise of streaming systems for handling time-sensitive problems—such as detecting fraudulent financial activity as it happens. You’ll explore the characteristics of fast data architectures, along with several open source tools for implementing them. Batch-mode processing isn’t going away, but exclusive use of these systems is now a competitive disadvantage. You’ll learn that, while fast data architectures are much harder to build, they represent the state of the art for dealing with mountains of data that require immediate attention. Learn step-by-step how a basic fast data architecture works Understand why event logs are the core abstraction for streaming architectures, while message queues are the core integration tool Use methods for analyzing infinite data sets, where you don’t have all the data and never will Take a tour of open source streaming engines, and discover which ones work best for different use cases Get recommendations for making real-world streaming systems responsive, resilient, elastic, and message driven Explore an example streaming application for the IoT: telemetry ingestion and anomaly detection for home automation systems
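The report's closing IoT example (telemetry ingestion with anomaly detection) can be sketched at its simplest as a rolling-statistics check over an unbounded stream. This toy Python version stands in for a real streaming engine; the window size, threshold, and sample readings are all illustrative:

```python
from collections import deque

def detect_anomalies(stream, window=5, threshold=3.0):
    """Yield readings that deviate from the rolling mean by more than
    `threshold` standard deviations of the recent window."""
    recent = deque(maxlen=window)
    for reading in stream:
        if len(recent) == window:
            mean = sum(recent) / window
            variance = sum((x - mean) ** 2 for x in recent) / window
            std = variance ** 0.5
            if std > 0 and abs(reading - mean) > threshold * std:
                yield reading  # anomalous relative to recent telemetry
        recent.append(reading)

# Hypothetical home-automation temperature readings with one spike
temps = [20.1, 20.3, 19.9, 20.2, 20.0, 35.0, 20.1]
anomalies = list(detect_anomalies(temps))
# anomalies == [35.0]
```

A production system would run this logic per device key inside a streaming engine, with the event log (e.g., Kafka) feeding the stream, rather than over an in-memory list.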

Practical Hadoop Ecosystem: A Definitive Guide to Hadoop-Related Frameworks and Tools

Learn how to use the Apache Hadoop projects, including MapReduce, HDFS, Apache Hive, Apache HBase, Apache Kafka, Apache Mahout, and Apache Solr. From setting up the environment to running sample applications, each chapter in this book is a practical tutorial on using an Apache Hadoop ecosystem project. While several books on Apache Hadoop are available, most are based on the main projects, MapReduce and HDFS, and none discusses the other Apache Hadoop ecosystem projects and how they all work together as a cohesive big data development platform. What You Will Learn: Set up the environment in Linux for Hadoop projects using Cloudera Hadoop Distribution CDH 5 Run a MapReduce job Store data with Apache Hive and Apache HBase Index data in HDFS with Apache Solr Develop a Kafka messaging system Stream logs to HDFS with Apache Flume Transfer data from a MySQL database to Hive, HDFS, and HBase with Sqoop Create a Hive table over Apache Solr Develop a Mahout user recommender system Who This Book Is For: Apache Hadoop developers. Prerequisite knowledge of Linux and some knowledge of Hadoop is required.
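The MapReduce job mentioned above follows the classic map → shuffle → reduce shape. As an illustration of the paradigm only (plain Python standing in for Hadoop's Java API), word count looks like:

```python
from collections import defaultdict

def map_phase(records):
    # Mapper: emit a (word, 1) pair for every word, like a Hadoop Mapper
    for record in records:
        for word in record.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle/sort: group values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_phase(groups):
    # Reducer: sum the grouped counts for each word
    return {key: sum(values) for key, values in groups}

docs = ["hadoop stores data", "hadoop processes data"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts == {"hadoop": 2, "stores": 1, "data": 2, "processes": 1}
```

In Hadoop itself, the map and reduce functions run as distributed tasks over HDFS blocks and the shuffle is handled by the framework; the three-phase structure is the same.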

Hadoop Blueprints

"Hadoop Blueprints" guides you through using Hadoop and its ecosystem to solve real-life business problems. You will explore six case studies covering areas like fraud detection, marketing analysis, and data lakes, providing a thorough and practical understanding of Hadoop applications. What this Book will help me do Understand how to use Hadoop to solve real-life business scenarios effectively. Learn to build a 360-degree customer view integrating different data types. Develop and deploy a fraud detection system leveraging Hadoop technologies. Explore marketing campaign analysis and improvement using data-driven workflows on Hadoop. Gain hands-on experience with creating and maintaining efficient data lakes. Author(s) Anurag Shrivastava and Tanmay Deshpande bring extensive experience in big data technologies. They have been involved in developing solutions utilizing Hadoop, Apache Spark, and other ecosystem components. Their practical approach to presenting complex technical topics ensures readers can apply their knowledge to real-world scenarios. Who is it for? This book is ideal for software developers, data engineers, and IT professionals who have a foundational understanding of Hadoop and seek to expand their practical skills. Readers should be familiar with Java or other scripting languages. It's perfect for those aiming to build actionable solutions for business problems using Big Data technologies.

Practical Oracle E-Business Suite: An Implementation and Management Guide

Learn to build and implement a robust Oracle E-Business Suite system using the new release, EBS 12.2. This hands-on, real-world guide explains the rationale for using an Oracle E-Business Suite environment in a business enterprise and covers the major technology stack changes from EBS version 11i through R12.2. You will learn to build up an EBS environment from a simple single-node installation to a complex, highly available multi-node setup. Practical Oracle E-Business Suite focuses on release R12.2, but key areas in R12.1 are also covered wherever necessary. Detailed instructions are provided for the installation of EBS R12.2 in single- and multi-node configurations, the logic and methodology used in EBS patching, and cloning of EBS single-node and complex multi-node environments configured with RAC. This book also provides information on FMW used in EBS 12.2, as well as performance tuning and EBS 12.2 on engineered system implementations. Understand Oracle EBS software and the underlying technology stack components Install/configure Oracle E-Business Suite R12.2 in simple and complex HA setups Manage Oracle EBS 12.2 Use online patching (adop) for installation of Oracle EBS patches Clone an EBS environment in simple and complex configurations Tune Oracle EBS performance in all layers (application/DB/OS/network) Secure E-Business Suite R12.2 Who This Book Is For: Oracle database administrators, applications DBAs, and system administrators responsible for installing, patching, cloning, tuning, and securing Oracle E-Business Suite R12.2 environments

Spark for Data Science

Explore how to leverage Apache Spark for efficient big data analytics and machine learning solutions in "Spark for Data Science". This detailed guide provides you with the skills to process massive datasets, perform data analytics, and build predictive models using Spark's powerful tools like RDDs, DataFrames, and Datasets. What this Book will help me do Gain expertise in data processing and transformation with Spark. Perform advanced statistical analysis to uncover insights. Master machine learning techniques to create predictive models using Spark. Utilize Spark's APIs to process and visualize big data. Build scalable and efficient data science solutions. Author(s) This book is co-authored by Bikramaditya Singhal and Srinivas Duvvuri, both accomplished data scientists with extensive experience in Apache Spark and big data technologies. They bring their practical industry expertise to explain complex topics in a straightforward manner. Their writing emphasizes real-world applications and step-by-step procedural guidance, making this a valuable resource for learners. Who is it for? This book is ideally suited for technologists seeking to incorporate data science capabilities into their work with Apache Spark, data scientists interested in machine learning algorithms implemented in Spark, and beginners aiming to step into the field of big data analytics. Whether you are familiar with Spark or completely new, this book offers valuable insights and practical knowledge.
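The predictive models described above ultimately come down to fitting parameters to data. As a minimal illustration of that idea (ordinary least squares written in plain Python, not Spark MLlib code; the sample data is made up):

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b, the same model family
    a distributed linear regression fits at cluster scale."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by variance of x
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x  # intercept from the means
    return a, b

# Toy dataset lying exactly on y = 2x
a, b = fit_line([1, 2, 3, 4], [2, 4, 6, 8])
# a == 2.0, b == 0.0
```

In Spark the same fit would be expressed over a DataFrame of features and labels, with the sums computed as distributed aggregations instead of Python loops.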

Big Data SMACK: A Guide to Apache Spark, Mesos, Akka, Cassandra, and Kafka

Learn how to integrate full-stack open source big data architecture and to choose the correct technology—Scala/Spark, Mesos, Akka, Cassandra, and Kafka—in every layer. Big data architecture is becoming a requirement for many different enterprises. So far, however, the focus has largely been on collecting, aggregating, and crunching large data sets in a timely manner. In many cases now, organizations need more than one paradigm to perform efficient analyses. Big Data SMACK explains each of the full-stack technologies and, more importantly, how to best integrate them. It provides detailed coverage of the practical benefits of these technologies and incorporates real-world examples in every situation. The book focuses on the problems and scenarios solved by the architecture, as well as the solutions provided by every technology. It covers the six main concepts of big data architecture and how to integrate, replace, and reinforce every layer: The language: Scala The engine: Spark (SQL, MLlib, Streaming, GraphX) The container: Mesos, Docker The view: Akka The storage: Cassandra The message broker: Kafka What You Will Learn: Build a big data architecture without using complex Greek letter architectures Build a cheap but effective cluster infrastructure Make the queries, reports, and graphs that the business demands Manage and exploit unstructured and NoSQL data sources Use tools to monitor the performance of your architecture Integrate all technologies and decide which ones to replace and which ones to reinforce Who This Book Is For: Developers, data architects, and data scientists looking to integrate the most successful big data open stack architecture and to choose the correct technology in every layer