O'Reilly Data Engineering Books

Mastering Spark for Data Science

2017-03-29 O'Reilly Amazon

book

Matthew Hallett , David George , Antoine Amend , Andrew Morgan

data data-engineering apache-spark AI/ML Analytics API

Master the techniques and sophisticated analytics used to construct Spark-based solutions that scale to deliver production-grade data science products About This Book Develop and apply advanced analytical techniques with Spark Learn how to tell a compelling story with data science using Spark’s ecosystem Explore data at scale and work with cutting edge data science methods Who This Book Is For This book is for those who have beginner-level familiarity with the Spark architecture and data science applications, especially those who are looking for a challenge and want to learn cutting edge techniques. This book assumes working knowledge of data science, common machine learning methods, and popular data science tools, and assumes you have previously run proof of concept studies and built prototypes. What You Will Learn Learn the design patterns that integrate Spark into industrialized data science pipelines See how commercial data scientists design scalable code and reusable code for data science services Explore cutting edge data science methods so that you can study trends and causality Discover advanced programming techniques using RDD and the DataFrame and Dataset APIs Find out how Spark can be used as a universal ingestion engine tool and as a web scraper Practice the implementation of advanced topics in graph processing, such as community detection and contact chaining Get to know the best practices when performing Extended Exploratory Data Analysis, commonly used in commercial data science teams Study advanced Spark concepts, solution design patterns, and integration architectures Demonstrate powerful data science pipelines In Detail Data science seeks to transform the world using data, and this is typically achieved through disrupting and changing real processes in real industries. In order to operate at this level you need to build data science solutions of substance –solutions that solve real problems. Spark has emerged as the big data platform of choice for data scientists due to its speed, scalability, and easy-to-use APIs. This book deep dives into using Spark to deliver production-grade data science solutions. This process is demonstrated by exploring the construction of a sophisticated global news analysis service that uses Spark to generate continuous geopolitical and current affairs insights.You will learn all about the core Spark APIs and take a comprehensive tour of advanced libraries, including Spark SQL, Spark Streaming, MLlib, and more. You will be introduced to advanced techniques and methods that will help you to construct commercial-grade data products. Focusing on a sequence of tutorials that deliver a working news intelligence service, you will learn about advanced Spark architectures, how to work with geographic data in Spark, and how to tune Spark algorithms so they scale linearly. Style and approach This is an advanced guide for those with beginner-level familiarity with the Spark architecture and working with Data Science applications. Mastering Spark for Data Science is a practical tutorial that uses core Spark APIs and takes a deep dive into advanced libraries including: Spark SQL, visual streaming, and MLlib. This book expands on titles like: Machine Learning with Spark and Learning Spark. It is the next learning curve for those comfortable with Spark and looking to improve their skills.

Learning Apache Spark 2

2017-03-28 O'Reilly Amazon

book

Muhammad Asif Abbasi

data data-engineering apache-spark AI/ML Analytics Big Data

Dive into the world of Big Data with "Learning Apache Spark 2". This book introduces you to the powerful Apache Spark framework, tailored for real-time data analytics and machine learning. Through practical examples and real-world use-cases, you'll gain hands-on experience in leveraging Spark's capabilities for your data processing needs. What this Book will help me do Master the fundamentals of Apache Spark 2 and its new features. Effectively use Spark SQL, MLlib, RDDs, GraphX, and Spark Streaming to tackle real-world challenges. Gain skills in data processing, transformation, and analysis with Spark. Deploy and operate your Spark applications in clustered environments. Develop your own recommendation engines and predictive analytics models with Spark. Author(s) None Abbasi brings a wealth of expertise in Big Data technologies with a keen focus on simplifying complex concepts for learners. With substantial experience working in data processing frameworks, their approach to teaching creates an engaging and practical learning experience. With "Learning Apache Spark 2", None empowers readers to confidently tackle challenges in Big Data processing and analytics. Who is it for? This book is ideal for aspiring Big Data professionals seeking an accessible introduction to Apache Spark. Beginners in Spark will find step-by-step guidance, while those familiar with earlier versions will appreciate the insights into Spark 2's new features. Familiarity with Big Data concepts and Scala programming is recommended for optimal understanding.

SQL Server 2016 Developer's Guide

2017-03-22 O'Reilly Amazon

book

Milo≈° Radivojeviƒá , William Durkin , Dejan Sarka

data data-engineering relational-databases microsoft-sql-server Analytics JSON

SQL Server 2016 Developer's Guide provides an in-depth overview of the new features and enhancements introduced in SQL Server 2016 that can significantly improve your development process. This book covers robust techniques for building high-performance, secure database applications while leveraging cutting-edge functionalities such as Stretch Database, temporal tables, and enhanced In-Memory OLTP capabilities. What this Book will help me do Master the new development features introduced in SQL Server 2016 and understand their applications. Use In-Memory OLTP enhancements to significantly boost application performance. Efficiently manage and analyze data using temporal tables and JSON integration. Explore SQL Server security enhancements to ensure data safety and access control. Gain insights into integrating R with SQL Server 2016 for advanced analytics. Author(s) None Radivojević, Dejan Sarka, and William Durkin are experienced database developers and architects with a strong focus on SQL Server technologies. They bring years of practical experience and a clear, insightful approach to teaching complex concepts. Their expertise shines in this comprehensive guide, providing readers with both foundational knowledge and advanced techniques. Who is it for? This guide is perfect for database developers and solution architects looking to harness the full potential of SQL Server 2016's new features. It's intended for professionals with prior experience in SQL Server or similar platforms who aim to develop efficient, high-performance applications. You'll benefit from this book if you are keen to master SQL Server 2016 and elevate your development skills.

Mastering Elastic Stack

2017-02-28 O'Reilly Amazon

book

Ravi Kumar Gupta , Yuvraj Gupta

data data-engineering search elasticsearch elastic-stack-elk-stack elastic stack (elk stack)

Mastering Elastic Stack is your complete guide to advancing your data analytics expertise using the ELK Stack. With detailed coverage of Elasticsearch, Logstash, Kibana, Beats, and X-Pack, this book equips you with the skills to process and analyze any type of data efficiently. Through practical examples and real-world scenarios, you'll gain the ability to build end-to-end pipelines and create insightful dashboards. What this Book will help me do Build and manage log pipelines using Logstash, Beats, and Elasticsearch for real-time analytics. Develop advanced Kibana dashboards to visualize and interpret complex datasets. Efficiently utilize X-Pack features for alerting, monitoring, and security in the Elastic Stack. Master plugin customization and deployment for a tailored Elastic Stack environment. Apply Elastic Stack solutions to real-world cases for centralized logging and actionable insights. Author(s) The authors, None Kumar Gupta and None Gupta, are experienced technologists who have spent years working at the forefront of data processing and analytics. They are well-versed in Elasticsearch, Logstash, Kibana, and the Elastic ecosystem, having worked extensively in enterprise environments where these tools have transformed operations. Their passion for teaching and thorough understanding of the tools culminate in this comprehensive resource. Who is it for? The ideal reader is a developer already familiar with Elasticsearch, Logstash, and Kibana who wants to deepen their understanding of the stack. If you're involved in creating scalable data pipelines, analyzing complex datasets, or looking to implement centralized logging solutions in your work, this book is an excellent resource. It bridges the gap from intermediate to expert knowledge, allowing you to use the Elastic Stack effectively in various scenarios. Whether you are transitioning from a beginner or enhancing your skill set, this book meets your needs.

Mastering Elasticsearch 5.x - Third Edition

2017-02-21 O'Reilly Amazon

book

Bharvi Dixit

data data-engineering search elasticsearch Analytics Big Data

This comprehensive guide dives deep into the functionalities of Elasticsearch 5, the widely-used search and analytics engine. Leveraging the power of Apache Lucene, this book will help you understand advanced concepts like querying, indexing, and cluster management to build efficient and scalable search solutions. What this Book will help me do Master advanced features of Elasticsearch such as text scoring, sharding, and aggregation. Understand how to handle big data efficiently using Elasticsearch's architecture. Learn practical implementation techniques for Elasticsearch features through hands-on examples. Develop custom plugins for Elasticsearch to tailor its functionalities to specific needs. Scale and optimize Elasticsearch clusters for high performance in production environments. Author(s) Bharvi Dixit is an experienced software engineer and a recognized expert in implementing Elasticsearch solutions. With a strong background in distributed systems and database management, Bharvi's writing is informed by real-world experience and a focus on practical applications. Who is it for? This book is ideal for developers and data engineers with existing experience in Elasticsearch who wish to deepen their knowledge. It serves as a valuable resource for professionals tasked with creating scalable search applications. A working understanding of Elasticsearch basics and query DSL is recommended to fully benefit from this guide.

Geospatial Data and Analysis

2017-02-15 O'Reilly Amazon

book

Jon Bruner , Bill Day , Aurelia Moser

data data-engineering location-data geographic-information-system-gis geographic information system (gis) Analytics

Geospatial data, or data with location information, is generated in huge volumes every day by billions of mobile phones, IoT sensors, drones, nanosatellites, and many other sources in an unending stream. This practical ebook introduces you to the landscape of tools and methods for making sense of all that data, and shows you how to apply geospatial analytics to a variety of issues, large and small. Authors Aurelia Moser, Jon Bruner, and Bill Day provide a complete picture of the geospatial analysis options available, including low-scale commercial desktop GIS tools, medium-scale options such as PostGIS and Lucene-based searching, and true big data solutions built on technologies such as Hadoop. You’ll learn when it makes sense to move from one type of solution to the next, taking increased costs and complexity into account. Explore the structure of basic webmaps, and the challenges and constraints involved when working with geo data Dive into low- to medium-scale mapping tools for use in backend and frontend web development Focus on tools for robust medium-scale geospatial projects that don’t quite justify a big data solution Learn about innovative platforms and software packages for solving issues of processing and storage of large-scale data Examine geodata analysis use cases, including disaster relief, urban planning, and agriculture and environmental monitoring

Elasticsearch 5.x Cookbook - Third Edition

2017-02-06 O'Reilly Amazon

book

Alberto Paro

data data-engineering search elasticsearch Analytics Big Data

Elasticsearch 5.x Cookbook is a comprehensive guide that teaches you how to leverage the full power of Elasticsearch for high-performance search and analytics. Through step-by-step recipes, you'll explore deployment, query building, plugin integration, and advanced analytics, ensuring you can manage and scale Elasticsearch like a pro. What this Book will help me do Understand and deploy complex Elasticsearch cluster topologies for optimal performance. Create tailored mappings to gain finer control over data indexing and retrieval. Design and execute advanced queries and analytics using Elasticsearch capabilities. Integrate Elasticsearch with popular programming languages and big data platforms. Monitor and improve Elasticsearch cluster health using the best practices and tools. Author(s) Alberto Paro is a seasoned software engineer and data scientist with extensive experience in distributed systems and search technologies. Having worked on numerous search-related projects, he brings practical, real-world insights to his writing. Alberto is passionate about teaching and simplifying complex concepts, making this book both approachable and expertly detailed. Who is it for? This book is ideal for developers or data engineers seeking to utilize Elasticsearch for advanced search and analytics tasks. If you have some prior knowledge of JSON and programming concepts, particularly Java, you will benefit most from this material. Whether you're looking to integrate Elasticsearch into your systems or to optimize its usage, this book caters to your needs.

Tabular Modeling with SQL Server 2016 Analysis Services Cookbook

2017-01-30 O'Reilly Amazon

book

Derek Wilson

data data-engineering relational-databases microsoft-sql-server Analytics BI

With "Tabular Modeling with SQL Server 2016 Analysis Services Cookbook," you'll discover how to harness the full potential of the latest Tabular models in SQL Server Analysis Services (SSAS). This practical guide equips data professionals with the tools, techniques, and knowledge to optimize data analytics and deliver fast, reliable, and impactful business insights. What this Book will help me do Understand the fundamentals of Tabular modeling and its advantages over traditional methods. Use SQL Server 2016 SSAS features to build and deploy Tabular models tailored to business needs. Master DAX for creating powerful calculated fields and optimized measures. Administer and secure your models effectively, ensuring robust BI solutions. Optimize performance and explore advanced features in Tabular solutions for maximum efficiency. Author(s) None Wilson is an experienced SQL BI professional with a strong background in database modeling and analytics. With years of hands-on experience in developing BI solutions, Wilson takes a practical and straightforward teaching approach. Their guidance in this book makes the complex topics of Tabular modeling and SSAS accessible to both seasoned professionals and newcomers to the field. Who is it for? This book is tailored for SQL BI professionals, database architects, and data analysts aiming to leverage Tabular models in SQL Server Analysis Services. It caters to those familiar with database management and basic BI concepts who are eager to improve their analysis solutions. It's a valuable resource if you aim to gain expertise in using tabular modeling for business intelligence.

IBM DS8880 Architecture and Implementation (Release 8.2.1)

2017-01-25 O'Reilly Amazon

book

Bjoern Wesselbaum , Kerstin Blum , Peter Kimmel , Andre Coelho , Sherry Brunson , Bert Dufrasne , Jeffery Cook

data data-engineering IBM Analytics

This IBM® Redbooks® publication describes the concepts, architecture, and implementation of the IBM DS8880 family. The book provides reference information to assist readers who need to plan for, install, and configure the DS8880 systems. The IBM DS8000® family is a high-performance, high-capacity, highly secure, and resilient series of disk storage systems. The DS8880 family is the latest and most advanced of the DS8000 offerings to date. The high availability, multiplatform support, including IBM z Systems®, and simplified management tools help provide a cost-effective path to an on-demand world. The IBM DS8880 family now offers business-critical, all-flash, and hybrid data systems that span a wide range of price points: DS8884 -- Business Class DS8886 -- Enterprise Class DS8888 -- Analytics Class The DS8884 and DS8886 are available as either hybrid models, or can be configured as all-flash. Each model represents the most recent in this series of high-performance, high-capacity, flexible, and resilient storage systems. These systems are intended to address the needs of the most demanding clients. Two powerful IBM POWER8® processor-based servers manage the cache to streamline disk I/O, maximizing performance and throughput. These capabilities are further enhanced with the availability of the second generation of high-performance flash enclosures (HPFEs Gen-2). Like its predecessors, the DS8880 supports advanced disaster recovery (DR) solutions, business continuity solutions, and thin provisioning. All disk drives in the DS8880 storage system include the Full Disk Encryption (FDE) feature. The DS8880 can automatically optimize the use of each storage tier, particularly flash drives and flash cards, through the IBM Easy Tier® feature. The DS8880 also includes the Copy Services Manager code and allows for easier integration in a Lightweight Directory Access Protocol (LDAP) infrastructure.

Introducing and Implementing IBM FlashSystem V9000

2016-12-28 O'Reilly Amazon

book

Christophe Fagiano , Jon Herd , Detlef Helmbrecht , Carsten Larsen , Renato Santos , Jeffrey Irving , James Thompson , Jana Jamsek

data data-engineering IBM Analytics Cloud Computing Data Management

The success or failure of businesses often depends on how well organizations use their data assets for competitive advantage. Deeper insights from data require better information technology. As organizations modernize their IT infrastructure to boost innovation rather than limit it, they need a data storage system that can keep pace with highly virtualized environments, cloud computing, mobile and social systems of engagement, and in-depth, real-time analytics. Making the correct decision on storage investment is critical. Organizations must have enough storage performance and agility to innovate as they need to implement cloud-based IT services, deploy virtual desktop infrastructure, enhance fraud detection, and use new analytics capabilities. At the same time, future storage investments must lower IT infrastructure costs while helping organizations to derive the greatest possible value from their data assets. The IBM® FlashSystem V9000 is the premier, fully integrated, Tier 1, all-flash offering from IBM. It has changed the economics of today’s data center by eliminating storage bottlenecks. Its software-defined storage features simplify data management, improve data security, and preserve your investments in storage. The IBM FlashSystem® V9000 SAS expansion enclosures provide new tiering options with read-intensive SSDs or nearline SAS HDDs. IBM FlashSystem V9000 includes IBM FlashCore® technology and advanced software-defined storage available in one solution in a compact 6U form factor. IBM FlashSystem V9000 improves business application availability. It delivers greater resource utilization so you can get the most from your storage resources, and achieve a simpler, more scalable, and cost-efficient IT Infrastructure. This IBM Redbooks® publication provides information about IBM FlashSystem V9000 Software V7.7 and introduces the recently announced V7.8. It describes the product architecture, software, hardware, and implementation, and provides hints and tips. It illustrates use cases and independent software vendor (ISV) scenarios that demonstrate real-world solutions, and also provides examples of the benefits gained by integrating the IBM FlashSystem storage into business environments. This book offers IBM FlashSystem V9000 scalability concepts and guidelines for planning, installing, and configuring, which can help environments scale up and out to add more flash capacity and expand virtualized systems. Port utilization methodologies are provided to help you maximize the full potential of IBM FlashSystem V9000 performance and low latency in your scalable environment. This book is intended for pre-sales and post-sales technical support professionals, storage administrators, and anyone who wants to understand how to implement this exciting technology.

Apache Spark for Data Science Cookbook

2016-12-22 O'Reilly Amazon

book

Padma Priya Chitturi

data data-engineering apache-spark AI/ML Analytics Big Data

In "Apache Spark for Data Science Cookbook," you'll delve into solving real-world analytical challenges using the robust Apache Spark framework. This book features hands-on recipes that cover data analysis, distributed machine learning, and real-time data processing. You'll gain practical skills to process, visualize, and extract insights from large datasets efficiently. What this Book will help me do Master using Apache Spark for processing and analyzing large-scale datasets effectively. Harness Spark's MLLib for implementing machine learning algorithms like classification and clustering. Utilize libraries such as NumPy, SciPy, and Pandas in conjunction with Spark for numerical computations. Apply techniques like Natural Language Processing and text mining using Spark-integrated tools. Perform end-to-end data science workflows, including data exploration, modeling, and visualization. Author(s) Nagamallikarjuna Inelu and None Chitturi bring their extensive experience working with data science and distributed computing frameworks like Apache Spark. Nagamallikarjuna specializes in applying machine learning algorithms to big data problems, while None has contributed to various big data system implementations. Together, they focus on providing practitioners with practical and efficient solutions. Who is it for? This book is primarily intended for novice and intermediate data scientists and analysts who are curious about using Apache Spark to tackle data science problems. Readers are expected to have some familiarity with basic data science tasks. If you want to learn practical applications of Spark in data analysis and enhance your big data analytics skills, this resource is for you.

Practical Data Science with Hadoop® and Spark: Designing and Building Effective Analytics at Scale

2016-12-12 O'Reilly Amazon

book

Casey Stella , Douglas Eadline , Ofer Mendelevitch

data data-engineering Hadoop AI/ML Analytics Big Data

The Complete Guide to Data Science with Hadoop—For Technical Professionals, Businesspeople, and Students Demand is soaring for professionals who can solve real data science problems with Hadoop and Spark. is your complete guide to doing just that. Drawing on immense experience with Hadoop and big data, three leading experts bring together everything you need: high-level concepts, deep-dive techniques, real-world use cases, practical applications, and hands-on tutorials. Practical Data Science with Hadoop® and Spark The authors introduce the essentials of data science and the modern Hadoop ecosystem, explaining how Hadoop and Spark have evolved into an effective platform for solving data science problems at scale. In addition to comprehensive application coverage, the authors also provide useful guidance on the important steps of data ingestion, data munging, and visualization. Once the groundwork is in place, the authors focus on specific applications, including machine learning, predictive modeling for sentiment analysis, clustering for document analysis, anomaly detection, and natural language processing (NLP). This guide provides a strong technical foundation for those who want to do practical data science, and also presents business-driven guidance on how to apply Hadoop and Spark to optimize ROI of data science initiatives. Learn What data science is, how it has evolved, and how to plan a data science career How data volume, variety, and velocity shape data science use cases Hadoop and its ecosystem, including HDFS, MapReduce, YARN, and Spark Data importation with Hive and Spark Data quality, preprocessing, preparation, and modeling Visualization: surfacing insights from huge data sets Machine learning: classification, regression, clustering, and anomaly detection Algorithms and Hadoop tools for predictive modeling Cluster analysis and similarity functions Large-scale anomaly detection NLP: applying data science to human language

Implementing IBM FlashSystem 900

2016-11-18 O'Reilly Amazon

book

Karen Orlando , Jon Herd , Detlef Helmbrecht , Carsten Larsen , Ingo Dimmer , Matt Levan

data data-engineering IBM Analytics Cloud Computing

Today’s global organizations depend on being able to unlock business insights from massive volumes of data. Now, with IBM® FlashSystem 900, powered by IBM FlashCore™ technology, they can make faster decisions based on real-time insights and unleash the power of the most demanding applications, including online transaction processing (OLTP) and analytics databases, virtual desktop infrastructures (VDIs), technical computing applications, and cloud environments. This IBM Redbooks® publication introduces clients to the IBM FlashSystem® 900. It provides in-depth knowledge of the product architecture, software and hardware, implementation, and hints and tips. Also illustrated are use cases that show real-world solutions for tiering, flash-only, and preferred-read, and also examples of the benefits gained by integrating the FlashSystem storage into business environments. This book is intended for pre-sales and post-sales technical support professionals and storage administrators, and for anyone who wants to understand how to implement this new and exciting technology. This book describes the following offerings of the IBM Spectrum™ Storage family: IBM Spectrum Storage™ IBM Spectrum Control™ IBM Spectrum Virtualize™ IBM Spectrum Scale™ IBM Spectrum Accelerate™

The Big Data Transformation

2016-11-15 O'Reilly Amazon

book

Ashish Thusoo

data data-engineering Analytics Big Data Data Analytics DWH

Business executives today are well aware of the power of data, especially for gaining actionable insight into products and services. But how do you jump into the big data analytics game without spending millions on data warehouse solutions you don’t need? This 40-page report focuses on massively parallel processing (MPP) analytical databases that enable you to run queries and dashboards on a variety of business metrics at extreme speed and Exabyte scale. Because they leverage the full computational power of a cluster, MPP analytical databases can analyze massive volumes of data—both structured and semi-structured—at unprecedented speeds. This report presents five real-world case studies from Etsy, Cerner Corporation, Criteo and other global enterprises to focus on one big data analytics platform in particular, HPE Vertica. You’ll discover: How one prominent data storage company convinced both business and tech stakeholders to adopt an MPP analytical database Why performance marketing technology company Criteo used a Center of Excellence (CoE) model to ensure the success of its big data analytics endeavors How YPSM uses Vertica to speed up its Hadoop-based data processing environment Why Cerner adopted an analytical database to scale its highly successful health information technology platform How Etsy drives success with the company’s big data initiative by avoiding common technical and organizational mistakes

Oracle R Enterprise: Harnessing the Power of R in Oracle Database

2016-11-04 O'Reilly Amazon

book

Brendan Tierney

data data-engineering oracle-database-solutions Analytics API Big Data

Master the Big Data Capabilities of Oracle R Enterprise Effectively manage your enterprise’s big data and keep complex processes running smoothly using the hands-on information contained in this Oracle Press guide. Oracle R Enterprise: Harnessing the Power of R in Oracle Database shows, step-by-step, how to create and execute large-scale predictive analytics and maintain superior performance. Discover how to explore and prepare your data, accurately model business processes, generate sophisticated graphics, and write and deploy powerful scripts. You will also find out how to effectively incorporate Oracle R Enterprise features in APEX applications, OBIEE dashboards, and Apache Hadoop systems. Learn to: • Install, configure, and administer Oracle R Enterprise • Establish connections and move data to the database • Create Oracle R Enterprise packages and functions • Use the R language to work with data in Oracle Database • Build models using ODM, ORE, and other algorithms • Develop and deploy R scripts and use the R script repository • Execute embedded R scripts and employ ORE SQL API functions • Map and manipulate data using Oracle R Advanced Analytics for Hadoop • Use ORE in Oracle Data Miner, OBIEE, and other applications

Spark in Action

2016-11-03 O'Reilly Amazon

book

Marko Bonaci , Petar Zecevic

data data-engineering apache-spark AI/ML Analytics API

Spark in Action teaches you the theory and skills you need to effectively handle batch and streaming data using Spark. Fully updated for Spark 2.0. About the Technology Big data systems distribute datasets across clusters of machines, making it a challenge to efficiently query, stream, and interpret them. Spark can help. It is a processing system designed specifically for distributed data. It provides easy-to-use interfaces, along with the performance you need for production-quality analytics and machine learning. Spark 2 also adds improved programming APIs, better performance, and countless other upgrades. About the Book Spark in Action teaches you the theory and skills you need to effectively handle batch and streaming data using Spark. You'll get comfortable with the Spark CLI as you work through a few introductory examples. Then, you'll start programming Spark using its core APIs. Along the way, you'll work with structured data using Spark SQL, process near-real-time streaming data, apply machine learning algorithms, and munge graph data using Spark GraphX. For a zero-effort startup, you can download the preconfigured virtual machine ready for you to try the book's code. What's Inside Updated for Spark 2.0 Real-life case studies Spark DevOps with Docker Examples in Scala, and online in Java and Python About the Reader Written for experienced programmers with some background in big data or machine learning. About the Authors Petar Zečević and Marko Bonaći are seasoned developers heavily involved in the Spark community. Quotes Dig in and get your hands dirty with one of the hottest data processing engines today. A great guide. - Jonathan Sharley, Pandora Media Must-have! Speed up your learning of Spark as a distributed computing framework. - Robert Ormandi, Yahoo! An easy-to-follow, step-by-step guide. - Gaurav Bhardwaj, 3Pillar Global An ambitiously comprehensive overview of Spark and its diverse ecosystem. - Jonathan Miller, Optensity

Fast Data Processing with Spark 2 - Third Edition

2016-10-24 O'Reilly Amazon

book

Krishna Sankar , Holden Karau

data data-engineering apache-spark AI/ML Analytics API

Fast Data Processing with Spark 2 takes you through the essentials of leveraging Spark for big data analysis. You will learn how to install and set up Spark, handle data using its APIs, and apply advanced functionality like machine learning and graph processing. By the end of the book, you will be well-equipped to use Spark in real-world data processing tasks. What this Book will help me do Install and configure Apache Spark for optimal performance. Interact with distributed datasets using the resilient distributed dataset (RDD) API. Leverage the flexibility of DataFrame API for efficient big data analytics. Apply machine learning models using Spark MLlib to solve complex problems. Perform graph analysis using GraphX to uncover structural insights in data. Author(s) Krishna Sankar is an experienced data scientist and thought leader in big data technologies. With a deep understanding of machine learning, distributed systems, and Apache Spark, Krishna has guided numerous projects in data engineering and big data processing. Matei Zaharia, the co-author, is also widely recognized in the field of distributed systems and cloud computing, contributing to Apache Spark development. Who is it for? This book is catered to software developers and data engineers with a foundational understanding of Scala or Java programming. Beginner to medium-level understanding of big data processing concepts is recommended for readers. If you are aspiring to solve big data problems using scalable distributed computing frameworks, this book is perfect for you. By the end, you will be confident in building Spark-powered applications and analyzing data efficiently.

In-Place Analytics with Live Enterprise Data with IBM DB2 Query Management Facility

2016-10-21 O'Reilly Amazon

book

Shawn Sullivan , Doug Anderson , Linus Olson , Mike Biere

data data-engineering relational-databases ibm-db2 Analytics BI

IBM® DB2® Query Management Facility™ for z/OS® provides a zero-footprint, mobile-enabled, highly secure business analytics solution. IBM QMF™ V11.2.1 offers many significant new features and functions in keeping with the ongoing effort to broaden its usage and value to a wider set of users and business areas. In this IBM Redbooks® publication, we explore several of the new features and options that are available within this new release. This publication introduces TSO enhancements for QMF Analytics for TSO and QMF Enhanced Editor. A chapter describes how the QMF Data Service component connects to multiple mainframe data sources to accomplish the consolidation and delivery of data. This publication describes how self-service business intelligence can be achieved by using QMF Vision to enable self-service dashboards and data exploration. A chapter is dedicated to JavaScript support, demonstrating how application developers can use JavaScript to extend the capabilities of QMF. Additionally, this book describes methods to take advantage of caching for reduced CPU consumption, wider access to information, and faster performance. This publication is of interest to anyone who wants to better understand how QMF can enable in-place analytics with live enterprise data.

VersaStack Solution by Cisco and IBM with Oracle RAC, IBM FlashSystem V9000, and IBM Spectrum Protect

2016-10-17 O'Reilly Amazon

book

Jon Tate , Dong Hai Yu , Randy Watson , Dharmesh Kamdar

data data-engineering IBM Agile/Scrum Analytics Cloud Computing

Dynamic organizations want to accelerate growth while reducing costs. To do so, they must speed the deployment of business applications and adapt quickly to any changes in priorities. Organizations today require an IT infrastructure that is easy, efficient, and versatile. The VersaStack solution by Cisco and IBM® can help you accelerate the deployment of your data centers. It reduces costs by more efficiently managing information and resources while maintaining your ability to adapt to business change. The VersaStack solution combines the innovation of Cisco UCS Integrated Infrastructure with the efficiency of the IBM Storwize® storage system. The Cisco UCS Integrated Infrastructure includes the Cisco Unified Computing System (Cisco UCS), Cisco Nexus and Cisco MDS switches, and Cisco UCS Director. The IBM FlashSystem® V9000 enhances virtual environments with its Data Virtualization, IBM Real-time Compression™, and IBM Easy Tier® features. These features deliver extraordinary levels of performance and efficiency. The VersaStack solution is Cisco Application Centric Infrastructure (ACI) ready. Your IT team can build, deploy, secure, and maintain applications through a more agile framework. Cisco Intercloud Fabric capabilities help enable the creation of open and highly secure solutions for the hybrid cloud. These solutions accelerate your IT transformation while delivering dramatic improvements in operational efficiency and simplicity. Cisco and IBM are global leaders in the IT industry. The VersaStack solution gives you the opportunity to take advantage of integrated infrastructure solutions that are targeted at enterprise applications, analytics, and cloud solutions. The VersaStack solution is backed by Cisco Validated Designs (CVD) to provide faster delivery of applications, greater IT efficiency, and less risk. This IBM Redbooks® publication is aimed at experienced storage administrators who are tasked with deploying a VersaStack solution with Oracle Real Application Clusters (RAC) and IBM Spectrum™ Protect.

Spark for Data Science

2016-09-30 O'Reilly Amazon

book

Bikramaditya Singhal , Srinivas Duvvuri

data data-engineering apache-spark AI/ML Analytics API

Explore how to leverage Apache Spark for efficient big data analytics and machine learning solutions in "Spark for Data Science". This detailed guide provides you with the skills to process massive datasets, perform data analytics, and build predictive models using Spark's powerful tools like RDDs, DataFrames, and Datasets. What this Book will help me do Gain expertise in data processing and transformation with Spark. Perform advanced statistical analysis to uncover insights. Master machine learning techniques to create predictive models using Spark. Utilize Spark's APIs to process and visualize big data. Build scalable and efficient data science solutions. Author(s) This book is co-authored by None Singhal and None Duvvuri, both accomplished data scientists with extensive experience in Apache Spark and big data technologies. They bring their practical industry expertise to explain complex topics in a straightforward manner. Their writing emphasizes real-world applications and step-by-step procedural guidance, making this a valuable resource for learners. Who is it for? This book is ideally suited for technologists seeking to incorporate data science capabilities into their work with Apache Spark, data scientists interested in machine learning algorithms implemented in Spark, and beginners aiming to step into the field of big data analytics. Whether you are familiar with Spark or completely new, this book offers valuable insights and practical knowledge.

Big Data Analytics

2016-09-28 O'Reilly Amazon

book

Aravind Nallan , Venkat Ankam

data data-engineering apache-spark AI/ML Analytics Big Data

Dive into the world of big data with "Big Data Analytics: Real Time Analytics Using Apache Spark and Hadoop." This comprehensive guide introduces readers to the fundamentals and practical applications of Apache Spark and Hadoop, covering essential topics like Spark SQL, DataFrames, structured streaming, and more. Learn how to harness the power of real-time analytics and big data tools effectively. What this Book will help me do Master the key components of Apache Spark and Hadoop ecosystems, including Spark SQL and MapReduce. Gain an understanding of DataFrames, DataSets, and structured streaming for seamless data handling. Develop skills in real-time analytics using Spark Streaming and technologies like Kafka and HBase. Learn to implement machine learning models using Spark's MLlib and ML Pipelines. Explore graph analytics with GraphX and leverage data visualization tools like Jupyter and Zeppelin. Author(s) Venkat Ankam, an expert in big data technologies, has years of experience working with Apache Hadoop and Spark. As an educator and technical consultant, Venkat has enabled numerous professionals to gain critical insights into big data ecosystems. With a pragmatic approach, his writings aim to guide readers through complex systems in a structured and easy-to-follow manner. Who is it for? This book is perfect for data analysts, data scientists, software architects, and programmers aiming to expand their knowledge of big data analytics. Readers should ideally have a basic programming background in languages like Python, Scala, R, or SQL. Prior hands-on experience with big data environments is not necessary but is an added advantage. This guide is created to cater to a range of skill levels, from beginners to intermediate learners.

Big Data War

2016-08-26 O'Reilly Amazon

book

Patrick H. Park

data data-engineering Analytics Big Data Data Analytics

This book mainly focuses on why data analytics fails in business. It provides an objective analysis and root causes of the phenomenon, instead of abstract criticism of utility of data analytics. The author, then, explains in detail on how companies can survive and win the global big data competition, based on actual cases of companies. Having established the execution and performance-oriented big data methodology based on over 10 years of experience in the field as an authority in big data strategy, the author identifies core principles of data analytics using case analysis of failures and successes of actual companies. Moreover, he endeavors to share with readers the principles regarding how innovative global companies became successful through utilization of big data. This book is a quintessential big data analytics, in which the author’s knowhow from direct and indirect experiences is condensed. How do we survive at this big data war in which Facebook in SNS, Amazon in e-commerce, Google in search, expand their platforms to other areas based on their respective distinct markets? The answer can be found in this book.

IBM Data Engine for Hadoop and Spark

2016-08-24 O'Reilly Amazon

book

Dino Quintero , Reinaldo Tetsuo Katahira , Aditya Gandakusuma Sutandyo , Nicolas Joly , Luis Bolinches

data data-engineering IBM Analytics Big Data Hadoop

This IBM® Redbooks® publication provides topics to help the technical community take advantage of the resilience, scalability, and performance of the IBM Power Systems™ platform to implement or integrate an IBM Data Engine for Hadoop and Spark solution for analytics solutions to access, manage, and analyze data sets to improve business outcomes. This book documents topics to demonstrate and take advantage of the analytics strengths of the IBM POWER8® platform, the IBM analytics software portfolio, and selected third-party tools to help solve customer's data analytic workload requirements. This book describes how to plan, prepare, install, integrate, manage, and show how to use the IBM Data Engine for Hadoop and Spark solution to run analytic workloads on IBM POWER8. In addition, this publication delivers documentation to complement available IBM analytics solutions to help your data analytic needs. This publication strengthens the position of IBM analytics and big data solutions with a well-defined and documented deployment model within an IBM POWER8 virtualized environment so that customers have a planned foundation for security, scaling, capacity, resilience, and optimization for analytics workloads. This book is targeted at technical professionals (analytics consultants, technical support staff, IT Architects, and IT Specialists) that are responsible for delivering analytics solutions and support on IBM Power Systems.

Real World SQL and PL/SQL: Advice from the Experts

2016-08-22 O'Reilly Amazon

book

Arup Nanda , Alex Nuitjen , Heli Helskyaho , Martin Widlake , Brendan Tierney

data data-engineering SQL pl-sql pl/sql Analytics

Master the Underutilized Advanced Features of SQL and PL/SQL This hands-on guide from Oracle Press shows how to fully exploit lesser known but extremely useful SQL and PL/SQL features―and how to effectively use both languages together. Written by a team of Oracle ACE Directors, Real-World SQL and PL/SQL: Advice from the Experts features best practices, detailed examples, and insider tips that clearly demonstrate how to write, troubleshoot, and implement code for a wide variety of practical applications. The book thoroughly explains underutilized SQL and PL/SQL functions and lays out essential development strategies. Data modeling, advanced analytics, database security, secure coding, and administration are covered in complete detail. Learn how to: • Apply advanced SQL and PL/SQL tools and techniques • Understand SQL and PL/SQL functionality and determine when to use which language • Develop accurate data models and implement business logic • Run PL/SQL in SQL and integrate complex datasets • Handle PL/SQL instrumenting and profiling • Use Oracle Advanced Analytics and Oracle R Enterprise • Build and execute predictive queries • Secure your data using encryption, hashing, redaction, and masking • Defend against SQL injection and other code-based attacks • Work with Oracle Virtual Private Database Code examples in the book are available for download at www.MHProfessional.com. TAG: For a complete list of Oracle Press titles, visit www.OraclePressBooks.com

Architecting for Access

2016-08-15 O'Reilly Amazon

book

Rich Morrow

data data-engineering Analytics Hadoop Looker NoSQL

Fragmented, disparate backend data systems have become the norm in today’s enterprise, where you’ll find a mix of relational databases, Hadoop stores, and NoSQL engines, with access and analytics tools bolted on every which way. This mishmash of options presents a real challenge when it comes to choosing frontend analytics and visualization tools. How did we get here? In this O’Reilly report, IT veteran Rich Morrow takes you through the rapid changes to both backend storage and frontend analytics over the past decade, and provides a pragmatic list of requirements for an analytics stack that will centralize access to all of these data systems. You’ll examine current analytics platforms, including Looker—a new breed of analytics and visualization tools built specifically to handle our fragmented data space. Understand why and how data became so fractured so quickly Explore the tangled web of data and backend tools in today’s enterprises Learn the tool requirements for accessing and analyzing the full spectrum of data Examine the relative strengths of popular analytics and visualization tools, including Looker, Tableau, and MicroStrategy Inspect Looker’s unique focus on both the frontend and backend

talk-data.com

O'Reilly Data Engineering Books

Top Topics

Top Speakers

Mastering Spark for Data Science

Learning Apache Spark 2

SQL Server 2016 Developer's Guide

Mastering Elastic Stack

Mastering Elasticsearch 5.x - Third Edition

Geospatial Data and Analysis

Elasticsearch 5.x Cookbook - Third Edition

Tabular Modeling with SQL Server 2016 Analysis Services Cookbook

IBM DS8880 Architecture and Implementation (Release 8.2.1)

Introducing and Implementing IBM FlashSystem V9000

Apache Spark for Data Science Cookbook

Practical Data Science with Hadoop® and Spark: Designing and Building Effective Analytics at Scale

Implementing IBM FlashSystem 900

The Big Data Transformation

Oracle R Enterprise: Harnessing the Power of R in Oracle Database

Spark in Action

Fast Data Processing with Spark 2 - Third Edition

In-Place Analytics with Live Enterprise Data with IBM DB2 Query Management Facility

VersaStack Solution by Cisco and IBM with Oracle RAC, IBM FlashSystem V9000, and IBM Spectrum Protect

Spark for Data Science

Big Data Analytics

Big Data War

IBM Data Engine for Hadoop and Spark

Real World SQL and PL/SQL: Advice from the Experts

Architecting for Access