talk-data.com

Topic: data-engineering · 3395 tagged

Activity Trend: quarterly activity chart, 2020-Q1 through 2026-Q1

Activities

3395 activities · Newest first

Building on Multi-Model Databases

In many organizations today, businesspeople are busy requesting unified views of data stored across multiple sources. But integrating multiple data types from multiple data stores is a complex, error-prone, and time-consuming process of cobbling everything together manually. This concise book examines how multi-model databases can help you integrate data storage and access across your organization in a seamless and elegant way. Authors Pete Aven and Diane Burley from MarkLogic explain how this latest evolution in data management naturally accepts heterogeneous data, enabling you to eventually phase out technical data silos. Through several case studies, you'll discover how organizations use multi-model databases to reduce complexity, save money, take advantage of opportunities, lessen risk, and shorten time to value.

- Get unified views across disparate data models and formats within a single database
- Learn how multi-model databases leverage the inherent structure of the data being stored
- Load and use unstructured and semi-structured data (such as documents and text) as is
- Provide agility in data access and delivery through APIs, interfaces, and indexes
- Learn how to scale a multi-model database, and provide ACID capabilities and security
- Examine how a multi-model database would fit into your existing architecture

Streaming Data

Streaming Data introduces the concepts and requirements of streaming and real-time data systems. The book is an idea-rich tutorial that teaches you to think about how to efficiently interact with fast-flowing data.

About the Technology
As humans, we're constantly filtering and deciphering the information streaming toward us. In the same way, streaming data applications can accomplish amazing tasks like reading live location data to recommend nearby services, tracking faults with machinery in real time, and sending digital receipts before your customers leave the shop. Recent advances in streaming data technology and techniques make it possible for any developer to build these applications if they have the right mindset. This book will let you join them.

About the Book
Streaming Data is an idea-rich tutorial that teaches you to think about efficiently interacting with fast-flowing data. Through relevant examples and illustrated use cases, you'll explore designs for applications that read, analyze, share, and store streaming data. Along the way, you'll discover the roles of key technologies like Spark, Storm, Kafka, Flink, RabbitMQ, and more. This book offers the perfect balance between big-picture thinking and implementation details.

What's Inside
- The right way to collect real-time data
- Architecting a streaming pipeline
- Analyzing the data
- Which technologies to use and when

About the Reader
Written for developers familiar with relational database concepts. No experience with streaming or real-time applications required.

About the Author
Andrew Psaltis is a software engineer focused on massively scalable real-time analytics.

Quotes
"The definitive book if you want to master the architecture of an enterprise-grade streaming application." - Sergio Fernandez Gonzalez, Accenture
"A thorough explanation and examination of the different systems, strategies, and tools for streaming data implementations." - Kosmas Chatzimichalis, Mach 7x
"A well-structured way to learn about streaming data and how to put it into practice in modern real-time systems." - Giuliano Araujo Bertoti, FATEC
"This book is all you need to understand what streaming is all about!" - Carlos Curotto, Globant
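
The windowing mindset the book teaches can be illustrated without any infrastructure at all. Below is a minimal, self-contained Python sketch (not from the book; the simulated sensor stream is invented) of a tumbling-window count, the kind of primitive streaming pipelines are built from:

```python
def windowed_counts(events, window_seconds=5.0):
    """Count events per key over tumbling time windows.

    `events` is any time-ordered iterable of (timestamp, key) pairs,
    standing in for a fast-flowing source such as a Kafka topic.
    """
    window_start, counts = None, {}
    for ts, key in events:
        if window_start is None:
            window_start = ts
        # When the current window closes, emit it and open a new one.
        if ts - window_start >= window_seconds:
            yield window_start, counts
            window_start, counts = ts, {}
        counts[key] = counts.get(key, 0) + 1
    if counts:
        yield window_start, counts

# Simulated stream: a reading every half second from two sensors.
stream = ((0.5 * i, "sensor-a" if i % 3 else "sensor-b") for i in range(40))
for start, per_key in windowed_counts(stream):
    print(f"window starting at t={start:4.1f}s: {per_key}")
```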

Building Custom Tasks for SQL Server Integration Services

Learn to build custom SSIS tasks using Visual Studio Community Edition and Visual Basic. Bring all the power of Microsoft .NET to bear on your data integration and ETL processes, and for no added cost over what you've already spent on licensing SQL Server. If you already have a license for SQL Server, then you do not need to spend more money to extend SSIS with custom tasks and components.

Why are custom components necessary? Because even though the SSIS catalog of built-in tasks and components is a marvel of engineering, gaps remain in the functionality it provides. These gaps are especially relevant to enterprises practicing Data Integration Lifecycle Management (DILM) and/or DevOps. One of the gaps is a limitation of the SSIS Execute Package task: developers using the stock version of that task are unable to select SSIS packages from other projects. Yet it's useful to be able to select and execute packages across projects, and the example used throughout this book will help you create an Execute Catalog Package task that does in fact allow you to execute a package from another project. Building on the example's pattern, you can create any task you like, custom-tailored to your specific data integration and ETL needs.

What You Will Learn
- Configure and execute Visual Studio in the way that best supports SSIS task development
- Create a class library as the basis for an SSIS task, and reference the needed SSIS assemblies
- Properly sign assemblies that you create in order to invoke them from your task
- Implement source code control via Visual Studio Team Services, or your own favorite tool set
- Code not only your tasks themselves, but also the associated task editors
- Troubleshoot and then execute your custom tasks as part of your own project

Who This Book Is For
Database administrators and developers who are involved in ETL projects built around SQL Server Integration Services (SSIS). Readers should have a background in programming along with a desire to optimize their ETL efforts by creating custom-tailored tasks for execution from SSIS packages.

JSON at Work

JSON is becoming the backbone for meaningful data interchange over the internet. This format is now supported by an entire ecosystem of standards, tools, and technologies for building truly elegant, useful, and efficient applications. With this hands-on guide, author and architect Tom Marrs shows you how to build enterprise-class applications and services by leveraging JSON tooling and message/document design. JSON at Work provides application architects and developers with guidelines, best practices, and use cases, along with lots of real-world examples and code samples. You'll start with a comprehensive JSON overview, explore the JSON ecosystem, and then dive into JSON's use in the enterprise.

- Get acquainted with JSON basics and learn how to model JSON data
- Learn how to use JSON with Node.js, Ruby on Rails, and Java
- Structure JSON documents with JSON Schema to design and test APIs
- Search the contents of JSON documents with JSON Search tools
- Convert JSON documents to other data formats with JSON Transform tools
- Compare JSON-based hypermedia formats, including HAL and jsonapi
- Leverage MongoDB to store and access JSON documents
- Use Apache Kafka to exchange JSON-based messages between services
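
To make the JSON Schema idea concrete, here is a small hedged sketch using Python's `jsonschema` package (the schema and documents are invented for illustration, not taken from the book):

```python
from jsonschema import validate, ValidationError

# Hypothetical schema for an order message exchanged between services.
order_schema = {
    "type": "object",
    "properties": {
        "orderId": {"type": "string"},
        "quantity": {"type": "integer", "minimum": 1},
    },
    "required": ["orderId", "quantity"],
}

good = {"orderId": "A-1001", "quantity": 2}
bad = {"orderId": "A-1002", "quantity": 0}  # violates "minimum": 1

validate(instance=good, schema=order_schema)  # passes silently
try:
    validate(instance=bad, schema=order_schema)
except ValidationError as err:
    print("rejected:", err.message)
```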

Frank Kane's Taming Big Data with Apache Spark and Python

This book introduces you to the world of Big Data processing using Apache Spark and Python. You will learn to set up and run Spark on different systems, process massive datasets, and create solutions to real-world Big Data challenges, with over 15 hands-on examples included.

What this Book will help me do
- Understand the basics of Apache Spark and its ecosystem.
- Learn how to process large datasets with Spark RDDs using Python.
- Implement machine learning models with Spark's MLlib library.
- Master real-time data processing with Spark Streaming modules.
- Deploy and run Spark jobs on cloud clusters using AWS EMR.

Author(s)
Frank Kane spent 9 years working at Amazon and IMDb, handling and solving real-world machine learning and Big Data problems. Today, as an instructional designer and educator, he brings his wealth of experience to learners around the globe by creating accessible, practical learning resources. His teaching is clear, engaging, and designed to prepare students for real-world applications.

Who is it for?
This book is ideal for data scientists or data analysts seeking to delve into Big Data processing with Apache Spark. Readers who have foundational knowledge of Python, as well as some understanding of data processing principles, will find this book useful to sharpen their skills further. It is designed for those eager to learn the practical applications of Big Data tools in today's industry environments. By the end of this book, you should feel confident tackling Big Data challenges using Spark and Python.
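
In the spirit of the book's RDD-based examples, here is a minimal PySpark sketch (assuming `pyspark` is installed; the sample lines are invented) showing the classic word count over an in-memory dataset:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize([
    "big data with spark",
    "taming big data",
    "spark and python",
])

counts = (
    lines.flatMap(lambda line: line.split())  # split each line into words
         .map(lambda word: (word, 1))         # pair each word with a 1
         .reduceByKey(lambda a, b: a + b)     # sum the counts per word
)

for word, n in counts.collect():
    print(word, n)

spark.stop()
```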

Learning Elasticsearch

This comprehensive guide to Elasticsearch will teach you how to build robust and scalable search and analytics applications using Elasticsearch 5.x. You will learn the fundamentals of Elasticsearch, including its APIs and tools, and how to apply them to real-world problems. By the end of the book, you will have a solid grasp of Elasticsearch and be ready to implement your own solutions.

What this Book will help me do
- Master the setup and configuration of Elasticsearch and Kibana.
- Learn to efficiently query and analyze both structured and unstructured data.
- Understand how to use Elasticsearch aggregations to perform advanced analytics.
- Gain knowledge of advanced search features including geospatial queries and autocomplete.
- Explore the Elastic Stack and learn deployment best practices and cloud hosting options.

Author(s)
Andhavarapu is an expert in database technology and distributed systems, with years of experience in Elasticsearch. Their passion for search technologies is reflected in their clear and practical teaching style. They've written this guide to help developers of all levels get up to speed with Elasticsearch quickly and comprehensively.

Who is it for?
This book is perfect for software developers looking to implement effective search and analytics solutions. It's ideal for those who are new to Elasticsearch as well as for professionals familiar with other search tools like Lucene or Solr. The book assumes basic programming knowledge but no prior experience with Elasticsearch.
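
As a taste of the aggregations the book covers, here is a hedged sketch of a terms aggregation using the `elasticsearch` Python client in its 5.x-era form (the host, `products` index, and `category` field are placeholders; a running cluster is assumed):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

query = {
    "size": 0,  # skip the hits; we only want the aggregation buckets
    "aggs": {
        "by_category": {
            "terms": {"field": "category.keyword", "size": 10}
        }
    },
}

resp = es.search(index="products", body=query)
for bucket in resp["aggregations"]["by_category"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```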

SQL Server 2017 Integration Services Cookbook

SQL Server 2017 Integration Services Cookbook is your key to mastering effective data integration and transformation solutions using SSIS 2017. Through clear, concise recipes, this book teaches the advanced ETL techniques necessary for creating efficient data workflows, leveraging both traditional and modern data platforms.

What this Book will help me do
- Master the integration of diverse data sources into comprehensive data models.
- Develop optimized ETL workflows that improve operational efficiency.
- Leverage the new features introduced in SQL Server 2017 for enhanced data processing.
- Implement scalable data warehouse solutions suitable for modern analytics workloads.
- Customize and extend integration services to handle specific data transformation needs.

Author(s)
The authors are seasoned professionals in data integration and ETL technologies. They bring years of real-world experience using SQL Server Integration Services in various enterprise scenarios. Their combined expertise ensures practical insights and guidance, making complex concepts accessible to learners and practitioners alike.

Who is it for?
This book is ideal for data engineers and ETL developers who already understand the basics of SQL Server and want to master advanced data integration techniques. It is also suitable for database administrators and data analysts aiming to enhance their skill set with efficient ETL processes. Arm yourself with this guide to learn not just the how, but also the why, behind successful data transformations.

Implementing OpenStack SwiftHLM with IBM Spectrum Archive EE or IBM Spectrum Protect for Space Management

The Swift High Latency Media (SwiftHLM) project seeks to create a high-latency storage back end that makes it easier for users to perform bulk operations of data tiering within a Swift data ring. In today's world, data is produced at significantly higher rates than a decade ago. The storage and data management solutions of the past can no longer keep up with the data demands of today. The policies and structures that decide and execute how that data is used, discarded, or retained determine how efficiently the data is used. The need for intelligent data management and storage is more critical now than ever before. Traditional management approaches hide cost-effective, high-latency media (HLM) storage, such as tape or optical disk archive back ends, underneath a traditional file system. The lack of HLM-aware file system interfaces and software makes it difficult for users to understand and control data access on HLM storage. Coupled with data-access latency, this lack of understanding results in slow responses and potential time-outs that affect the user experience. The SwiftHLM project addresses this challenge. Running OpenStack Swift on top of HLM storage allows you to cheaply store and efficiently access large amounts of infrequently used object data. Data that is stored on tape can be easily adapted to an Object Storage data interface. This IBM® Redpaper™ publication describes the SwiftHLM project and provides guidance for installation and configuration.

Development Workflows for Data Scientists

Data science teams often borrow best practices from software development, but since the product of a data science project is insight, not code, software development workflows are not a perfect fit. How can data scientists create workflows tailored to their needs? Through interviews with several data-driven organizations, this practical report reveals how data science teams are improving the way they define, enforce, and automate a development workflow. Data science workflows differ from team to team because their tasks, goals, and skills vary so much. In this report, author Ciara Byrne talked to teams from BinaryEdge, Airbnb, GitHub, Scotiabank, Fast Forward Labs, Datascope, and others about their approaches to the data science process, including their procedures for:

- Defining team structure and roles
- Asking interesting questions
- Examining previous work
- Collecting, exploring, and modeling data
- Testing, documenting, and deploying code to production
- Communicating the results

With this report, you'll also examine a complete data science workflow developed by the team from Swiss cybersecurity firm BinaryEdge that includes steps for preliminary data analysis, exploratory data analysis, knowledge discovery, and visualization.

Understanding Message Brokers

Messaging is one of the more poorly understood areas of IT; most developers and architects have only a passing familiarity with how broker-based messaging technologies work. This practical report not only helps you get up to speed on the essentials of messaging, but also compares two of today's most popular messaging technologies: Apache ActiveMQ and Apache Kafka. Author and consultant Jakub Korab describes use cases and design choices that lead developers to very different approaches for developing message-based systems. You'll come away with a high-level understanding of both ActiveMQ and Kafka, including how they should and should not be used, how they handle concerns such as throughput and high availability, and what to look out for when considering other messaging technologies in the future.

- Understand the types of problems that messaging systems address
- Explore three primary messaging patterns: point-to-point, publish-subscribe, and a hybrid of both
- Dive into ActiveMQ, a classic broker-centric design implemented through Java libraries that works for a broad range of messaging use cases
- Examine Kafka, a distributed system that can be scaled to provide massive performance and fault tolerance through replication
- Learn the mechanical complexities that message-based systems need to address, and some patterns you can apply to deal with those complexities
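
The point-to-point and publish-subscribe patterns the report contrasts can be sketched without a broker at all. Here is a toy, standard-library-only Python illustration (queue and subscriber names are invented; a real system would use ActiveMQ or Kafka):

```python
from queue import Queue

# Point-to-point: one queue with competing consumers, so each
# message is delivered to exactly one consumer.
work_queue = Queue()
for i in range(4):
    work_queue.put(f"job-{i}")

workers = ["worker-a", "worker-b"]
turn = 0
while not work_queue.empty():
    print(workers[turn % len(workers)], "handled", work_queue.get())
    turn += 1

# Publish-subscribe: every subscriber gets its own queue, so every
# subscriber sees every message.
subscribers = {"billing": Queue(), "audit": Queue()}
for event in ["order-created", "order-paid"]:
    for q in subscribers.values():
        q.put(event)

for name, q in subscribers.items():
    while not q.empty():
        print(name, "received", q.get())
```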

Practical GIS

Practical GIS introduces you to the world of Geographic Information Systems (GIS) using accessible, open source tools. From setting up your GIS environment to creating and analyzing spatial data and publishing it online, this book covers everything you need to perform both beginner and advanced GIS tasks.

What this Book will help me do
- Understand the fundamentals of GIS and use open source tools effectively.
- Be able to collect, store, query, and manage spatial data efficiently.
- Perform advanced spatial analyses and solve real-world GIS problems practically.
- Learn how to publish and share GIS data and results using QGIS Server and GeoServer.
- Create web maps using lightweight web mapping libraries like Leaflet.

Author(s)
The authors of Practical GIS bring years of professional experience in GIS and data analysis, combining technical know-how with a teaching approach accessible to a wide range of learners. They strive to convey complex GIS concepts simply and practically, fostering a hands-on learning experience.

Who is it for?
This book is ideal for IT professionals new to GIS or those considering entering the GIS field. If you're looking for a cost-effective way to learn GIS without investing in expensive commercial software or formal training, Practical GIS provides the knowledge and tools you need. Beginners and intermediate learners alike will find this book to be a helpful stepping stone in mastering GIS.

Advanced Analytics with Spark, 2nd Edition

In the second edition of this practical book, four Cloudera data scientists present a set of self-contained patterns for performing large-scale data analysis with Spark. The authors bring Spark, statistical methods, and real-world data sets together to teach you how to approach analytics problems by example. Updated for Spark 2.1, this edition acts as an introduction to these techniques and other best practices in Spark programming. You'll start with an introduction to Spark and its ecosystem, and then dive into patterns that apply common techniques, including classification, clustering, collaborative filtering, and anomaly detection, to fields such as genomics, security, and finance. If you have an entry-level understanding of machine learning and statistics, and you program in Java, Python, or Scala, you'll find the book's patterns useful for working on your own data applications. With this book, you will:

- Familiarize yourself with the Spark programming model
- Become comfortable within the Spark ecosystem
- Learn general approaches in data science
- Examine complete implementations that analyze large public data sets
- Discover which machine learning tools make sense for particular problems
- Acquire code that can be adapted to many uses
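
As a flavor of the clustering pattern, here is a minimal hedged sketch using Spark ML's KMeans (the toy points are invented and not from the book; assumes `pyspark` is installed):

```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("kmeans-sketch").getOrCreate()

# Four points forming two obvious clusters near (0, 0) and (9, 9).
data = [
    (Vectors.dense([0.0, 0.1]),), (Vectors.dense([0.2, 0.0]),),
    (Vectors.dense([9.0, 9.1]),), (Vectors.dense([9.2, 8.9]),),
]
df = spark.createDataFrame(data, ["features"])

model = KMeans(k=2, seed=42).fit(df)
print("cluster centers:", model.clusterCenters())

spark.stop()
```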

Apache Spark 2.x Cookbook

Discover how to harness the power of Apache Spark 2.x for your Big Data processing projects. In this book, you will explore over 70 cloud-ready recipes that will guide you to perform distributed data analytics, structured streaming, machine learning, and much more.

What this Book will help me do
- Effectively install and configure Apache Spark with various cluster managers and platforms.
- Set up and utilize development environments tailored for Spark applications.
- Operate on schema-aware data using RDDs, DataFrames, and Datasets.
- Perform real-time streaming analytics with sources such as Apache Kafka.
- Leverage MLlib for supervised learning, unsupervised learning, and recommendation systems.

Author(s)
Yadav is a seasoned data engineer with a deep understanding of Big Data tools and technologies, particularly Apache Spark. With years of experience in the field of distributed computing and data analysis, Yadav brings practical insights and techniques to enrich the learning experience of readers.

Who is it for?
This book is ideal for data engineers, data scientists, and Big Data professionals who are keen to enhance their Apache Spark 2.x skills. If you're working with distributed processing and want to solve complex data challenges, this book addresses practical problems. Note that a basic understanding of Scala is recommended to get the most out of this resource.
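
For the streaming-analytics recipes, the core Structured Streaming pattern looks roughly like the hedged sketch below (broker address and topic name are placeholders; it assumes the spark-sql-kafka package is on the classpath and a Kafka broker is running):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

# Read an unbounded DataFrame from a Kafka topic.
events = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("subscribe", "events")
         .load()
)

# Kafka keys/values arrive as bytes; cast and count messages per key.
counts = (
    events.select(col("key").cast("string"))
          .groupBy("key")
          .count()
)

# Continuously print updated counts to the console.
query = (
    counts.writeStream
          .outputMode("complete")
          .format("console")
          .start()
)
query.awaitTermination()
```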

Data Lake for Enterprises

"Data Lake for Enterprises" is a comprehensive guide to building data lakes using the Lambda Architecture. It introduces big data technologies like Hadoop, Spark, and Flume, showing how to use them effectively to manage and leverage enterprise-scale data. You'll gain the skills to design and implement data systems that handle complex data challenges. What this Book will help me do Master the use of Lambda Architecture to create scalable and effective data management systems. Understand and implement technologies like Hadoop, Spark, Kafka, and Flume in an enterprise data lake. Integrate batch and stream processing techniques using big data tools for comprehensive data analysis. Optimize data lakes for performance and reliability with practical insights and techniques. Implement real-world use cases of data lakes and machine learning for predictive data insights. Author(s) None Mishra, None John, and Pankaj Misra are recognized experts in big data systems with a strong background in designing and deploying data solutions. With a clear and methodical teaching style, they bring years of experience to this book, providing readers with the tools and knowledge required to excel in enterprise big data initiatives. Who is it for? This book is ideal for software developers, data architects, and IT professionals looking to integrate a data lake strategy into their enterprises. It caters to readers with a foundational understanding of Java and big data concepts, aiming to advance their practical knowledge of building scalable data systems. If you're eager to delve into cutting-edge technologies and transform enterprise data management, this book is for you.

Mastering PostGIS

"Mastering PostGIS" is your guide to unlocking the powerful capabilities of the PostGIS spatial database system. Across 328 pages, this book takes you through the essentials of spatial data handling, from importing, analyzing, and exporting datasets to building fully-functional GIS applications. You'll explore concepts such as spatial querying, data types, and integrating PostGIS with powerful tools like GeoServer and OpenLayers. What this Book will help me do Understand the fundamentals of PostGIS and its role in GIS workflows. Gain hands-on experience in SQL-based spatial queries and data manipulation. Develop the ability to integrate PostGIS with web platforms like Node.js, GeoServer, and OpenLayers. Discover strategies for spatial data ETL (Extract, Transform, Load) processes and live updates. Build scalable, efficient GIS applications leveraging PostGIS's capabilities. Author(s) George Silva, None Mikiewicz, and Michal Mackiewicz None are experts in GIS systems and database technologies with years of experience working with spatial databases such as PostGIS. Passionate about imparting practical knowledge, they provide clear, hands-on examples in every chapter to help you master spatial database solutions. Who is it for? This book is perfect for GIS developers and analysts looking to deepen their knowledge of PostGIS. If you aim to enhance your skills in designing GIS applications or performing spatial data analysis, this is your ideal resource. Prior experience with PostgreSQL and a basic installation of PostGIS are recommended for readers.

Mastering Ceph

Mastering Ceph offers a comprehensive guide to the Ceph distributed storage system, empowering you to implement and manage scalable storage solutions effectively. As you delve into the chapters, you'll gain the practical experience needed to handle Ceph with confidence, achieve resource optimization, and ensure high availability for critical applications.

What this Book will help me do
- Understand and utilize Ceph's advanced capabilities such as erasure coding and tiering for storage efficiency.
- Implement and manage scalable and resilient Ceph clusters effectively, easing resource allocation.
- Use tools like Ansible and Vagrant to deploy Ceph clusters quickly and reproducibly.
- Enhance your troubleshooting skills to resolve complex storage issues and ensure cluster stability.
- Develop applications that integrate with Ceph using librados and distributed computation classes.

Author(s)
This book was authored by Fisk, an experienced professional in cloud and distributed storage systems. Known for their expertise in Ceph, Fisk shares practical insights developed over years of working as an administrator and developer. Through their accessible and systematic writing, they guide readers to overcome real-world storage challenges.

Who is it for?
This detailed guide is ideal for developers and system administrators familiar with deploying Ceph who want to deepen their understanding of its advanced features. If you're aiming to optimize performance and design robust storage solutions, this is the book for you. Prior experience with Ceph is recommended to fully benefit from the book's insights.
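
For the librados integration topic, a minimal hedged sketch of the Python binding (python-rados) follows; the config path and pool name are placeholders, and a reachable cluster with an existing pool is assumed:

```python
import rados

# Connect using the cluster configuration and default keyring.
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("demo-pool")  # I/O context for one pool
    try:
        ioctx.write_full("greeting", b"hello from librados")
        print(ioctx.read("greeting"))
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```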

Mastering PostgreSQL 9.6

This comprehensive guide, 'Mastering PostgreSQL 9.6', delves into the advanced features of PostgreSQL, equipping you with the skills to optimize queries, manage replication, and ensure high availability. Whether you are implementing advanced administrative tasks or enhancing database performance, this book will provide the tools and knowledge you need.

What this Book will help me do
- Master advanced database functionalities in PostgreSQL 9.6.
- Enhance your proficiency in optimizing queries and using indexes effectively.
- Gain expertise in managing replication and ensuring high availability.
- Develop skills in server maintenance, monitoring, and resilience.
- Learn effective troubleshooting strategies for PostgreSQL database challenges.

Author(s)
Hans-Jürgen Schönig is an experienced database professional specializing in PostgreSQL consulting and training. With decades of experience in developing robust solutions, he brings a pragmatic and insightful approach to database management. His emphasis on practical application and clear explanations makes his writing accessible to learners at all levels.

Who is it for?
This book is ideal for PostgreSQL data architects and administrators looking to deepen their understanding of PostgreSQL's advanced functionalities. It's tailored for readers with prior experience in PostgreSQL administration and a working knowledge of SQL. If you're keen to master complex database tasks and optimize your PostgreSQL usage, you'll find this book invaluable.
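
The query-tuning loop at the heart of the optimization chapters can be sketched from Python (a hedged example: connection details and the `accounts` table are placeholders, and `CREATE INDEX IF NOT EXISTS` requires PostgreSQL 9.5 or later):

```python
import psycopg2

conn = psycopg2.connect("dbname=appdb user=app_user")
conn.autocommit = True
cur = conn.cursor()

def show_plan(sql):
    """Print the planner's chosen execution plan for a query."""
    cur.execute("EXPLAIN " + sql)
    for (line,) in cur.fetchall():
        print(line)

query = "SELECT * FROM accounts WHERE email = 'a@example.com'"
show_plan(query)  # without an index: likely a sequential scan

cur.execute(
    "CREATE INDEX IF NOT EXISTS idx_accounts_email ON accounts (email)"
)
show_plan(query)  # with the index: should switch to an index scan

cur.close()
conn.close()
```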

Hadoop 2.x Administration Cookbook

Gain mastery over managing and maintaining large Apache Hadoop clusters with the Hadoop 2.x Administration Cookbook. This book provides practical step-by-step recipes guiding you to efficiently set up, optimize, and troubleshoot Hadoop clusters, ensuring high availability, security, and optimal performance in your data operations.

What this Book will help me do
- Successfully set up and deploy an operational Hadoop 2.x cluster suitable for large-scale data operations.
- Effectively monitor and maintain Hadoop's HDFS, YARN, and MapReduce systems for optimized performance.
- Plan, configure, and enhance cluster availability using ZooKeeper and JournalNode strategies.
- Develop workflows and manage data ingestion processes with tools like Flume and Oozie.
- Secure, troubleshoot, and optimize Hadoop environments to meet enterprise and operational standards.

Author(s)
Aman Singh is an experienced Hadoop administrator with years of hands-on experience managing robust and efficient Hadoop clusters. Aman has a deep understanding of the practical challenges faced in this field and a talent for breaking down complex topics into actionable steps. Through clear, problem-oriented language, Aman helps readers achieve fluency in Hadoop administration.

Who is it for?
This book is ideal for system administrators or IT professionals who have a foundational understanding of Hadoop and aim to strengthen their administrative skills. It is especially beneficial for experienced Hadoop administrators looking for a quick and practical reference guide to master cluster management. Whether you're working in a large enterprise or exploring Hadoop ecosystems for personal development, you'll find this book invaluable.

High Performance Spark

Apache Spark is amazing when everything clicks. But if you haven't seen the performance improvements you expected, or still don't feel confident enough to use Spark in production, this practical book is for you. Authors Holden Karau and Rachel Warren demonstrate performance optimizations to help your Spark queries run faster and handle larger data sizes, while using fewer resources. Ideal for software engineers, data engineers, developers, and system administrators working with large-scale data applications, this book describes techniques that can reduce data infrastructure costs and developer hours. Not only will you gain a more comprehensive understanding of Spark, you'll also learn how to make it sing. With this book, you'll explore:

- How Spark SQL's new interfaces improve performance over Spark's older RDD data structure
- The choice between data joins in Core Spark and Spark SQL
- Techniques for getting the most out of standard RDD transformations
- How to work around performance issues in Spark's key/value pair paradigm
- Writing high-performance Spark code without Scala or the JVM
- How to test for functionality and performance when applying suggested improvements
- Using Spark MLlib and Spark ML machine learning libraries
- Spark's Streaming components and external community packages
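
The first bullet's contrast can be made concrete with a small hedged sketch: the same aggregation written against the RDD API and against the DataFrame API, where the declarative form gives Spark's optimizer room to work (the data is invented; assumes `pyspark` is installed):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()
sc = spark.sparkContext

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]

# RDD version: opaque lambdas the optimizer cannot see into.
rdd_sums = sc.parallelize(pairs).reduceByKey(lambda x, y: x + y)
print(sorted(rdd_sums.collect()))

# DataFrame version: declarative, optimizable by Catalyst/Tungsten.
df = spark.createDataFrame(pairs, ["key", "value"])
df.groupBy("key").sum("value").show()

spark.stop()
```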