O'Reilly Data Engineering Books

Practical Real-time Data Processing and Analytics

2017-09-28 O'Reilly Amazon

book

Prateek Bhati , Selva raj Ramasamy , Shilpi Saxena , Saurabh Gupta

data data-engineering streaming-messaging real-time-analytics Analytics Flink

This book provides a comprehensive guide to real-time data processing and analytics using modern frameworks like Apache Spark, Flink, Storm, and Kafka. Through practical examples and in-depth explanations, you will learn how to implement efficient, scalable, real-time processing pipelines. What this Book will help me do Understand real-time data processing essentials and the technology stack Learn integration of components like Apache Spark and Kafka Master the concepts of stream processing with detailed case studies Gain expertise in developing monitoring and alerting solutions for real-time systems Prepare to implement production-grade real-time data solutions Author(s) Shilpi Saxena and Saurabh Gupta, the authors, are experienced professionals in distributed systems and data engineering, focusing on practical applications of real-time computing. They bring their extensive industry experience to this book, helping readers understand the complexities of real-time data solutions in an approachable and hands-on manner. Who is it for? This book is ideal for software engineers and data engineers with a background in Java who seek to develop real-time data solutions. It is suitable for readers familiar with concepts of real-time data processing, and enhances knowledge in frameworks like Spark, Flink, Storm, and Kafka. Target audience includes learners building production data solutions and those designing distributed analytics engines.

IBM z14 Configuration Setup

2017-09-25 O'Reilly Amazon

book

Nelson Oliveira , Jannie Houlbjerg , Kazuhiro Nakajima , Bill White , Octavian Lascu , Peter Hoyle , Martin Söllig , Franco Pinto

data data-engineering IBM

Abstract IThis IBM® Redbooks® publication helps you install, configure, and maintain the IBM z14. The z14 offers new functions that require a comprehensive understanding of the available configuration options. This book presents configuration setup scenarios, and describes implementation examples in detail. This publication is intended for systems engineers, hardware planners, and anyone who needs to understand IBM Z configuration and implementation. Readers should be generally familiar with current IBM Z technology and terminology. For more information about the functions of the z14, see IBM z14 Technical Introduction, SG24-8450 and IBM z14 Technical Guide, SG24-8451.

Apache Spark 2.x Machine Learning Cookbook

2017-09-22 O'Reilly Amazon

book

Mohammed Guller , Meenakshi Rajendran , Shuen Mei , Broderick Hall , Siamak Amirghodsi

data data-engineering apache-spark AI/ML Analytics Big Data

This book is your gateway to mastering machine learning with Apache Spark 2.x. Through detailed hands-on recipes, you'll delve into building scalable ML models, optimizing big data processes, and enhancing project efficiency. Gain practical knowledge and explore real-world applications of recommendations, clustering, analytics, and more with Spark's powerful capabilities. What this Book will help me do Understand how to integrate Scala and Spark for effective machine learning development. Learn to create scalable recommendation engines using Spark. Master the development of clustering systems to organize unlabelled data at scale. Explore Spark libraries to implement efficient text analytics and search engines. Optimize large-scale data operations, tackling high-dimensional issues with Spark. Author(s) The team of authors brings expertise in machine learning, data science, and Spark technologies. Their combined industry experience and academic knowledge ensure the book is grounded in practical applications while offering theoretical insights. With clear explanations and a step-by-step approach, they aim to simplify complex concepts for developers and data scientists. Who is it for? This book is crafted for Scala developers familiar with machine learning concepts but seeking practical applications with Spark. If you have been implementing models but want to scale them and leverage Spark's robust ecosystem, this guide will serve you well. It is ideal for professionals seeking to deepen their skills in Spark and data science.

Kafka: The Definitive Guide

2017-09-19 O'Reilly Amazon

book

Gwen Shapira , Neha Narkhede , Todd Palino

data data-engineering streaming-messaging Kafka API Big Data

Every enterprise application creates data, whether it’s log messages, metrics, user activity, outgoing messages, or something else. And how to move all of this data becomes nearly as important as the data itself. If you’re an application architect, developer, or production engineer new to Apache Kafka, this practical guide shows you how to use this open source streaming platform to handle real-time data feeds. Engineers from Confluent and LinkedIn who are responsible for developing Kafka explain how to deploy production Kafka clusters, write reliable event-driven microservices, and build scalable stream-processing applications with this platform. Through detailed examples, you’ll learn Kafka’s design principles, reliability guarantees, key APIs, and architecture details, including the replication protocol, the controller, and the storage layer. Understand publish-subscribe messaging and how it fits in the big data ecosystem. Explore Kafka producers and consumers for writing and reading messages Understand Kafka patterns and use-case requirements to ensure reliable data delivery Get best practices for building data pipelines and applications with Kafka Manage Kafka in production, and learn to perform monitoring, tuning, and maintenance tasks Learn the most critical metrics among Kafka’s operational measurements Explore how Kafka’s stream delivery capabilities make it a perfect source for stream processing systems

IBM zPDT 2017 Sysplex Extensions

2017-09-12 O'Reilly Amazon

book

Frank Kyne , Bill Ogden

data data-engineering IBM Virtual Machine

Abstract This IBM® Redbooks® publication describes the IBM System z® Personal Development Tool (IBM zPDT®) 2017 Sysplex Extensions, which is a package that consists of sample files and supporting documentation to help you get a functioning, data sharing sysplex up and running with minimal time and effort. This book is a significant revision of zPDT 2016 Sysplex Extensions, SG24-8315, which is still available online for readers who need the IBM z/OS® 2.1 level of this package. This package is designed and tested to be installed on top of a standard Application Developer Controlled Distribution (ADCD) environment. It provides the extra files that you need to create a two-way data sharing IBM z/OS 2.2 sysplex that runs under IBM z/VM® in a zPDT environment.

High Availability for Oracle Database with IBM PowerHA SystemMirror and IBM Spectrum Virtualize HyperSwap

2017-09-07 O'Reilly Amazon

book

Ian MacQuarrie

data data-engineering IBM Oracle

This IBM® Redpaper™ publication describes the use of the IBM Spectrum™ Virtualize HyperSwap® function to provide a high availability (HA) storage infrastructure for Oracle databases across metro distances, using the IBM SAN Volume Controller. The HyperSwap function is available on all IBM storage technologies that use IBM Spectrum Virtualize™ software, which include the IBM SAN Volume Controller, IBM Storwize® V5000, IBM Storwize V7000, IBM FlashSystem® V9000, and IBM Spectrum Virtualize as software. This paper focuses on the functional behavior of HyperSwap when subjected to various failure conditions and provides detailed timings and error recovery sequences that occur in response to these failure conditions. This paper does not provide the details necessary to implement the reference architectures (although some implementation detail is provided).

IBM TS4500 R4 Tape Library Guide

2017-09-06 O'Reilly Amazon

book

Larry Coyne , Michael Engelbrecht , Simon Browne

data data-engineering IBM Cloud Computing ELK Cyber Security

Abstract The IBM® TS4500 (TS4500) tape library is a next-generation tape solution that offers higher storage density and integrated management than previous solutions. This IBM Redbooks® publication gives you a close-up view of the new IBM TS4500 tape library. In the TS4500, IBM delivers the density that today's and tomorrow's data growth requires. It has the cost-effectiveness and the manageability to grow with business data needs, while you preserve existing investments in IBM tape library products. Now, you can achieve both a low cost per terabyte (TB) and a high TB density per square foot, because the TS4500 can store up to 8.25 petabytes (PB) of uncompressed data in a single frame library or scale up at 1.5 PB per square foot to over 263 PB, which is more than 4 times the capacity of the IBM TS3500 tape library. The TS4500 offers these benefits: High availability dual active accessors with integrated service bays to reduce inactive service space by 40%. The Elastic Capacity option can be used to completely eliminate inactive service space. Flexibility to grow: The TS4500 library can grow from both the right side and the left side of the first L frame because models can be placed in any active position. Increased capacity: The TS4500 can grow from a single L frame up to an additional 17 expansion frames with a capacity of over 23,000 cartridges. High-density (HD) generation 1 frames from the existing TS3500 library can be redeployed in a TS4500. Capacity on demand (CoD): CoD is supported through entry-level, intermediate, and base-capacity configurations. Advanced Library Management System (ALMS): ALMS supports dynamic storage management, which enables users to create and change logical libraries and configure any drive for any logical library. Support for the IBM TS1155 while also supporting TS1150 and TS1140 tape drive: The TS1155 gives organizations an easy way to deliver fast access to data, improve security, and provide long-term retention, all at a lower cost than disk solutions. The TS1155 offers high-performance, flexible data storage with support for data encryption. Also, this enhanced fifth-generation drive can help protect investments in tape automation by offering compatibility with existing automation. The new TS1155 Tape Drive Model 55E delivers a 10 Gb Ethernet host attachment interface optimized for cloud-based and hyperscale environments. The TS1155 Tape Drive Model 55F delivers a native data rate of 360 MBps, the same load/ready, locate speeds, and access times as the TS1150, and includes dual-port 8 Gb Fibre Channel support. Support of the IBM Linear Tape-Open (LTO) Ultrium 7 tape drive: The LTO Ultrium 7 offering represents significant improvements in capacity, performance, and reliability over the previous generation, LTO Ultrium 6, while they still protect your investment in the previous technology. Integrated TS7700 back-end Fibre Channel (FC) switches are available. Up to four library-managed encryption (LME) key paths per logical library are available. This book describes the TS4500 components, feature codes, specifications, supported tape drives, encryption, new integrated management console (IMC), and command-line interface (CLI). You learn how to accomplish several specific tasks: Improve storage density with increased expansion frame capacity up to 2.4 times and support 33% more tape drives per frame. Manage storage by using the ALMS feature. Improve business continuity and disaster recovery with dual active accessor, automatic control path failover, and data path failover. Help ensure security and regulatory compliance with tape-drive encryption and Write Once Read Many (WORM) media. Support IBM LTO Ultrium 7, 6, and 5, IBM TS1155, TS1150, and TS1140 tape drives. Provide a flexible upgrade path for users who want to expand their tape storage as their needs grow. Reduce the storage footprint and simplify cabling with 10 U of rack space on top of the library. This guide is for anyone who wants to understand more about the IBM TS4500 tape library. It is particularly suitable for IBM clients, IBM Business Partners, IBM specialist sales representatives, and technical specialists.

Learn FileMaker Pro 16: The Comprehensive Guide to Building Custom Databases

2017-09-06 O'Reilly Amazon

book

Mark Conway Munro

data data-engineering filemaker Data Management JSON SQL

Extend FileMaker's built-in functionality and totally customize your data management environment with specialized functions and menus to super-charge the results and create a truly unique and focused experience. This book includes everything a beginner needs to get started building databases with FileMaker and contains advanced tips and techniques that the most seasoned professionals will appreciate. Written by a long time FileMaker developer, this book contains material for developers of every skill level. FileMaker Pro 16 is a powerful database development application used by millions of people in diverse industries to simplify data management tasks, leverage their business information in new ways and automate many mundane tasks. A custom solution built with FileMaker can quickly tap into a powerful set of capabilities and technologies to offer users an intuitive and pleasing environment in which to achieve new levels of efficiency and professionalism. What You’ll learn Create SQL queries to build fast and efficient formulas Discover new features of version 16 such as JSON functions, Cards, Layout Object window, SortValues, UniqueValues, using variables in Data Sources Write calculations using built-in and creating your own custom functions Discover the importance of a good approach to interface and technical design Apply best practices for naming conventions and usage standards Explore advanced topics about designing professional, open-ended solutions and using advanced techniques Who This Book Is For Casual programmers, full time consultants and IT professionals.

Modelling Business Information

2017-09-05 O'Reilly Amazon

book

Keith Gordon

data data-engineering data-models

This is an essential guide to entity relationship and class modelling for business analysts in line with, and beyond, the BCS Data Analysis syllabus.

Oracle ADF Survival Guide: Mastering the Application Development Framework

2017-09-04 O'Reilly Amazon

book

Sten Vesterli

data data-engineering oracle-database-solutions ADF Java Oracle

Quickly get up to speed with Oracle's Application Development Framework (ADF). Rapidly build modern, user-friendly applications that will be easy to re-use, expand, and maintain. Oracle ADF Survival Guide covers the latest 12c version and explains all the important concepts and parts, including ADF Faces, ADF Task Flows, ADF Business Components, ADF Skins, the new Alta UI, and how to implement business logic in all layers of the application. Organizations with existing investments in Oracle database and Oracle Forms applications will be able to leverage Oracle's best practice for application development in moving those applications to the ADF framework. The book: Explains all parts of the ADF stack Shows how to integrate with databases and web services Demonstrates the best practice for ADF enterprise architecture What You Will Learn Rapidly build great-looking, user-friendly screens Build page flows visually for improved communication with business users Easily connect your user interface to databases and other back-end systems Leverage the best practice for productive team development Establish a solid enterprise architecture for maximum reuse and maintainability Automate your build and deployment process Who This Book Is For Experienced developers who want to rapidly become productive with Oracle's Application Development Framework (ADF) 12c. It is for Oracle Forms and database developers working for organizations who have followed Oracle’s strategic direction to ADF, as well as for experienced Java developers who want to learn Oracle’s highly-productive, JSF framework.

EU General Data Protection Regulation (GDPR): An Implementation and Compliance Guide - Second edition

2017-08-31 O'Reilly Amazon

book

ITGP Privacy Team

data data-engineering data-security-privacy eu-general-data-protection-regulation-gdpr eu general data protection regulation (gdpr) Cloud Computing

The updated second edition of the bestselling guide to the changes your organisation needs to make to comply with the EU GDPR. “The clear language of the guide and the extensive explanations, help to explain the many doubts that arise reading the articles of the Regulation.” Giuseppe G. Zorzino The EU General Data Protection Regulation (GDPR) will supersede the 1995 EU Data Protection Directive (DPD) and all EU member states’ national laws based on it – including the UK Data Protection Act 1998 – in May 2018. All organisations – wherever they are in the world – that process the personal data of EU residents must comply with the Regulation. Failure to do so could result in fines of up to €20 million or 4% of annual global turnover. This book provides a detailed commentary on the GDPR, explains the changes you need to make to your data protection and information security regimes, and tells you exactly what you need to do to avoid severe financial penalties. Product overview Now in its second edition, EU GDPR – An Implementation and Compliance Guide is a clear and comprehensive guide to this new data protection law, explaining the Regulation, and setting out the obligations of data processors and controllers in terms you can understand. Topics covered include: The role of the data protection officer (DPO) – including whether you need one and what they should do. Risk management and data protection impact assessments (DPIAs), including how, when and why to conduct a DPIA. Data subjects’ rights, including consent and the withdrawal of consent; subject access requests and how to handle them; and data controllers’ and processors’ obligations. International data transfers to “third countries” – including guidance on adequacy decisions and appropriate safeguards; the EU-US Privacy Shield; international organisations; limited transfers; and Cloud providers. How to adjust your data protection processes to transition to GDPR compliance, and the best way of demonstrating that compliance. A full index of the Regulation to help you find the articles and stipulations relevant to your organisation. New for the second edition: Additional definitions. Further guidance on the role of the DPO. Greater clarification on data subjects’ rights. Extra guidance on data protection impact assessments. More detailed information on subject access requests (SARs). Clarification of consent and the alternative lawful bases for processing personal data. New appendix: implementation FAQ. The GDPR will have a significant impact on organisational data protection regimes around the world. EU GDPR – An Implementation and Compliance Guide shows you exactly what you need to do to comply with the new law.

IBM Tape Library Guide for Open Systems

2017-08-31 O'Reilly Amazon

book

Larry Coyne , Michael Engelbrecht , Simon Browne

data data-engineering IBM

Abstract This IBM® Redbooks® publication presents a general introduction to the latest IBM tape and tape library technologies. Featured tape technologies include the IBM LTO Ultrium and Enterprise 3592 tape drives, and their implementation in IBM tape libraries. This 14th edition includes information about the latest TS4300 Ultrium tape library and new TS1155 Enterprise tape drive, along with technical information about each IBM tape product for open systems, and includes generalized sections about Small Computer System Interface (SCSI) and Fibre Channel connections and multipath architecture configurations. This book also covers tools and techniques for library management. It is intended for anyone who wants to understand more about IBM tape products and their implementation. It is suitable for IBM clients, IBM Business Partners, IBM specialist sales representatives, and technical specialists. If you do not have a background in computer tape storage products, you might need to read other sources of information. In the interest of being concise, topics that are generally understood are not covered in detail.

Using IBM Spectrum Copy Data Management with IBM FlashSystem A9000 or A9000R and SAP HANA

2017-08-29 O'Reilly Amazon

book

Markus Oscheka , Axel Westphal , Bert Dufrasne

data data-engineering IBM Cloud Computing Data Management DevOps

Data is the currency of the new economy, and organizations are increasingly tasked with finding better ways to protect, recover, access, share, and use it. IBM Spectrum™ Copy Data Management is aimed at using existing data in a manner that is efficient, automated, scalable. It helps you manage all of those snapshot and IBM FlashCopy® images made to support DevOps, data protection, disaster recovery, and Hybrid Cloud computing environments. This IBM® Redpaper™ publication specifically addresses IBM Spectrum Copy Data Management in combination with IBM FlashSystem® A9000 or A9000R when used for Automated Disaster Recovery of SAP HANA.

Building Data Streaming Applications with Apache Kafka

2017-08-18 O'Reilly Amazon

book

Manisha Sethi , Anshul Joshi , Chanchal Singh , Manish Kumar

data data-engineering streaming-messaging Kafka Data Engineering Java

Learn how to design and build efficient real-time streaming applications using Apache Kafka, a leading distributed streaming platform. This book provides comprehensive guidance on setting up Kafka clusters, developing producers and consumers, and integrating with frameworks like Spark, Storm, and Heron. By the end, you'll master the skills needed to create enterprise-grade data streaming solutions. What this Book will help me do Grasp the core concepts and components of Apache Kafka and its ecosystem. Develop robust Kafka producers and consumers to process real-time data streams. Design and implement streaming applications using Spark, Storm, and Heron. Plan Kafka deployments with a focus on scalability, capacity, and fault tolerance. Ensure secure data streaming with best practices for securing Apache Kafka. Author(s) The authors, None Singh and None Kumar, bring years of expertise in data engineering and distributed systems. Having worked extensively with streaming technologies like Apache Kafka, they aim to share their in-depth knowledge through practical examples and real-world scenarios. Their approach to teaching focuses on making complex concepts easily understandable. Who is it for? This book is ideal for software developers and data engineers who are eager to learn Apache Kafka for building streaming applications. Some experience with programming, particularly Java, will help readers get the most out of the material. If you are working on data-processing systems or looking to enhance your skills in real-time data handling, this book caters to your needs.

Mastering Apache Storm

2017-08-16 O'Reilly Amazon

book

Ankit Jain

data data-engineering streaming-messaging storm Big Data Hadoop

Mastering Apache Storm is your step-by-step guide to mastering real-time data streaming with this robust framework. You'll learn how to process big data efficiently and integrate Apache Storm with popular technologies like Kafka, HBase, and Redis to maximize its potential. This book walks you through from basic concepts to advanced implementations of Apache Storm in real-world scenarios. What this Book will help me do Understand the core features and operation of Apache Storm for real-time data streaming. Integrate Apache Storm with other Big Data frameworks like Kafka, HBase, Redis, and Hadoop. Effectively deploy and manage multi-node Apache Storm clusters in real-world environments. Monitor and analyze your data streams and system health effectively using built-in and external tools. Learn to implement fault-tolerant, scalable, and distributed stream processing applications in Apache Storm. Author(s) None Jain is an experienced software developer and technical instructor specializing in distributed systems and real-time data processing. With years of experience working with Apache Storm and related technologies, their teachings focus on practical, hands-on learning to equip readers with actionable skills. Who is it for? This book is ideal for Java developers aspiring to build expertise in real-time data streaming and distributed processing applications using Apache Storm. Beginners can start with the fundamentals provided, while those with prior knowledge can delve into intermediate and advanced implementations.

Essentials of Cloud Application Development on IBM Bluemix

2017-08-07 O'Reilly Amazon

book

Hala A. Aziz , Ahmed Azraq , Sally Fikry , Ben Smith , Mohamed El-Khouly , Ahmed S. Hassan

data data-engineering IBM Agile/Scrum API Cloud Computing

Abstract This IBM® Redbooks® publication is based on the Presentations Guide of the course Essentials of Cloud Application Development on IBM Bluemix that was developed by the IBM Redbooks team in partnership with IBM Skills Academy Program. This course is designed to teach university students the basic skills that are required to develop, deploy, and test cloud-based applications that use the IBM Bluemix® cloud services. The primary target audience for this course is university students in undergraduate computer science and computer engineer programs with no previous experience working in cloud environments. However, anyone new to cloud computing can also benefit from this course. After completing this course, you should be able to accomplish the following tasks: Define cloud computing Describe the factors that lead to the adoption of cloud computing Describe the choices that developers have when creating cloud applications Describe infrastructure as a service, platform as a service, and software as a service Describe IBM Bluemix and its architecture Identify the runtimes and services that IBM Bluemix offers Describe IBM Bluemix infrastructure types Create an application in IBM Bluemix Describe the IBM Bluemix dashboard, catalog, and documentation features Explain how the application route is used to test an application from the browser Create services in IBM Bluemix Describe how to bind services to an application in IBM Bluemix Describe the environment variables that are used with IBM Bluemix services Explain what are IBM Bluemix organizations, domains, spaces, and users Describe how to create an IBM SDK for Node.js application that runs on IBM Bluemix Explain how to manage your IBM Bluemix account with the Cloud Foundry CLI Describe how to set up and use the IBM Bluemix plug-in for Eclipse Describe the role of Node.js for server-side scripting Describe IBM Bluemix DevOps Services and the capabilities of IBM DevOps Services Identify the Web IDE features in IBM Bluemix DevOps Describe how to connect a Git repository client to Bluemix DevOps Services project Explain the pipeline build and deploy processes that IBM Bluemix DevOps Services use Describe how IBM Bluemix DevOps Services integrate with the IBM Bluemix cloud Describe the agile planning tools in IBM Bluemix Describe the characteristics of REST APIs Explain the advantages of the JSON data format Describe an example of REST APIs using Watson Describe the main types of data services in IBM Bluemix Describe the benefits of IBM Cloudant® Explain how Cloudant databases and documents are accessed from IBM Bluemix Describe how to use REST APIs to interact with Cloudant database Describe Bluemix mobile backend as a service (MBaaS) and the MBaaS architecture Describe the Push Notifications service Describe the App ID service Describe the Kinetise service Describe how to create Bluemix Mobile applications by using MobileFirst Services Starter Boilerplate The workshop materials were created in June 2017. Therefore, all IBM Bluemix features that are described in this Presentations Guide and IBM Bluemix user interfaces that are used in the examples are current as of June 2017.

IBM Spectrum Archive Enterprise Edition V1.2.4: Installation and Configuration Guide

2017-08-05 O'Reilly Amazon

book

Wei Zheng Ong , Illarion Borisevich , Larry Coyne , Khanh Ngo , Stefan Neff

data data-engineering IBM

Abstract This IBM® Redbooks® publication helps you with the planning, installation, and configuration of the new IBM Spectrum™ Archive (formerly IBM Linear Tape File System™ (LTFS)) Enterprise Edition (EE) V1.2.4.0 for the IBM TS3310, IBM TS3500, and IBM TS4500 tape libraries. IBM Spectrum Archive™ EE enables the use of the LTFS for the policy management of tape as a storage tier in an IBM Spectrum Scale™ based environment and helps encourage the use of tape as a critical tier in the storage environment. This is the fourth edition of IBM Spectrum Archive V1.2 (SG24-8333) although it is based on the prior editions of IBM Linear Tape File System Enterprise Edition V1.1.1.2: Installation and Configuration Guide, SG24-8143. IBM Spectrum Archive EE can run any application that is designed for disk files on a physical tape media. IBM Spectrum Archive EE supports the IBM Linear Tape-Open (LTO) Ultrium 7, 6, and 5 tape drives in IBM TS3310, TS3500, and TS4500 tape libraries. In addition, IBM TS1155, TS1150, and TS1140 tape drives are supported in TS3500 and TS4500 tape library configurations. IBM Spectrum Archive EE can play a major role in reducing the cost of storage for data that does not need the access performance of primary disk. The use of IBM Spectrum Archive EE to replace disks with physical tape in tier 2 and tier 3 storage can improve data access over other storage solutions because it improves efficiency and streamlines management for files on tape. IBM Spectrum Archive EE simplifies the use of tape by making it transparent to the user and manageable by the administrator under a single infrastructure. This publication is intended for anyone who wants to understand more about IBM Spectrum Archive EE planning and implementation. This book is suitable for IBM clients, IBM Business Partners, IBM specialist sales representatives, and technical specialists.

Data Warehousing with Greenplum

2017-07-27 O'Reilly Amazon

book

Marshall Presser

data data-engineering storage-repositories data-warehouse Analytics BI

Relational databases haven’t gone away, but they are evolving to integrate messy, disjointed unstructured data into a cleansed repository for analytics. With the execution of massively parallel processing (MPP), the latest generation of analytic data warehouses is helping organizations move beyond business intelligence to processing a variety of advanced analytic workloads. These MPP databases expose their power with the familiarity of SQL. This report introduces the Greenplum Database, recently released as an open source project by Pivotal Software. Lead author Marshall Presser of Pivotal Data Engineering takes you through the Greenplum approach to data analytics and data-driven decisions, beginning with Greenplum’s shared-nothing architecture. You’ll explore data organization and storage, data loading, running queries, as well as performing analytics in the database. You’ll learn: How each networked node in Greenplum’s architecture features an independent operating system, memory, and storage Four deployment options to help you balance security, cost, and time to usability Ways to organize data, including distribution, storage, partitioning, and loading How to use Apache MADlib for in-database analytics, and GPText to process and analyze free-form text Tools for monitoring, managing, securing, and optimizing query responses available in the Pivotal Greenplum commercial database

Mastering Complexity

2017-07-27 O'Reilly Amazon

book

Stephen Denker

data data-engineering integration-solutions

The author covers fourteen tools to help you find the information you need and offers step-by-step instructions for constructing each one. He shows you how these tools can be combined with a set of simple problem-solving steps that can act as a powerful change agent to help reduce or eliminate process problems. Five-Step Problem-Solving Process Identify the problem: Clearly state what needs improvement. Analyze: Determine what causes the problem to occur. Evaluate Alternatives: Identify and select actions to reduce or eliminate the problem. Test Implement: Implement these actions on a trial basis to determine their effectiveness. Standardize: Ensure that useful actions are preserved.

Apache Spark 2.x for Java Developers

2017-07-26 O'Reilly Amazon

book

Sourav Gulati , Sumit Kumar

data data-engineering apache-spark AI/ML Analytics API

Delve into mastering big data processing with 'Apache Spark 2.x for Java Developers.' This book provides a practical guide to implementing Apache Spark using the Java APIs, offering a unique opportunity for Java developers to leverage Spark's powerful framework without transitioning to Scala. What this Book will help me do Learn how to process data from formats like XML, JSON, CSV using Spark Core. Implement real-time analytics using Spark Streaming and third-party tools like Kafka. Understand data querying with Spark SQL and master SQL schema processing. Apply machine learning techniques with Spark MLlib to real-world scenarios. Explore graph processing and analytics using Spark GraphX. Author(s) None Kumar and None Gulati, experienced professionals in Java development and big data, bring their wealth of practical experience and passion for teaching to this book. With a clear and concise writing style, they aim to simplify Spark for Java developers, making big data approachable. Who is it for? This book is perfect for Java developers who are eager to expand their skillset into big data processing with Apache Spark. Whether you are a seasoned Spark user or first diving into big data concepts, this book meets you at your level. With practical examples and straightforward explanations, you can unlock the potential of Spark in real-world scenarios.

IBM z14 Technical Introduction

2017-07-26 O'Reilly Amazon

book

Esra Ufacik , Frank Packheiser , John Troy , Bill White , Octavian Lascu , Michal Kordyzon , Hervey Kamga , Bo XU

data data-engineering IBM Agile/Scrum Analytics Cloud Computing

Abstract This IBM® Redpaper Redbooks® publication introduces the latest IBM Z platform, the IBM z14®. It includes information about the Z environment and how it helps integrate data and transactions more securely, and can infuse insight for faster and more accurate business decisions. The z14 is state-of-the-art data and transaction system that delivers advanced capabilities, which are vital to the digital era and the trust economy. These capabilities include: - Securing data with pervasive encryption - Transforming a transactional platform into a data powerhouse - Getting more out of the platform with IT Operational Analytics - Providing resilience with key to zero downtime - Accelerating digital transformation with agile service delivery - Revolutionizing business processes - Blending open source and Z technologies This book explains how this system uses both new innovations and traditional Z strengths to satisfy growing demand for cloud, analytics, and security. With the z14 as the base, applications can run in a trusted, reliable, and secure environment that both improves operations and lessens business risk.

Mastering Apache Spark 2.x - Second Edition

2017-07-26 O'Reilly Amazon

book

Romeo Kienzler

data data-engineering apache-spark AI/ML Analytics Big Data

Mastering Apache Spark 2.x is the essential guide to harnessing the power of big data processing. Dive into real-time data analytics, machine learning, and cluster computing using Apache Spark's advanced features and modules like Spark SQL and MLlib. What this Book will help me do Gain proficiency in Spark's batch and real-time data processing with SparkSQL. Master techniques for machine learning and deep learning using SparkML and SystemML. Understand the principles of Spark's graph processing with GraphX and GraphFrames. Learn to deploy Apache Spark efficiently on platforms like Kubernetes and IBM Cloud. Optimize Spark cluster performance by configuring parameters effectively. Author(s) Romeo Kienzler is a seasoned professional in big data and machine learning technologies. With years of experience in cloud-based distributed systems, Romeo brings practical insights into leveraging Apache Spark. He combines his deep technical expertise with a clear and engaging writing style. Who is it for? This book is tailored for intermediate Apache Spark users eager to deepen their knowledge in Spark 2.x's advanced features. Ideal for data engineers and big data professionals seeking to enhance their analytics pipelines with Spark. A basic understanding of Spark and Scala is necessary. If you're aiming to optimize Spark for real-world applications, this book is crafted for you.

SQL Server 2016 High Availability Unleashed (includes Content Update Program)

2017-07-25 O'Reilly Amazon

book

Paul Bertucc , Raju Shreewastava

data data-engineering relational-databases microsoft-sql-server AWS Azure

Book + Content Update Program SQL Server 2016 High Availability Unleashed provides start-to-finish coverage of SQL Server’s powerful high availability (HA) solutions for your traditional on-premise databases, cloud-based databases (Azure or AWS), hybrid databases (on-premise coupled with the cloud), and your emerging Big Data solutions. This complete guide introduces an easy-to-follow, formal HA methodology that has been refined over the past several years and helps you identity the right HA solution for your needs. There is also additional coverage of both disaster recovery and business continuity architectures and considerations. You are provided with step-by-step guides, examples, and sample code to help you set up, manage, and administer these highly available solutions. All examples are based on existing production deployments at major Fortune 500 companies around the globe. This book is for all intermediate-to-advanced SQL Server and Big Data professionals, but is also organized so that the first few chapters are great foundation reading for CIOs, CTOs, and even some tech-savvy CFOs. Learn a formal, high availability methodology for understanding and selecting the right HA solution for your needs Deep dive into Microsoft Cluster Services Use selective data replication topologies Explore thorough details on AlwaysOn and availability groups Learn about HA options with log shipping and database mirroring/ snapshots Get details on Microsoft Azure for Big Data and Azure SQL Explore business continuity and disaster recovery Learn about on-premise, cloud, and hybrid deployments Provide all types of database needs, including online transaction processing, data warehouse and business intelligence, and Big Data Explore the future of HA and disaster recovery In addition, this book is part of InformIT’s exciting Content Update Program, which provides content updates for major technology improvements! As significant updates are made to SQL Server, sections of this book will be updated or new sections will be added to match the updates to the technologies. As updates become available, they will be delivered to you via a free Web Edition of this book, which can be accessed with any Internet connection. To learn more, visit informit.com/cup. How to access the Web Edition: Follow the instructions inside to learn how to register your book to access the FREE Web Edition. * The companion material is not available with the online edition on O'Reilly Learning

IBM Db2: Investigating Automatic Storage Table Spaces and Data Skew

2017-07-20 O'Reilly Amazon

book

George Wangelien , Zachary Hoggard

data data-engineering relational-databases ibm-db2 IBM

The scope of this IBM® Redpaper™ publication is to provide a high-level overview of automatic storage table spaces, table space maps, table space extent maps, and physically unbalanced data across automatic storage table space containers (that is, data skew). The objective of this paper is to investigate causes of data skew and make suggestions for how to resolve it. This paper is for Database Administrators (DBAs) of IBM Db2®; the DBAs should have general Db2 knowledge and skills. The environment used for the creation of this document is Db2 Version 11.1, and an IBM AIX® operating system. This document is based on results of testing various scenarios.

IBM Spectrum Accelerate Deployment, Usage, and Maintenance

2017-07-19 O'Reilly Amazon

book

Markus Oscheka , Bertrand Dufrasne , Abilio Oliveira , Grant Kabobel

data data-engineering IBM Agile/Scrum Cloud Computing

Abstract This edition applies to IBM® Spectrum Accelerate V11.5.4. IBM Spectrum Accelerate™, a member of IBM Spectrum Storage™, is an agile, software-defined storage solution for enterprise and cloud that builds on the customer-proven and mature IBM XIV® storage software. The key characteristic of Spectrum Accelerate is that it can be easily deployed and run on purpose-built or existing hardware that is chosen by the customer. IBM Spectrum Accelerate enables rapid deployment of high-performance and scalable block data storage infrastructure over commodity hardware on-premises or off-premises. This IBM Redbooks® publication provides a broad understanding of IBM Spectrum Accelerate. The book introduces Spectrum Accelerate and describes planning and preparation that are essential for a successful deployment of the solution. The deployment is described through a step-by-step approach, by using a graphical user interface (GUI) based method or a simple command-line interface (CLI) based procedure. Chapters in this book describe the logical configuration of the system, host support and business continuity functions, and migration. Although it makes many references to the XIV storage software, the book also emphasizes where IBM Spectrum Accelerate differs from XIV. Finally, a substantial portion of the book is dedicated to maintenance and troubleshooting to provide detailed guidance for the customer support personnel.

talk-data.com

O'Reilly Data Engineering Books

Top Topics

Top Speakers

Practical Real-time Data Processing and Analytics

IBM z14 Configuration Setup

Apache Spark 2.x Machine Learning Cookbook

Kafka: The Definitive Guide

IBM zPDT 2017 Sysplex Extensions

High Availability for Oracle Database with IBM PowerHA SystemMirror and IBM Spectrum Virtualize HyperSwap

IBM TS4500 R4 Tape Library Guide

Learn FileMaker Pro 16: The Comprehensive Guide to Building Custom Databases

Modelling Business Information

Oracle ADF Survival Guide: Mastering the Application Development Framework

EU General Data Protection Regulation (GDPR): An Implementation and Compliance Guide - Second edition

IBM Tape Library Guide for Open Systems

Using IBM Spectrum Copy Data Management with IBM FlashSystem A9000 or A9000R and SAP HANA

Building Data Streaming Applications with Apache Kafka

Mastering Apache Storm

Essentials of Cloud Application Development on IBM Bluemix

IBM Spectrum Archive Enterprise Edition V1.2.4: Installation and Configuration Guide

Data Warehousing with Greenplum

Mastering Complexity

Apache Spark 2.x for Java Developers

IBM z14 Technical Introduction

Mastering Apache Spark 2.x - Second Edition

SQL Server 2016 High Availability Unleashed (includes Content Update Program)

IBM Db2: Investigating Automatic Storage Table Spaces and Data Skew

IBM Spectrum Accelerate Deployment, Usage, and Maintenance