O'Reilly Data Engineering Books

Mastering Large Datasets with Python

2020-01-27 O'Reilly Amazon

book

John Wolohan

data data-engineering AI/ML AWS Cloud Computing Data Science

Modern data science solutions need to be clean, easy to read, and scalable. In Mastering Large Datasets with Python, author J.T. Wolohan teaches you how to take a small project and scale it up using a functionally influenced approach to Python coding. You’ll explore methods and built-in Python tools that lend themselves to clarity and scalability, like the high-performing parallelism method, as well as distributed technologies that allow for high data throughput. The abundant hands-on exercises in this practical tutorial will lock in these essential skills for any large-scale data science project.
About the Technology: Programming techniques that work well on laptop-sized data can slow to a crawl, or fail altogether, when applied to massive files or distributed datasets. By mastering the powerful map and reduce paradigm, along with the Python-based tools that support it, you can write data-centric applications that scale efficiently without requiring codebase rewrites as your requirements change.
About the Book: Mastering Large Datasets with Python teaches you to write code that can handle datasets of any size. You’ll start with laptop-sized datasets that teach you to parallelize data analysis by breaking large tasks into smaller ones that can run simultaneously. You’ll then scale those same programs to industrial-sized datasets on a cluster of cloud servers. With the map and reduce paradigm firmly in place, you’ll explore tools like Hadoop and PySpark to efficiently process massive distributed datasets, speed up decision-making with machine learning, and simplify your data storage with AWS S3.
What's Inside:
• An introduction to the map and reduce paradigm
• Parallelization with the multiprocessing module and pathos framework
• Hadoop and Spark for distributed computing
• Running AWS jobs to process large datasets
About the Reader: For Python programmers who need to work faster with more data.
About the Author: J.T. Wolohan is a lead data scientist at Booz Allen Hamilton, and a PhD researcher at Indiana University, Bloomington.
Quotes:
• A clear and efficient path to mastery of the map and reduce paradigm for developers of all levels. - Justin Fister, GrammarBot
• An amazing book for anybody looking to add parallel processing and the map/reduce pattern to their toolkit. - Gary Bake, Radius Payment Solutions
• Learn fundamentals of MapReduce and other core concepts and save money on expensive hardware! - Al Krinker, USPTO
• A comprehensive guide to the fundamentals of efficient Python data processing. - Craig Pfeifer, MITRE Corporation
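To make the map and reduce paradigm mentioned in this description concrete, here is a minimal sketch in Python. It is not taken from the book: the function names (`count_words`, `add`, `total_words`), the `map_fn` parameter, and the sample data are all hypothetical, chosen only for illustration. It assumes the task is counting words across lines of text, and shows how the same code can run with the built-in `map` or with `multiprocessing.Pool.map`.

```python
from functools import reduce
from multiprocessing import Pool


def count_words(line):
    """Map step: count the words in a single line of text."""
    return len(line.split())


def add(total, count):
    """Reduce step: fold one partial count into the running total."""
    return total + count


def total_words(lines, map_fn=map):
    """Apply the map step to every line, then reduce the results to one number.

    map_fn is a hypothetical hook: pass the built-in map for sequential work
    or Pool.map for parallel work. The rest of the function does not change.
    """
    counts = map_fn(count_words, lines)
    return reduce(add, counts, 0)


if __name__ == "__main__":
    sample = ["map and reduce", "scale with python", "large datasets"]

    # Sequential run with the built-in map.
    print(total_words(sample))

    # Parallel run: the same function, with the map step spread over 2 processes.
    with Pool(processes=2) as pool:
        print(total_words(sample, map_fn=pool.map))
```

Both calls print the same total (8 for the sample above). Only the map implementation changes between the sequential and parallel runs, which is one way to read the description's point about scaling without codebase rewrites.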

IBM TS4500 R6 Tape Library Guide

2020-01-22 O'Reilly Amazon

book

Jesus Eduardo Cervantes Rolon, Larry Coyne, Robert Beiderbeck, Khanh Ngo, Jeremy Tudgay

data data-engineering IBM Cloud Computing ELK Cyber Security

The IBM® TS4500 (TS4500) tape library is a next-generation tape solution that offers higher storage density and more integrated management than previous solutions. This IBM Redbooks® publication gives you a close-up view of the new IBM TS4500 tape library. In the TS4500, IBM delivers the density that today's and tomorrow's data growth requires. It has the cost-effectiveness and the manageability to grow with business data needs, while you preserve existing investments in IBM tape library products. Now, you can achieve both a low cost per terabyte (TB) and a high TB density per square foot because the TS4500 can store up to 11 petabytes (PB) of uncompressed data in a single-frame library or scale up at 2 PB per square foot to over 350 PB.
The TS4500 offers the following benefits:
• High availability: Dual active accessors with integrated service bays reduce inactive service space by 40%. The Elastic Capacity option can be used to completely eliminate inactive service space.
• Flexibility to grow: The TS4500 library can grow from the right side and the left side of the first L frame because models can be placed in any active position.
• Increased capacity: The TS4500 can grow from a single L frame up to an additional 17 expansion frames with a capacity of over 23,000 cartridges. High-density (HD) generation 1 frames from the TS3500 library can be redeployed in a TS4500.
• Capacity on demand (CoD): CoD is supported through entry-level, intermediate, and base-capacity configurations.
• Advanced Library Management System (ALMS): ALMS supports dynamic storage management, which enables users to create and change logical libraries and configure any drive for any logical library.
• Support for the IBM TS1160 tape drive, alongside the TS1155, TS1150, and TS1140 tape drives: The TS1160 gives organizations an easy way to deliver fast access to data, improve security, and provide long-term retention, all at a lower cost than disk solutions. The TS1160 offers high-performance, flexible data storage with support for data encryption. Also, this enhanced fifth-generation drive can help protect investments in tape automation by offering compatibility with existing automation. The TS1160 Tape Drive Model 60E delivers a dual 10 Gb or 25 Gb Ethernet host attachment interface that is optimized for cloud-based and hyperscale environments. The TS1160 Tape Drive Model 60F delivers a native data rate of 400 MBps, the same load/ready, locate speeds, and access times as the TS1155, and includes dual-port 16 Gb Fibre Channel support.
• Support of the IBM Linear Tape-Open (LTO) Ultrium 8 tape drive: The LTO Ultrium 8 offering represents significant improvements in capacity, performance, and reliability over the previous generation, LTO Ultrium 7, while still protecting your investment in the previous technology.
• Support of LTO 8 Type M cartridge (M8): The LTO Program is introducing a new capability with LTO-8 drives: the ability to write 9 TB on a brand-new LTO-7 cartridge instead of the 6 TB specified by the LTO-7 format. Such a cartridge is called an LTO-7 initialized LTO-8 Type M cartridge.
• Integrated TS7700 back-end Fibre Channel (FC) switches are available.
• Up to four library-managed encryption (LME) key paths per logical library are available.
This book describes the TS4500 components, feature codes, specifications, supported tape drives, encryption, new integrated management console (IMC), command-line interface (CLI), and REST over SCSI (RoS) to obtain status information about library components.
You learn how to accomplish the following specific tasks:
• Improve storage density with increased expansion frame capacity up to 2.4 times and support 33% more tape drives per frame

IBM Storage for Red Hat OpenShift Blueprint Version 1 Release 3

2020-01-21 O'Reilly Amazon

book

IBM

data data-engineering IBM Cloud Computing DevOps Cyber Security

IBM Storage for Red Hat OpenShift is a comprehensive container-ready solution that includes all the hardware and software components necessary to set up and/or expand your Red Hat OpenShift environment. This blueprint includes Red Hat OpenShift Container Platform and uses Container Storage Interface (CSI) standards. IBM Storage brings enterprise data services to containers. In this blueprint, learn how to:
· Combine the benefits of IBM Systems with the performance of IBM Storage solutions so that you can deliver the right services to your clients today!
· Build a 24 by 7 by 365 enterprise-class private cloud with Red Hat OpenShift Container Platform utilizing new open source Container Storage Interface (CSI) drivers
· Leverage enterprise-class services such as NVMe-based flash performance, high data availability, and advanced container security
IBM Storage for Red Hat OpenShift Container Platform is designed for your DevOps environment for on-premises deployment with easy-to-consume components built to perform and scale for your enterprise. Simplify your journey to cloud with pre-tested and validated blueprints engineered to enable rapid deployment and peace of mind as you move to a hybrid multicloud environment. You now have the capabilities.

Apache Pulsar Versus Apache Kafka

2019-12-25 O'Reilly Amazon

book

Chris Bartholomew

data data-engineering apache-pulsar Cloud Computing Kafka Kubernetes

For nearly a decade, Apache Kafka has been the go-to publish-subscribe (pub-sub) messaging system, and for good reason. It offers functionality for a wide range of enterprise use cases, along with a large ecosystem of tools and a dedicated community. But lately, upstart Apache Pulsar has been gaining ground. This detailed report explains why. Apache Pulsar takes the best parts of Kafka and expands on them to solve problems that were out of scope of Kafka’s original design. Author Chris Bartholomew shows you how Kafka and Pulsar compare and where they differ. Engineers and other technical decision makers will learn the advantages that make Pulsar a compelling alternative to Kafka.
• Explore the architecture and major components of Kafka and Pulsar
• Discover the benefits of Pulsar’s subscription model for messaging
• Understand how Pulsar simplifies the messaging system for organizations that need high-performance pub-sub messaging, delivery guarantees, and traditional messaging patterns
• Learn how Pulsar’s separation of serving and storing makes it natural to run in cloud native environments like Kubernetes
• See how Kafka and Pulsar perform on the OpenMessaging Project benchmark

The Rise of Operational Analytics

2019-12-25 O'Reilly Amazon

book

Scott Haines

data data-engineering AI/ML Analytics Cloud Computing Kafka

Fast access to data has become a critical game changer. Today, a new breed of company understands that the faster they can build, access, and share well-defined datasets, the more competitive they’ll be in our data-driven world. In this practical report, Scott Haines from Twilio introduces you to operational analytics, a new approach for making sense of all the data flooding into business systems. Data architects and data scientists will see how Apache Kafka and other tools and processes laid the groundwork for fast analytics on a mix of historical and near-real-time data. You’ll learn how operational analytics feeds minute-by-minute customer interactions, and how NewSQL databases have entered the scene to drive machine learning algorithms, AI programs, and ongoing decision-making within an organization.
• Understand the key advantages that data-driven companies have over traditional businesses
• Explore the rise of operational analytics and how this method relates to current tech trends
• Examine the impact of can’t-wait business decisions and won’t-wait customer experiences
• Discover how NewSQL databases support cloud native architecture and set the stage for operational databases
• Learn how to choose the right database to support operational analytics in your organization

Jumpstart Snowflake: A Step-by-Step Guide to Modern Cloud Analytics

2019-12-20 O'Reilly Amazon

book

Donna Strok, Dmitry Shirokov, Dmitry Anoshin

data data-engineering Snowflake Analytics AWS Azure

Explore the modern market of data analytics platforms and the benefits of using Snowflake computing, the data warehouse built for the cloud. With the rise of cloud technologies, organizations prefer to deploy their analytics using cloud providers such as Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform. Cloud vendors are offering modern data platforms for building cloud analytics solutions to collect data and consolidate it into single storage solutions that provide insights for business users. The core of any analytics framework is the data warehouse, and previously customers did not have many choices of platform to use. Snowflake was built specifically for the cloud and it is a true game changer for the analytics market. This book will help onboard you to Snowflake and present best practices to deploy and use the Snowflake data warehouse. In addition, it covers modern analytics architecture and use cases. It provides use cases of integration with leading analytics software such as Matillion ETL, Tableau, and Databricks. Finally, it covers migration scenarios for on-premises legacy data warehouses.
What You Will Learn:
• Know the key functionalities of Snowflake
• Set up security and access with cluster
• Bulk load data into Snowflake using the COPY command
• Migrate from a legacy data warehouse to Snowflake
• Integrate the Snowflake data platform with modern business intelligence (BI) and data integration tools
Who This Book Is For: Those working with data warehouse and business intelligence (BI) technologies, and existing and potential Snowflake users

IBM Storage for Red Hat OpenShift Blueprint Version 1 Release 2

2019-12-17 O'Reilly Amazon

book

IBM

data data-engineering IBM Cloud Computing DevOps Cyber Security

IBM Storage for Red Hat OpenShift is a comprehensive container-ready solution that includes all the hardware and software components necessary to set up and/or expand your Red Hat OpenShift environment. This blueprint includes Red Hat OpenShift Container Platform and uses Container Storage Interface (CSI) standards. IBM Storage brings enterprise data services to containers. In this blueprint, learn how to:
· Combine the benefits of IBM Systems with the performance of IBM Storage solutions so that you can deliver the right services to your clients today!
· Build a 24 by 7 by 365 enterprise-class private cloud with Red Hat OpenShift Container Platform utilizing new open source Container Storage Interface (CSI) drivers
· Leverage enterprise-class services such as NVMe-based flash performance, high data availability, and advanced container security
IBM Storage for Red Hat OpenShift Container Platform is designed for your DevOps environment for on-premises deployment with easy-to-consume components built to perform and scale for your enterprise. Simplify your journey to cloud with pre-tested and validated blueprints engineered to enable rapid deployment and peace of mind as you move to a hybrid multicloud environment. You now have the capabilities.

Hands On Google Cloud SQL and Cloud Spanner: Deployment, Administration and Use Cases with Python

2019-12-16 O'Reilly Amazon

book

Shakuntala Gupta Edward, Navin Sabharwal

data data-engineering relational-databases google-cloud-sql Big Data Cloud Computing

Discover the methodologies and best practices for getting started with Google Cloud Platform relational services: CloudSQL and CloudSpanner. The book begins with the basics of working with the Google Cloud Platform along with an introduction to the database technologies available for developers from Google Cloud. You'll then take an in-depth, hands-on journey into Google CloudSQL and CloudSpanner, including choosing the right platform for your application needs, planning, provisioning, designing, and developing your application. Sample applications are given that use Python to connect to CloudSQL and CloudSpanner, along with helpful features provided by the engines. You'll also implement practical best practices in the last chapter. Hands On Google Cloud SQL and Cloud Spanner is a great starting point to apply GCP data offerings in your technology stack, and the code used allows you to try out the examples and extend them in interesting ways.
What You'll Learn:
• Get started with Big Data technologies on the Google Cloud Platform
• Review CloudSQL and Cloud Spanner from basics to administration
• Apply best practices and use Google’s CloudSQL and CloudSpanner offerings
• Work with code in Python notebooks and scripts
Who This Book Is For: Application architects, database architects, software developers, data engineers, cloud architects.

Information Privacy Engineering and Privacy by Design: Understanding Privacy Threats, Technology, and Regulations Based on Standards and Best Practices

2019-12-12 O'Reilly Amazon

book

William Stallings

data data-engineering data-security-privacy data security & privacy Cloud Computing GDPR/CCPA

The Comprehensive Guide to Engineering and Implementing Privacy Best Practices
As systems grow more complex and cybersecurity attacks more relentless, safeguarding privacy is ever more challenging. Organizations are increasingly responding in two ways, and both are mandated by key standards such as GDPR and ISO/IEC 27701:2019. The first approach, privacy by design, aims to embed privacy throughout the design and architecture of IT systems and business practices. The second, privacy engineering, encompasses the technical capabilities and management processes needed to implement, deploy, and operate privacy features and controls in working systems. In Information Privacy Engineering and Privacy by Design, internationally renowned IT consultant and author William Stallings brings together the comprehensive knowledge privacy executives and engineers need to apply both approaches. Using the techniques he presents, IT leaders and technical professionals can systematically anticipate and respond to a wide spectrum of privacy requirements, threats, and vulnerabilities, addressing regulations, contractual commitments, organizational policies, and the expectations of their key stakeholders.
• Review privacy-related essentials of information security and cryptography
• Understand the concepts of privacy by design and privacy engineering
• Use modern system access controls and security countermeasures to partially satisfy privacy requirements
• Enforce database privacy via anonymization and de-identification
• Prevent data losses and breaches
• Address privacy issues related to cloud computing and IoT
• Establish effective information privacy management, from governance and culture to audits and impact assessment
• Respond to key privacy rules including GDPR, U.S. federal law, and the California Consumer Privacy Act
This guide will be an indispensable resource for anyone with privacy responsibilities in any organization, and for all students studying the privacy aspects of cybersecurity.

SQL Server Big Data Clusters: Early First Edition Based on Release Candidate 1

2019-11-26 O'Reilly Amazon

book

Benjamin Weissman, Enrico van de Laar

data data-engineering SQL AI/ML Analytics BI

Get a head start on learning one of SQL Server 2019’s latest and most impactful features, Big Data Clusters, which combines large volumes of non-relational data for analysis along with data stored relationally inside a SQL Server database. This book provides a first look at Big Data Clusters based upon SQL Server 2019 Release Candidate 1. Start now and get a jump on your competition in learning this important new feature. Big Data Clusters is a feature set covering data virtualization, distributed computing, and relational databases, and it provides a complete AI platform across the entire cluster environment. This book shows you how to deploy, manage, and use Big Data Clusters. For example, you will learn how to combine data stored on the HDFS file system together with data stored inside the SQL Server instances that make up the Big Data Cluster. Filled with clear examples and use cases, this book provides everything necessary to get started working with Big Data Clusters in SQL Server 2019 using Release Candidate 1. You will learn about the architectural foundations that are made up from Kubernetes, Spark, HDFS, and SQL Server on Linux. You then are shown how to configure and deploy Big Data Clusters in on-premises environments or in the cloud. Next, you are taught about querying. You will learn to write queries in Transact-SQL, taking advantage of skills you have honed for years, and with those queries you will be able to examine and analyze data from a wide variety of sources such as Apache Spark. Through the theoretical foundation provided in this book and easy-to-follow example scripts and notebooks, you will be ready to use and unveil the full potential of SQL Server 2019: combining different types of data spread across widely disparate sources into a single view that is useful for business intelligence and machine learning analysis.
What You Will Learn:
• Install, manage, and troubleshoot Big Data Clusters in cloud or on-premises environments
• Analyze large volumes of data directly from SQL Server and/or Apache Spark
• Manage data stored in HDFS from SQL Server as if it were relational data
• Implement advanced analytics solutions through machine learning and AI
• Expose different data sources as a single logical source using data virtualization
Who This Book Is For: Data engineers, data scientists, data architects, and database administrators who want to employ data virtualization and big data analytics in their environment

SAP Landscape Management 3.0 and IBM Power Systems Servers

2019-11-07 O'Reilly Amazon

book

Arnold Beilmann, Edmund Haefele

data data-engineering SAP Cloud Computing IBM Linux

This IBM® Redpaper publication is part of a series of technical documentation to help enable SAP on Linux for IBM Power Systems servers and IBM System Storage™ servers. This book describes how, by using SAP Landscape Management (SAP LaMa) 3.0 software, clients gain full visibility and control over their SAP and non-SAP systems, including the underlying physical, virtual, and cloud infrastructures. With SAP LaMa, you can automate repetitive tasks to manage critical applications across complex, hybrid IT landscapes. This publication helps you to better control IT costs and increase business agility, for example, by freeing staff to focus on more strategic work rather than manual, error-prone tasks. The target audiences of this book are architects, IT specialists, and systems administrators deploying SAP LaMa 3.0 who often spend much time and effort managing and provisioning SAP software systems and landscapes.

Google BigQuery: The Definitive Guide

2019-10-25 O'Reilly Amazon

book

Jordan Tigani, Valliappa Lakshmanan

data data-engineering google-bigquery Agile/Scrum BigQuery Cloud Computing

Work with petabyte-scale datasets while building a collaborative, agile workplace in the process. This practical book is the canonical reference to Google BigQuery, the query engine that lets you conduct interactive analysis of large datasets. BigQuery enables enterprises to efficiently store, query, ingest, and learn from their data in a convenient framework. With this book, you’ll examine how to analyze data at scale to derive insights from large datasets efficiently. Valliappa Lakshmanan, tech lead for Google Cloud Platform, and Jordan Tigani, engineering director for the BigQuery team, provide best practices for modern data warehousing within an autoscaled, serverless public cloud. Whether you want to explore parts of BigQuery you’re not familiar with or prefer to focus on specific tasks, this reference is indispensable.

IBM z15 Technical Introduction

2019-10-09 O'Reilly Amazon

book

Frank Packheiser, Jannie Houlbjerg, Kazuhiro Nakajima, John Troy, Bill White, Paul Schouten, Octavian Lascu, Anna Shugol, Hervey Kamga, Bo XU

data data-engineering IBM Agile/Scrum Analytics Cloud Computing

This IBM® Redbooks® publication introduces the latest member of the IBM Z® platform, the IBM z15™ (machine type 8561). It includes information about the Z environment and how it helps integrate data and transactions more securely. It also provides insight for faster and more accurate business decisions. The z15 is a state-of-the-art data and transaction system that delivers advanced capabilities, which are vital to any digital transformation. The z15 is designed for enhanced modularity in an industry-standard footprint. The z15 system excels at the following tasks:
• Using multicloud integration services
• Securing data with pervasive encryption
• Providing resilience with key to zero downtime
• Transforming a transactional platform into a data powerhouse
• Getting more out of the platform with IT Operational Analytics
• Accelerating digital transformation with agile service delivery
• Revolutionizing business processes
• Blending open source and Z technologies
This book explains how this system uses new innovations and traditional Z strengths to satisfy growing demand for cloud, analytics, and open source technologies. With the z15 as the base, applications can run in a trusted, reliable, and secure environment that improves operations and lessens business risk.

IBM Storage for Red Hat OpenShift Blueprint Version 1 Release 1

2019-10-07 O'Reilly Amazon

book

IBM

data data-engineering IBM Cloud Computing DevOps Cyber Security

IBM Storage for Red Hat OpenShift is a comprehensive container-ready solution that includes all the hardware and software components necessary to set up and/or expand your Red Hat OpenShift environment. This blueprint includes Red Hat OpenShift Container Platform and uses Container Storage Interface (CSI) standards. IBM Storage brings enterprise data services to containers. In this blueprint, learn how to:
· Combine the benefits of IBM Systems with the performance of IBM Storage solutions so that you can deliver the right services to your clients today!
· Build a 24 by 7 by 365 enterprise-class private cloud with Red Hat OpenShift Container Platform utilizing new open source Container Storage Interface (CSI) drivers
· Leverage enterprise-class services such as NVMe-based flash performance, high data availability, and advanced container security
IBM Storage for Red Hat OpenShift Container Platform is designed for your DevOps environment for on-premises deployment with easy-to-consume components built to perform and scale for your enterprise. Simplify your journey to cloud with pre-tested and validated blueprints engineered to enable rapid deployment and peace of mind as you move to a hybrid multicloud environment. You now have the capabilities.

IBM Spectrum Discover: Metadata Management for Deep Insight of Unstructured Storage

2019-10-01 O'Reilly Amazon

book

Mathias Defiebre, Isom Crawford Jr., Larry Coyne, Joseph Dain, Norman Bogard

data data-engineering IBM Analytics Cloud Computing

This IBM® Redpaper publication provides a comprehensive overview of the IBM Spectrum® Discover metadata management software platform. We give a detailed explanation of how the product creates, collects, and analyzes metadata. Several in-depth use cases show examples of analytics, governance, and optimization. We also provide step-by-step information to install and set up the IBM Spectrum Discover trial environment. More than 80% of all data that is collected by organizations is not in a standard relational database. Instead, it is trapped in unstructured documents, social media posts, machine logs, and so on. Many organizations face significant challenges in managing this deluge of unstructured data, such as:
• Pinpointing and activating relevant data for large-scale analytics
• Lacking the fine-grained visibility that is needed to map data to business priorities
• Removing redundant, obsolete, and trivial (ROT) data
• Identifying and classifying sensitive data
IBM Spectrum Discover is modern metadata management software that provides data insight for petabyte-scale file and object storage, both on premises and in the cloud. This software enables organizations to make better business decisions and gain and maintain a competitive advantage. IBM Spectrum Discover provides a rich metadata layer that enables storage administrators, data stewards, and data scientists to efficiently manage, classify, and gain insights from massive amounts of unstructured data. It improves storage economics, helps mitigate risk, and accelerates large-scale analytics to create competitive advantage and speed critical research.

IBM Storage for Red Hat OpenShift Container Platform V3.11 Blueprint Version 1 Release 1

2019-09-29 O'Reilly Amazon

book

IBM

data data-engineering IBM Cloud Computing DevOps Cyber Security

IBM Storage for Red Hat OpenShift Container Platform is a comprehensive container-ready solution that includes all the hardware and software components necessary to set up and/or expand your Red Hat OpenShift Container Platform V3.11 environment. IBM Storage brings enterprise data services to containers. In this blueprint, learn how to:
• Combine the benefits of IBM Systems with the performance of IBM Storage solutions so that you can deliver the right services to your clients today!
• Build a 24 by 7 by 365 enterprise-class private cloud with Red Hat OpenShift Container Platform
• Leverage enterprise-class services such as NVMe-based flash performance, high data availability, and advanced container security
IBM Storage for Red Hat OpenShift Container Platform is designed for your DevOps environment for on-premises deployment with easy-to-consume components built to perform and scale for your enterprise. Simplify your journey to cloud with pre-tested and validated blueprints engineered to enable rapid deployment and peace of mind as you move to a hybrid multicloud environment. You now have the capabilities.

IBM Reference Architecture for High Performance Data and AI in Healthcare and Life Sciences

2019-09-08 O'Reilly Amazon

book

Dino Quintero, Frank N. Lee

data data-engineering IBM AI/ML Analytics Big Data

This IBM® Redpaper publication provides an update to the original description of IBM Reference Architecture for Genomics. This paper expands the reference architecture to cover all of the major vertical areas of healthcare and life sciences industries, such as genomics, imaging, and clinical and translational research. The architecture was renamed IBM Reference Architecture for High Performance Data and AI in Healthcare and Life Sciences to reflect the fact that it incorporates key building blocks for high-performance computing (HPC) and software-defined storage, and that it supports an expanding infrastructure of leading industry partners, platforms, and frameworks. The reference architecture defines a highly flexible, scalable, and cost-effective platform for accessing, managing, storing, sharing, integrating, and analyzing big data, which can be deployed on-premises, in the cloud, or as a hybrid of the two. IT organizations can use the reference architecture as a high-level guide for overcoming data management challenges and processing bottlenecks that are frequently encountered in personalized healthcare initiatives, and in compute-intensive and data-intensive biomedical workloads. This reference architecture also provides a framework and context for modern healthcare and life sciences institutions to adopt cutting-edge technologies, such as cognitive life sciences solutions, machine learning and deep learning, Spark for analytics, and cloud computing. To illustrate these points, this paper includes case studies describing how clients and IBM Business Partners alike used the reference architecture in the deployments of demanding infrastructures for precision medicine. This publication targets technical professionals (consultants, technical support staff, IT Architects, and IT Specialists) who are responsible for providing life sciences solutions and support.

Simplify Management of IT Security and Compliance with IBM PowerSC in Cloud and Virtualized Environments

2019-09-07 O'Reilly Amazon

book

Dino Quintero, Faraz Ahmad, David Pontes, Cesar Rodriguez, Stephen Dominguez

data data-engineering IBM Cloud Computing Cyber Security

This IBM® Redbooks® publication provides a security and compliance solution that is optimized for virtualized environments on IBM Power Systems™ servers, running IBM PowerVM® and IBM AIX®. Security control and compliance are some of the key components that are needed to defend the virtualized data center and cloud infrastructure against ever evolving new threats. The IBM business-driven approach to enterprise security that is used with solutions, such as IBM PowerSC™, makes IBM the premier security vendor in the market today. The book explores, tests, and documents scenarios using IBM PowerSC that leverage IBM Power Systems servers architecture and software solutions from IBM to help defend the virtualized data center and cloud infrastructure against ever evolving new threats. This publication helps IT and Security managers, architects, and consultants to strengthen their security and compliance posture in a virtualized environment running IBM PowerVM.

Securing Your Cloud: IBM Security for LinuxONE

2019-08-01 O'Reilly Amazon

book

Klaus Egeler, Felipe Cardeneti Mendes, Maciej Olejniczak, Karen Medhat Fahmy, Lydia Parziale, Edi Lopes Alves

data data-engineering IBM Cloud Computing Linux Cyber Security

As workloads are being offloaded to IBM® LinuxONE based cloud environments, it is important to ensure that these workloads and environments are secure. This IBM Redbooks® publication describes the necessary steps to secure your environment from the hardware level through all of the components that are involved in a LinuxONE cloud infrastructure that use Linux and IBM z/VM®. The audience for this book is IT architects, IT Specialists, and those users who plan to use LinuxONE for their cloud environments.

Deploying a Database Instance in an IBM Cloud Private Cluster on IBM Z

2019-07-29 O'Reilly Amazon

book

Christian May

data data-engineering IBM Cloud Computing Docker Kubernetes

This IBM® Redpaper™ publication shows you how to deploy a database instance within a container using an IBM Cloud™ Private cluster on IBM Z®. A preinstalled IBM Spectrum™ Scale 5.0.3 cluster file system provides back-end storage for the persistent volumes bound to the database. A container is a standard unit of software that packages code and all its dependencies, so the application runs quickly and reliably from one computing environment to another. By default, containers are ephemeral. However, stateful applications, such as databases, require some type of persistent storage that can survive service restarts or container crashes. IBM provides several products that help organizations build an environment on an IBM Z infrastructure to develop and manage containerized applications, including dynamic provisioning of persistent volumes. As an example of a stateful application, this paper describes how to deploy the relational database MariaDB using a Helm chart. This document provides step-by-step guidance on how to install and configure the following components: IBM Cloud Private 3.1.2 (including Kubernetes), Docker 18.03.1-ce, and IBM Storage Enabler for Containers 2.0.0 and 2.1.0. This Redpaper demonstrates how we set up the example for a stateful application in our lab. The paper gives you insights about planning for your implementation. IBM Z server hardware, the IBM Z hypervisor z/VM®, and the IBM Spectrum Scale cluster file system are prerequisites to set up the example environment. The Redpaper is written with the assumption that you have familiarity with and basic knowledge of the software products used in setting up the environment. The intended audience includes the following roles: storage administrators, IT/cloud administrators, technologists, and IT specialists.

Operationalizing the Data Lake

2019-07-25 O'Reilly Amazon

book

Jon King, Holden Ackerman

data data-engineering storage-repositories data-lake Analytics Big Data

Big data and advanced analytics have increasingly moved to the cloud as organizations pursue actionable insights and data-driven products using the growing amounts of information they collect. But few companies have truly operationalized data so it’s usable for the entire organization. With this pragmatic ebook, engineers, architects, and data managers will learn how to build and extract value from a data lake in the cloud and leverage the compute power and scalability of a cloud-native data platform to put their company’s vast data trove into action. Holden Ackerman and Jon King of Qubole take you through the basics of building a data lake operation, from people to technology, employing multiple technologies and frameworks in a cloud-native data platform. You’ll dive into the tools and processes you need for the entire lifecycle of a data lake, from data preparation, storage, and management to distributed computing and analytics. You’ll also explore the unique role that each member of your data team needs to play as you migrate to your cloud-native data platform. Leverage your data effectively through a single source of truth. Understand the importance of building a self-service culture for your data lake. Define the structure you need to build a data lake in the cloud. Implement financial governance and data security policies for your data lake through a cloud-native data platform. Identify the tools you need to manage your data infrastructure. Delineate the scope, usage rights, and best tools for each team working with a data lake: analysts, data scientists, data engineers, and security professionals, among others.

Rebuilding Reliable Data Pipelines Through Modern Tools

2019-07-25 O'Reilly Amazon

book

Ted Malaska

data data-engineering AI/ML Big Data Cloud Computing DataOps

When data-driven applications fail, identifying the cause is both challenging and time-consuming, especially as data pipelines become more and more complex. Hunting for the root cause of application failure from messy, raw, and distributed logs is difficult for performance experts and a nightmare for data operations teams. This report examines DataOps processes and tools that enable you to manage modern data pipelines efficiently. Author Ted Malaska describes a data operations framework and shows you the importance of testing and monitoring to plan, rebuild, automate, and then manage robust data pipelines, whether in the cloud, on premises, or in a hybrid configuration. You’ll also learn ways to apply performance monitoring software and AI to your data pipelines in order to keep your applications running reliably. You’ll learn how performance management software can reduce the risk of running modern data applications; methods for applying AI to provide insights, recommendations, and automation to operationalize big data systems and data applications; and how to plan, migrate, and operate big data workloads and data pipelines in the cloud and in hybrid deployment models.

Professional Azure SQL Database Administration - Second Edition

2019-07-19 O'Reilly Amazon

book

Ahmad Osama

data data-engineering relational-databases azure-sql-database Azure Cloud Computing

Professional Azure SQL Database Administration serves as your comprehensive guide to mastering the management and optimization of cloud-based Azure SQL Database solutions. Covering the differences between Azure SQL Database and on-premises SQL Server, along with the features unique to Azure SQL Database, this book offers a clear roadmap to efficiently migrate, secure, scale, and maintain these databases in the cloud. What this book will help you do: Understand the differences between Azure SQL Database and on-premises SQL Server and their practical implications. Learn techniques to migrate existing SQL Server databases to Azure SQL Database seamlessly. Discover advanced ways to optimize database performance and scalability leveraging cloud capabilities. Master security strategies for Azure SQL databases, including backup, disaster recovery, and automated tasks. Develop proficiency in using tools such as PowerShell to automate and manage routine database administration tasks. Author(s): Ahmad Osama is an experienced database professional and author specializing in SQL Server and Azure SQL Database administration. With a robust background in database migration, maintenance, and performance tuning, Ahmad expertly bridges the gap between theory and practice. His approachable writing style makes complex database topics accessible to professionals seeking to expand their expertise. Who is it for? Professional Azure SQL Database Administration is an essential resource for database administrators, developers, and IT professionals keen on developing their knowledge of Azure SQL Database administration and cloud database solutions. Whether you're transitioning from traditional SQL Server environments or looking to optimize your database strategies in the cloud, this book caters to professionals with intermediate to advanced experience in database management and programming with SQL.

IBM Spectrum Scale: Big Data and Analytics Solution Brief

2019-07-17 O'Reilly Amazon

book

Wei G. Gong, Sandeep R Patil

data data-engineering IBM Analytics Big Data Cloud Computing

This IBM® Redguide™ publication describes big data and analytics deployments that are built on IBM Spectrum Scale™. IBM Spectrum Scale is a proven enterprise-level distributed file system that is a high-performance and cost-effective alternative to Hadoop Distributed File System (HDFS) for Hadoop analytics services. IBM Spectrum Scale includes NFS, SMB, and Object services and meets the performance requirements of many industry workloads, such as technical computing, big data, analytics, and content management. IBM Spectrum Scale provides world-class, web-based storage management with extreme scalability, flash-accelerated performance, and automatic policy-based storage tiering from flash through disk to the cloud, which reduces storage costs by up to 90% while improving security and management efficiency in cloud, big data, and analytics environments. This Redguide publication is intended for technical professionals (analytics consultants, technical support staff, IT Architects, and IT Specialists) who are responsible for providing Hadoop analytics services and are interested in learning about the benefits of using IBM Spectrum Scale as an alternative to HDFS.

IBM FlashSystem 900 Model AE3 Product Guide

2019-07-09 O'Reilly Amazon

book

Eike Schenk, Jon Herd, Detlef Helmbrecht

data data-engineering IBM Analytics Cloud Computing

Today's global organizations depend on the ability to unlock business insights from massive volumes of data. Now, with IBM® FlashSystem 900 Model AE3, they can make faster decisions based on real-time insights and unleash the power of demanding applications, including online transaction processing (OLTP) and analytical databases, virtual desktop infrastructures (VDIs), technical computing applications, and cloud environments. Easy to deploy and manage, IBM FlashSystem® 900 Model AE3 is designed to accelerate the applications that drive your business. Powered by IBM FlashCore® Technology, IBM FlashSystem Model AE3 provides the following characteristics: Accelerate business-critical workloads, real-time analytics, and cognitive applications with the consistent microsecond latency and extreme reliability of IBM FlashCore technology. Improve performance and help lower cost with new inline data compression. Help reduce capital and operational expenses with IBM enhanced 3D triple-level cell (3D TLC) flash. Protect critical data assets with patented IBM Variable Stripe RAID™. Power faster insights with IBM FlashCore, including hardware-accelerated nonvolatile memory (NVM) architecture, purpose-engineered IBM MicroLatency® modules, and advanced flash management. FlashSystem 900 Model AE3 can be configured in capacity points from 14.4 TB to 180 TB usable and up to 360 TB effective capacity after RAID 5 protection and compression. You can couple this product with 16 Gbps or 8 Gbps Fibre Channel, 16 Gbps NVMe over Fibre Channel, or 40 Gbps InfiniBand connectivity. Thus, the IBM FlashSystem 900 Model AE3 provides extreme performance to existing and next-generation infrastructure.
