O'Reilly Data Engineering Books

Logging in Action

2022-04-10 O'Reilly Amazon

book

Phil Wilkins

data data-engineering search elasticsearch elastic-stack-elk-stack elastic stack (elk stack)

Make log processing a real asset to your organization with powerful and free open source tools. In Logging in Action you will learn how to: Deploy Fluentd and Fluent Bit into traditional on-premises, IoT, hybrid, cloud, and multi-cloud environments, both small and hyperscaled Configure Fluentd and Fluent Bit to solve common log management problems Use Fluentd within Kubernetes and Docker services Connect a custom log source or destination with Fluentd’s extensible plugin framework Logging best practices and common pitfalls Logging in Action is a guide to optimize and organize logging using the CNCF Fluentd and Fluent Bit projects. You’ll use the powerful log management tool Fluentd to solve common log management, and learn how proper log management can improve performance and make management of software and infrastructure solutions easier. Through useful examples like sending log-driven events to Slack, you’ll get hands-on experience applying structure to your unstructured data. About the Technology Don’t fly blind! An effective logging system can help you see and correct problems before they cripple your software. With the Fluentd log management tool, it’s a snap to monitor the behavior and health of your software and infrastructure in real time. Designed to collect and process log data from multiple sources using the industry-standard JSON format, Fluentd delivers a truly unified logging layer across all your systems. About the Book Logging in Action teaches you to record and analyze application and infrastructure data using Fluentd. Using clear, relevant examples, it shows you exactly how to transform raw system data into a unified stream of actionable information. You’ll discover how logging configuration impacts the way your system functions and set up Fluentd to handle data from legacy IT environments, local data centers, and massive Kubernetes-driven distributed systems. You’ll even learn how to implement complex log parsing with RegEx and output events to MongoDB and Slack. What's Inside Capture log events from a wide range of systems and software, including Kubernetes and Docker Connect to custom log sources and destinations Employ Fluentd’s extensible plugin framework Create a custom plugin for niche problems About the Reader For developers, architects, and operations professionals familiar with the basics of monitoring and logging. About the Author Phil Wilkins has spent over 30 years in the software industry. Has worked for small startups through to international brands. Quotes I highly recommend using Logging in Action as a getting-started guide, a refresher, or as a way to optimize your logging journey. - From the Foreword by Anurag Gupta, Fluent maintainer and Cofounder, Calyptia Covers everything you need if you want to implement a logging system using open source technology such as Fluentd and Kubernetes. - Alex Saez, Naranja X A great exploration of the features and capabilities of Fluentd, along with very useful hands-on exercises. - George Thomas, Manhattan Associates A practical holistic guide to integrating logging into your enterprise architecture. - Satej Sahu, Honeywell

PostgreSQL 14 Administration Cookbook

2022-03-31 O'Reilly Amazon

book

Simon Riggs , Gianni Ciolli

data data-engineering relational-databases postgresql Cloud Computing Cyber Security

PostgreSQL 14 Administration Cookbook provides a hands-on guide to mastering the administration of PostgreSQL 14. With over 175 recipes, this book equips you with practical techniques to manage, secure, and optimize your PostgreSQL databases, ensuring they are robust and high-performing. What this Book will help me do Master managing PostgreSQL databases both on-premises and in the cloud efficiently. Implement effective backup and recovery strategies to secure your data. Leverage the latest features of PostgreSQL 14 to enhance your database workflows. Understand and apply best practices for maintaining high availability and performance. Troubleshoot real-world challenges with guided solutions and expert insights. Author(s) Simon Riggs and Gianni Ciolli are seasoned database experts with years of experience working with PostgreSQL. Simon is a PostgreSQL core team member, contributing his technical knowledge towards building robust database solutions, while Gianni brings a wealth of expertise in database administration and support. Together, they share a passion for making complex database concepts accessible and actionable. Who is it for? This book is for database administrators, data architects, and developers who manage PostgreSQL databases and are looking to deepen their knowledge. It is suitable for professionals with some experience in PostgreSQL who aim to maximize their database's performance and security, as well as for those new to the system seeking a comprehensive start. Readers with an interest in practical, problem-solving approaches to database management will greatly benefit from this cookbook.

IBM Power Systems Virtual Server Guide for IBM i

2022-03-30 O'Reilly Amazon

book

Dino Quintero , Sanjeev Chhabra , Sergio Leyva , Ahmad Y Hussein , Marcelos Avalos , Luis Eduardo Silva Viera , Gabriel Padilla Jimenez , Diego Kesselman , Bogdan Savu , Adriano Almeida , Luis Ferreira , Travis Siegfried , Michael Easlon , Jose Martin Abeleira , Deepak C Shetty

data data-engineering IBM ibm-power-systems Data Management

This IBM® Redbooks® publication delivers a how-to usage content perspective that describes deployment, networking, and data management tasks on the IBM Power Systems Virtual Server by using sample scenarios. During the content development, the team used available documentation, IBM Power Systems Virtual Server environment, and other software and hardware resources to document the following information: IBM Power Systems Virtual Server networking and data management deployment scenarios Migrations use case scenarios Backups case scenarios Disaster recovery case scenarios This book addresses topics for IT architects, IT specialists, developers, sellers, and anyone who wants to implement and manage workloads in the IBM Power Systems Virtual Server. This publication also describes transferring the how-to-skills to the technical teams, and solution guidance to the sales team. This book compliments the documentation that available at the IBM Documentation web page and aligns with the educational materials that are provided by IBM Garage for Systems Technical Education.

Grokking Streaming Systems

2022-03-27 O'Reilly Amazon

book

Ning Wang , Josh Fischer

data data-engineering streaming-messaging streaming-architecture IoT Java

A friendly, framework-agnostic tutorial that will help you grok how streaming systems work—and how to build your own! In Grokking Streaming Systems you will learn how to: Implement and troubleshoot streaming systems Design streaming systems for complex functionalities Assess parallelization requirements Spot networking bottlenecks and resolve back pressure Group data for high-performance systems Handle delayed events in real-time systems Grokking Streaming Systems is a simple guide to the complex concepts behind streaming systems. This friendly and framework-agnostic tutorial teaches you how to handle real-time events, and even design and build your own streaming job that’s a perfect fit for your needs. Each new idea is carefully explained with diagrams, clear examples, and fun dialogue between perplexed personalities! About the Technology Streaming systems minimize the time between receiving and processing event data, so they can deliver responses in real time. For applications in finance, security, and IoT where milliseconds matter, streaming systems are a requirement. And streaming is hot! Skills on platforms like Spark, Heron, and Kafka are in high demand. About the Book Grokking Streaming Systems introduces real-time event streaming applications in clear, reader-friendly language. This engaging book illuminates core concepts like data parallelization, event windows, and backpressure without getting bogged down in framework-specific details. As you go, you’ll build your own simple streaming tool from the ground up to make sure all the ideas and techniques stick. The helpful and entertaining illustrations make streaming systems come alive as you tackle relevant examples like real-time credit card fraud detection and monitoring IoT services. What's Inside Implement and troubleshoot streaming systems Design streaming systems for complex functionalities Spot networking bottlenecks and resolve backpressure Group data for high-performance systems About the Reader No prior experience with streaming systems is assumed. Examples in Java. About the Authors Josh Fischer and Ning Wang are Apache Committers, and part of the committee for the Apache Heron distributed stream processing engine. Quotes Very well-written and enjoyable. I recommend this book to all software engineers working on data processing. - Apoorv Gupta, Facebook Finally, a much-needed introduction to streaming systems—a must-read for anyone interested in this technology. - Anupam Sengupta, Red Hat Tackles complex topics in a very approachable manner. - Marc Roulleau, GIRO A superb resource for helping you grasp the fundamentals of open-source streaming systems. - Simon Verhoeven, Cronos Explains all the main streaming concepts in a friendly way. Start with this one! - Cicero Zandona, Calypso Technologies

IBM FlashSystem Safeguarded Copy Implementation Guide

2022-03-25 O'Reilly Amazon

book

Vasfi Gucer , Hemanand Gadgil , Andrew Greenfield , Jackson Shea

data data-engineering IBM

Safeguarded Copy function that is available with IBM® Spectrum Virtualize Version 8.4.2 supports the ability to create cyber-resilient point-in-time copies of volumes that cannot be changed or deleted through user errors, malicious actions, or ransomware attacks. The system integrates with IBM Copy Services Manager to provide automated backup copies and data recovery. This IBM Redpaper® publication introduces the features and functions of Safeguarded Copy function by using several examples. This document is aimed at pre-sales and post-sales technical support specialists and storage administrators.

Simplify Big Data Analytics with Amazon EMR

2022-03-25 O'Reilly Amazon

book

Sakti Mishra

data data-engineering apache-spark Analytics AWS Amazon EMR

Simplify Big Data Analytics with Amazon EMR is a thorough guide to harnessing Amazon's EMR service for big data processing and analytics. From distributed computation pipelines to real-time streaming analytics, this book provides hands-on knowledge and actionable steps for implementing data solutions efficiently. What this Book will help me do Understand the architecture and key components of Amazon EMR and how to deploy it effectively. Learn to configure and manage distributed data processing pipelines using Amazon EMR. Implement security and data governance best practices within the Amazon EMR ecosystem. Master batch ETL and real-time analytics techniques using technologies like Apache Spark. Apply optimization and cost-saving strategies to scalable data solutions. Author(s) Sakti Mishra is a seasoned data professional with extensive expertise in deploying scalable analytics solutions on cloud platforms like AWS. With a background in big data technologies and a passion for teaching, Sakti ensures practical insights accompany every concept. Readers will find his approach thorough, hands-on, and highly informative. Who is it for? This book is perfect for data engineers, data scientists, and other professionals looking to leverage Amazon EMR for scalable analytics. If you are familiar with Python, Scala, or Java and have some exposure to Hadoop or AWS ecosystems, this book will empower you to design and implement robust data pipelines efficiently.

Getting Started with Elastic Stack 8.0

2022-03-23 O'Reilly Amazon

book

Asjad Athick

data data-engineering search elasticsearch elastic-stack-elk-stack elastic stack (elk stack)

Discover how to harness the power of the Elastic Stack 8.0 to manage, analyze, and secure complex data environments. You will learn to combine components such as Elasticsearch, Kibana, Logstash, and more to build scalable and effective solutions for your organization. By focusing on hands-on implementations, this book ensures you can apply your knowledge to real-world use cases. What this Book will help me do Set up and manage Elasticsearch clusters tailored to various architecture scenarios. Utilize Logstash and Elastic Agent to ingest and process diverse data sources efficiently. Create interactive dashboards and data models in Kibana, enabling business intelligence insights. Implement secure and effective search infrastructures for enterprise applications. Deploy Elastic SIEM to fortify your organization's security against modern cybersecurity threats. Author(s) Asjad Athick is a seasoned technologist and author with expertise in developing scalable data solutions. With years of experience working with the Elastic Stack, Asjad brings a pragmatic approach to teaching complex architectures. His dedication to explaining technical concepts in an accessible manner makes this book a valuable resource for learners. Who is it for? This book is ideal for developers seeking practical knowledge in search, observability, and security solutions using Elastic Stack. Solutions architects who aim to design scalable data platforms will also benefit greatly. Even tech leads or managers keen to understand the Elastic Stack's impact on their operations will find the insights valuable. No prior experience with Elastic Stack is needed.

Modern Data Engineering with Apache Spark: A Hands-On Guide for Building Mission-Critical Streaming Applications

2022-03-22 O'Reilly Amazon

book

Scott Haines

data data-engineering apache-spark AI/ML Airflow Data Contracts

Leverage Apache Spark within a modern data engineering ecosystem. This hands-on guide will teach you how to write fully functional applications, follow industry best practices, and learn the rationale behind these decisions. With Apache Spark as the foundation, you will follow a step-by-step journey beginning with the basics of data ingestion, processing, and transformation, and ending up with an entire local data platform running Apache Spark, Apache Zeppelin, Apache Kafka, Redis, MySQL, Minio (S3), and Apache Airflow. Apache Spark applications solve a wide range of data problems from traditional data loading and processing to rich SQL-based analysis as well as complex machine learning workloads and even near real-time processing of streaming data. Spark fits well as a central foundation for any data engineering workload. This book will teach you to write interactive Spark applications using Apache Zeppelin notebooks, write and compilereusable applications and modules, and fully test both batch and streaming. You will also learn to containerize your applications using Docker and run and deploy your Spark applications using a variety of tools such as Apache Airflow, Docker and Kubernetes. Reading this book will empower you to take advantage of Apache Spark to optimize your data pipelines and teach you to craft modular and testable Spark applications. You will create and deploy mission-critical streaming spark applications in a low-stress environment that paves the way for your own path to production. What You Will Learn Simplify data transformation with Spark Pipelines and Spark SQL Bridge data engineering with machine learning Architect modular data pipeline applications Build reusable application components and libraries Containerize your Spark applications for consistency and reliability Use Docker and Kubernetes to deploy your Spark applications Speed up application experimentation using Apache Zeppelin and Docker Understand serializable structured data and data contracts Harness effective strategies for optimizing data in your data lakes Build end-to-end Spark structured streaming applications using Redis and Apache Kafka Embrace testing for your batch and streaming applications Deploy and monitor your Spark applications Who This Book Is For Professional software engineers who want to take their current skills and apply them to new and exciting opportunities within the data ecosystem, practicing data engineers who are looking for a guiding light while traversing the many challenges of moving from batch to streaming modes, data architects who wish to provide clear and concise direction for how best to harness anduse Apache Spark within their organization, and those interested in the ins and outs of becoming a modern data engineer in today's fast-paced and data-hungry world

IBM TS4500 R8 Tape Library Guide

2022-03-21 O'Reilly Amazon

book

Jesus Eduardo Cervantes Rolon , Ole Asmussen , Larry Coyne , Albrecht Friess , Robert Beiderbeck , Khanh Ngo , Fabian Corona Villarreal , Hans-Günther Hörhammer

data data-engineering IBM ELK Cyber Security

The IBM® TS4500 (TS4500) tape library is a next-generation tape solution that offers higher storage density and better integrated management than previous solutions. This IBM Redbooks® publication gives you a close-up view of the new IBM TS4500 tape library. In the TS4500, IBM delivers the density that today's and tomorrow's data growth requires. It has the cost-effectiveness and the manageability to grow with business data needs, while you preserve investments in IBM tape library products. Now, you can achieve a low per-terabyte cost and high density, with up to 13 PB of data (up to 39 PB compressed) in a single 10 square-foot library by using LTO Ultrium 9 cartridges or 11 PB with 3592 cartridges. The TS4500 offers the following benefits: Support of the IBM Linear Tape-Open (LTO) Ultrium 9 tape drive: Store up to 1.04 EB 2.5:1 compressed per library with IBM LTO 9 cartridges. High availability: Dual active accessors with integrated service bays reduce inactive service space by 40%. The Elastic Capacity option can be used to eliminate inactive service space. Flexibility to grow: The TS4500 library can grow from the right side and the left side of the first L frame because models can be placed in any active position. Increased capacity: The TS4500 can grow from a single L frame up to another 17 expansion frames with a capacity of over 23,000 cartridges. High-density (HD) generation 1 frames from the TS3500 library can be redeployed in a TS4500. Capacity on demand (CoD): CoD is supported through entry-level, intermediate, and base-capacity configurations. Advanced Library Management System (ALMS): ALMS supports dynamic storage management, which enables users to create and change logical libraries and configure any drive for any logical library. Support for IBM TS1160 while also supporting TS1155, TS1150, and TS1140 tape drive. The TS1160 gives organizations an easy way to deliver fast access to data, improve security, and provide long-term retention, all at a lower cost than disk solutions. The TS1160 offers high-performance, flexible data storage with support for data encryption. Also, this enhanced fifth-generation drive can help protect investments in tape automation by offering compatibility with existing automation. Store up to 1.05 EB 3:1 compressed per library with IBM 3592 cartridges Integrated TS7700 back-end Fibre Channel (FC) switches are available. Up to four library-managed encryption (LME) key paths per logical library are available. This book describes the TS4500 components, feature codes, specifications, supported tape drives, encryption, new integrated management console (IMC), command-line interface (CLI), and REST over SCSI (RoS) to obtain status information about library components. You learn how to accomplish the following tasks: Improve storage density with increased expansion frame capacity up to 2.4 times, and support 33% more tape drives per frame

Data Lakehouse in Action

2022-03-17 O'Reilly Amazon

book

Pradeep Menon

data data-engineering storage-repositories data-lake Analytics Azure

"Data Lakehouse in Action" provides a comprehensive exploration of the Data Lakehouse architecture, a modern solution for scalable and effective large-scale analytics. This book guides you through understanding the principles and components of the architecture, and its implementation using cloud platforms like Azure. Learn the practical techniques for designing robust systems tailored to organizational needs and maturity. What this Book will help me do Understand the evolution and need for modern data architecture patterns like Data Lakehouse. Learn how to design systems for data ingestion, storage, processing, and serving in a Data Lakehouse. Develop best practices for data governance and security in the Data Lakehouse architecture. Discover various analytics workflows enabled by the Data Lakehouse, including real-time and batch approaches. Implement practical Data Lakehouse patterns on a cloud platform, and integrate them with macro-patterns such as Data Mesh. Author(s) Pradeep Menon is a seasoned data architect and engineer with extensive experience implementing data analytics solutions for leading companies. With a penchant for simplifying complex architectures, Pradeep has authored several technical publications and frequently shares his expertise at industry conferences. His hands-on approach and passion for teaching shine through in his practical guides. Who is it for? This book is ideal for data professionals including architects, engineers, and data strategists eager to enhance their knowledge in modern analytics platforms. If you have a basic understanding of data architecture and are curious about implementing systems governed by the Data Lakehouse paradigm, this book is for you. It bridges foundational concepts with advanced practices, making it suitable for learners aiming to contribute effectively to their organization's analytics efforts.

IBM Spectrum Virtualize, IBM FlashSystem, and IBM SAN Volume Controller Security Feature Checklist

2022-03-16 O'Reilly Amazon

book

James Whitaker , Bill Scales , Barry Whyte

data data-engineering IBM Cloud Computing Cyber Security

IBM Spectrum® Virtualize based storage systems are secure storage platforms that implement various security-related features, in terms of system-level access controls and data-level security features. This document outlines the available security features and options of IBM Spectrum Virtualize based storage systems. It is not intended as a "how to" or best practice document. Instead, it is a checklist of features that can be reviewed by a user security team to aid in the definition of a policy to be followed when implementing IBM FlashSystem®, IBM SAN Volume Controller, and IBM Spectrum Virtualize for Public Cloud. The topics that are discussed in this paper can be broadly split into two categories: System security This type of security encompasses the first three lines of defense that prevent unauthorized access to the system, protect the logical configuration of the storage system, and restrict what actions users can perform. It also ensures visibility and reporting of system level events that can be used by a Security Information and Event Management (SIEM) solution, such as IBM QRadar®. Data security This type of security encompasses the fourth line of defense. It protects the data that is stored on the system against theft, loss, or attack. These data security features include, but are not limited to, encryption of data at rest (EDAR) or IBM Safeguarded Copy (SGC). This document is correct as of IBM Spectrum Virtualize version 8.5.0.

Data Analysis with Python and PySpark

2022-03-15 O'Reilly Amazon

book

Jonathan Rioux

data data-engineering apache-spark PySpark AI/ML Analytics

Think big about your data! PySpark brings the powerful Spark big data processing engine to the Python ecosystem, letting you seamlessly scale up your data tasks and create lightning-fast pipelines. In Data Analysis with Python and PySpark you will learn how to: Manage your data as it scales across multiple machines Scale up your data programs with full confidence Read and write data to and from a variety of sources and formats Deal with messy data with PySpark’s data manipulation functionality Discover new data sets and perform exploratory data analysis Build automated data pipelines that transform, summarize, and get insights from data Troubleshoot common PySpark errors Creating reliable long-running jobs Data Analysis with Python and PySpark is your guide to delivering successful Python-driven data projects. Packed with relevant examples and essential techniques, this practical book teaches you to build pipelines for reporting, machine learning, and other data-centric tasks. Quick exercises in every chapter help you practice what you’ve learned, and rapidly start implementing PySpark into your data systems. No previous knowledge of Spark is required. About the Technology The Spark data processing engine is an amazing analytics factory: raw data comes in, insight comes out. PySpark wraps Spark’s core engine with a Python-based API. It helps simplify Spark’s steep learning curve and makes this powerful tool available to anyone working in the Python data ecosystem. About the Book Data Analysis with Python and PySpark helps you solve the daily challenges of data science with PySpark. You’ll learn how to scale your processing capabilities across multiple machines while ingesting data from any source—whether that’s Hadoop clusters, cloud data storage, or local data files. Once you’ve covered the fundamentals, you’ll explore the full versatility of PySpark by building machine learning pipelines, and blending Python, pandas, and PySpark code. What's Inside Organizing your PySpark code Managing your data, no matter the size Scale up your data programs with full confidence Troubleshooting common data pipeline problems Creating reliable long-running jobs About the Reader Written for data scientists and data engineers comfortable with Python. About the Author As a ML director for a data-driven software company, Jonathan Rioux uses PySpark daily. He teaches the software to data scientists, engineers, and data-savvy business analysts. Quotes A clear and in-depth introduction for truly tackling big data with Python. - Gustavo Patino, Oakland University William Beaumont School of Medicine The perfect way to learn how to analyze and master huge datasets. - Gary Bake, Brambles Covers both basic and more advanced topics of PySpark, with a good balance between theory and hands-on. - Philippe Van Bergenl, P² Consulting For beginner to pro, a well-written book to help understand PySpark. - Raushan Kumar Jha, Microsoft

Multimedia Security, Volume 1

2022-03-15 O'Reilly Amazon

book

William Puech

data data-engineering data-security-privacy data security & privacy Cyber Security

Today, more than 80% of the data transmitted over networks and archived on our computers, tablets, cell phones or clouds is multimedia data - images, videos, audio, 3D data. The applications of this data range from video games to healthcare, and include computer-aided design, video surveillance and biometrics. It is becoming increasingly urgent to secure this data, not only during transmission and archiving, but also during its retrieval and use. Indeed, in today’s "all-digital" world, it is becoming ever-easier to copy data, view it unrightfully, steal it or falsify it. Multimedia Security 1 analyzes the issues of the authentication of multimedia data, code and the embedding of hidden data, both from the point of view of defense and attack. Regarding the embedding of hidden data, it also covers invisibility, color, tracing and 3D data, as well as the detection of hidden messages in an image by steganalysis.

Getting Started with CockroachDB

2022-03-11 O'Reilly Amazon

book

Kishen Das Kondabagilu Rajanna

data data-engineering relational-databases cockroachdb Cloud Computing Cyber Security

"Getting Started with CockroachDB" provides an in-depth introduction to CockroachDB, a modern, distributed SQL database designed for cloud-native applications. Through this guide, you'll learn how to deploy, manage, and optimize CockroachDB to build highly reliable, scalable database solutions tailored for demanding and distributed workloads. What this Book will help me do Understand the architecture and design principles of CockroachDB and its fault-tolerant model. Learn how to set up and manage CockroachDB clusters for high availability and automatic scaling. Discover the concepts of data distribution and geo-partitioning to achieve low-latency global interactions. Explore indexing mechanisms in CockroachDB to optimize query performance for fast data retrieval. Master operational strategies, security configuration, and troubleshooting techniques for database management. Author(s) Kishen Das Kondabagilu Rajanna is an experienced software developer and database expert with a deep interest in distributed architectures. With hands-on experience working with CockroachDB and other database technologies, Kishen is passionate about sharing actionable insights with readers. His approach focuses on equipping developers with practical skills to excel in building and managing scalable, efficient database services. Who is it for? This book is ideal for software developers, database administrators, and database engineers seeking to learn CockroachDB for building robust, scalable database systems. If you're new to CockroachDB but possess basic database knowledge, this guide will equip you with the practical skills to leverage CockroachDB's capabilities effectively.

IBM Spectrum Archive Enterprise Edition V1.3.2.2: Installation and Configuration Guide

2022-03-10 O'Reilly Amazon

book

Yasuhiro Yoshihara , Larry Coyne , Yuka Sasaki , Arnold Byron Lua , Hiroyuki Miyoshi , Khanh Ngo

data data-engineering IBM

This IBM® Redbooks® publication helps you with the planning, installation, and configuration of the new IBM Spectrum® Archive Enterprise Edition (EE) Version 1.3.2.2 for the IBM TS4500, IBM TS3500, IBM TS4300, and IBM TS3310 tape libraries. IBM Spectrum Archive Enterprise Edition enables the use of the LTFS for the policy management of tape as a storage tier in an IBM Spectrum Scale based environment. It also helps encourage the use of tape as a critical tier in the storage environment. This edition of this publication is the tenth edition of IBM Spectrum Archive Installation and Configuration Guide. IBM Spectrum Archive EE can run any application that is designed for disk files on a physical tape media. IBM Spectrum Archive EE supports the IBM Linear Tape-Open (LTO) Ultrium 9, 8, 7, 6, and 5 tape drives. and the IBM TS1160, TS1155, TS1150, and TS1140 tape drives. IBM Spectrum Archive EE can play a major role in reducing the cost of storage for data that does not need the access performance of primary disk. The use of IBM Spectrum Archive EE to replace disks with physical tape in tier 2 and tier 3 storage can improve data access over other storage solutions because it improves efficiency and streamlines management for files on tape. IBM Spectrum Archive EE simplifies the use of tape by making it transparent to the user and manageable by the administrator under a single infrastructure. This publication is intended for anyone who wants to understand more about IBM Spectrum Archive EE planning and implementation. This book is suitable for IBM customers, IBM Business Partners, IBM specialist sales representatives, and technical specialists.

Databases Illuminated, 4th Edition

2022-03-09 O'Reilly Amazon

book

Karen C. Davis , Catherine M. Ricardo , Susan D. Urban

data data-engineering relational-databases database-theory

Databases Illuminated, Fourth Edition is designed to help students integrate theoretical material with practical knowledge, using an approach that applies theory to practical database implementation.

Data Mesh

2022-03-09 O'Reilly Amazon

book

Zhamak Dehghani

data data-engineering database-architecture data-mesh AI/ML Analytics

We're at an inflection point in data, where our data management solutions no longer match the complexity of organizations, the proliferation of data sources, and the scope of our aspirations to get value from data with AI and analytics. In this practical book, author Zhamak Dehghani introduces data mesh, a decentralized sociotechnical paradigm drawn from modern distributed architecture that provides a new approach to sourcing, sharing, accessing, and managing analytical data at scale. Dehghani guides practitioners, architects, technical leaders, and decision makers on their journey from traditional big data architecture to a distributed and multidimensional approach to analytical data management. Data mesh treats data as a product, considers domains as a primary concern, applies platform thinking to create self-serve data infrastructure, and introduces a federated computational model of data governance. Get a complete introduction to data mesh principles and its constituents Design a data mesh architecture Guide a data mesh strategy and execution Navigate organizational design to a decentralized data ownership model Move beyond traditional data warehouses and lakes to a distributed data mesh

Cyber Resilient Infrastructure: Detect, Protect, and Mitigate Threats Against Brocade SAN FOS with IBM QRadar

2022-03-02 O'Reilly Amazon

book

IBM Storage

data data-engineering IBM Fabric Python Cyber Security

Enterprise networks are large and rely on numerous connected endpoints to ensure smooth operational efficiency. However, they also present a challenge from a security perspective. The focus of this Blueprint is to demonstrate an early threat detection against the network fabric that is powered by Brocade that uses IBM® QRadar®. It also protects the same if a cyberattack or an internal threat by rouge user within the organization occurs. The publication also describes how to configure the syslog that is forwarding on Brocade SAN FOS. Finally, it explains how the forwarded audit events are used for detecting the threat and runs the custom action to mitigate the threat. The focus of this publication is to proactively start a cyber resilience workflow from IBM QRadar to block an IP address when multiple failed logins on Brocade switch are detected. As part of early threat detection, a sample rule that us used by IBM QRadar is shown. A Python script that also is used as a response to block the user's IP address in the switch is provided. Customers are encouraged to create control path or data path use cases, customized IBM QRadar rules, and custom response scripts that are best-suited to their environment. The use cases, QRadar rules, and Python script that are presented here are templates only and cannot be used as-is in an environment.

Snowflake Access Control: Mastering the Features for Data Privacy and Regulatory Compliance

2022-03-02 O'Reilly Amazon

book

Jessica Megan Larson

data data-engineering Snowflake Cloud Computing DWH GDPR/CCPA

Understand the different access control paradigms available in the Snowflake Data Cloud and learn how to implement access control in support of data privacy and compliance with regulations such as GDPR, APPI, CCPA, and SOX. The information in this book will help you and your organization adhere to privacy requirements that are important to consumers and becoming codified in the law. You will learn to protect your valuable data from those who should not see it while making it accessible to the analysts whom you trust to mine the data and create business value for your organization. Snowflake is increasingly the choice for companies looking to move to a data warehousing solution, and security is an increasing concern due to recent high-profile attacks. This book shows how to use Snowflake's wide range of features that support access control, making it easier to protect data access from the data origination point all the way to the presentation and visualization layer.Reading this book helps you embrace the benefits of securing data and provide valuable support for data analysis while also protecting the rights and privacy of the consumers and customers with whom you do business. What You Will Learn Identify data that is sensitive and should be restricted Implement access control in the Snowflake Data Cloud Choose the right access control paradigm for your organization Comply with CCPA, GDPR, SOX, APPI, and similar privacy regulations Take advantage of recognized best practices for role-based access control Prevent upstream and downstream services from subverting your access control Benefit from access control features unique to the Snowflake Data Cloud Who This Book Is For Data engineers, database administrators, and engineering managers who wantto improve their access control model; those whose access control model is not meeting privacy and regulatory requirements; those new to Snowflake who want to benefit from access control features that are unique to the platform; technology leaders in organizations that have just gone public and are now required to conform to SOX reporting requirements

Mastering Snowflake Solutions: Supporting Analytics and Data Sharing

2022-02-27 O'Reilly Amazon

book

Adam Morton

data data-engineering Snowflake Agile/Scrum Analytics BI

Design for large-scale, high-performance queries using Snowflake’s query processing engine to empower data consumers with timely, comprehensive, and secure access to data. This book also helps you protect your most valuable data assets using built-in security features such as end-to-end encryption for data at rest and in transit. It demonstrates key features in Snowflake and shows how to exploit those features to deliver a personalized experience to your customers. It also shows how to ingest the high volumes of both structured and unstructured data that are needed for game-changing business intelligence analysis. Mastering Snowflake Solutions starts with a refresher on Snowflake’s unique architecture before getting into the advanced concepts that make Snowflake the market-leading product it is today. Progressing through each chapter, you will learn how to leverage storage, query processing, cloning, data sharing, and continuous data protection features. This approach allows for greater operational agility in responding to the needs of modern enterprises, for example in supporting agile development techniques via database cloning. The practical examples and in-depth background on theory in this book help you unleash the power of Snowflake in building a high-performance system with little to no administrative overhead. Your result from reading will be a deep understanding of Snowflake that enables taking full advantage of Snowflake’s architecture to deliver value analytics insight to your business. What You Will Learn Optimize performance and costs associated with your use of the Snowflake data platform Enable data security to help in complying with consumer privacy regulations such as CCPA and GDPR Share data securely both inside your organization and with external partners Gain visibility to each interaction with your customersusing continuous data feeds from Snowpipe Break down data silos to gain complete visibility your business-critical processes Transform customer experience and product quality through real-time analytics Who This Book Is for Data engineers, scientists, and architects who have had some exposure to the Snowflake data platform or bring some experience from working with another relational database. This book is for those beginning to struggle with new challenges as their Snowflake environment begins to mature, becoming more complex with ever increasing amounts of data, users, and requirements. New problems require a new approach and this book aims to arm you with the practical knowledge required to take advantage of Snowflake’s unique architecture to get the results you need.

Analytics Optimization with Columnstore Indexes in Microsoft SQL Server: Optimizing OLAP Workloads

2022-02-26 O'Reilly Amazon

book

Edward Pollack

data data-engineering relational-databases microsoft-sql-server Analytics BI

Meet the challenge of storing and accessing analytic data in SQL Server in a fast and performant manner. This book illustrates how columnstore indexes can provide an ideal solution for storing analytic data that leads to faster performing analytic queries and the ability to ask and answer business intelligence questions with alacrity. The book provides a complete walk through of columnstore indexing that encompasses an introduction, best practices, hands-on demonstrations, explanations of common mistakes, and presents a detailed architecture that is suitable for professionals of all skill levels. With little or no knowledge of columnstore indexing you can become proficient with columnstore indexes as used in SQL Server, and apply that knowledge in development, test, and production environments. This book serves as a comprehensive guide to the use of columnstore indexes and provides definitive guidelines. You will learn when columnstore indexes shouldbe used, and the performance gains that you can expect. You will also become familiar with best practices around architecture, implementation, and maintenance. Finally, you will know the limitations and common pitfalls to be aware of and avoid. As analytic data can become quite large, the expense to manage it or migrate it can be high. This book shows that columnstore indexing represents an effective storage solution that saves time, money, and improves performance for any applications that use it. You will see that columnstore indexes are an effective performance solution that is included in all versions of SQL Server, with no additional costs or licensing required. What You Will Learn Implement columnstore indexes in SQL Server Know best practices for the use and maintenance of analytic data in SQL Server Use metadata to fully understand the size and shape of data stored in columnstore indexes Employ optimal ways to load, maintain, and delete data from large analytic tables Know how columnstore compression saves storage, memory, and time Understand when a columnstore index should be used instead of a rowstore index Be familiar with advanced features and analytics Who This Book Is For Database developers, administrators, and architects who are responsible for analytic data, especially for those working with very large data sets who are looking for new ways to achieve high performance in their queries, and those with immediate or future challenges to analytic data and query performance who want a methodical and effective solution

What Is Distributed SQL?

2022-02-25 O'Reilly Amazon

book

Charles Custer , Jim Walker , Paul Modderman

data data-engineering nosql-databases Cloud Computing ELK NoSQL

Globally available resources have become the status quo. They're accessible, distributed, and resilient. Our traditional SQL database options haven't kept up. Centralized SQL databases, even those with read replicas in the cloud, put all the transactional load on a central system. The further away that a transaction happens from the user, the more the user experience suffers. If the transactional data powering the application is greatly slowed down, fast-loading web pages mean nothing. In this report, Paul Modderman, Jim Walker, and Charles Custer explain how distributed SQL fits all applications and eliminates complex challenges like sharding from traditional RDBMS systems. You'll learn how distributed SQL databases can reach global scale without introducing the consistency trade-offs found in NoSQL solutions. These databases come to life through cloud computing, while legacy databases simply can't rise to meet the elastic and ubiquitous new paradigm. You'll learn: Key concepts driving this new technology, including the CAP theorem, the Raft consensus algorithm, multiversion concurrency control, and Google Spanner How distributed SQL databases meet enterprise requirements, including management, security, integration, and Everything as a Service (XaaS) The impact that distributed SQL has already made in the telecom, retail, and gaming industries Why serverless computing is an ideal fit for distributed SQL How distributed SQL can help you expand your company's strategic plan

Electronic Health Records with Epic and IBM FlashSystem 9500 Blueprint Version 2 Release 4

2022-02-23 O'Reilly Amazon

book

IBM

data data-engineering IBM

This information is intended to facilitate the deployment of IBM© FlashSystem© for the Epic Corporation electronic health record (EHR) solution by describing the requirements and specifications for configuring IBM FlashSystem 9500 and its parameters. This document also describes the required steps to configure the server that hosts the EHR application. To complete these tasks, you must be knowledgeable of IBM FlashSystem 9500 and Epic applications. This Blueprint provides the following information: A solutions architecture and the related solution configuration information for the following essential components of software and hardware: Detailed technical configuration steps for configuring IBM FlashSystem 9500 Server configuration details for Caché database and Epic applications

IBM DS8000 Easy Tier (Updated for DS8000 R9.0)

2022-02-23 O'Reilly Amazon

book

Peter Kimmel , Matthew Houzenga , Bertrand Dufrasne , Dennis Robertson

data data-engineering IBM

This IBM® Redpaper™ publication describes the concepts and functions of IBM System Storage® Easy Tier®, and explains its practical use with the IBM DS8000® series and License Machine Code 7.9.0.xxx (also known as R9.0).. Easy Tier is designed to automate data placement throughout the storage system disks pool. It enables the system to (automatically and without disruption to applications) relocate data (at the extent level) across up to three drive tiers. The process is fully automated. Easy Tier also automatically rebalances extents among ranks within the same tier, removing workload skew between ranks, even within homogeneous and single-tier extent pools. Easy Tier supports a Manual Mode that enables you to relocate full volumes. Manual Mode also enables you to merge extent pools and offers a rank depopulation function. Easy Tier fully supports thin-provisioned Extent Space Efficient fixed block (FB) and count key data (CKD) volumes in Manual Mode and Automatic Mode. Easy Tier also supports extent pools with small extents (16 MiB extents for FB pools and 21 cylinders extents for CKD pools). Easy Tier also supports high-performance and high-capacity flash drives in the High-performance flash enclosure, and it enables additional user controls at the pool and volume levels. This paper is aimed at those professionals who want to understand the Easy Tier concept and its underlying design. It also provides guidance and practical illustrations for users who want to use the Easy Tier Manual Mode capabilities. Easy Tier includes additional capabilities to further enhance your storage performance automatically: Easy Tier Application, and Easy Tier Heat Map Transfer.

IBM DS8900F Architecture and Implementation: Updated for Release 9.2

2022-02-22 O'Reilly Amazon

book

Connie Riggins , Lisa Martinez , Bertrand Dufrasne , Mike Stenson , Jeff Cook , Sherri Brunson

data data-engineering IBM AI/ML BI

This IBM® RedpaperRedbooks® publication describes the concepts, architecture, and implementation of the IBM DS8900F family. The WhitepaperRedpaperbook provides reference information to assist readers who need to plan for, install, and configure the DS8900F systems. This edition applies to DS8900F systems with IBM DS8000® Licensed Machine Code (LMC) 7.9.20 (bundle version 89.20.xx.x), referred to as Release 9.2. The DS8900F is an all-flash system exclusively, and it offers three classes: DS8980F: Analytic Class: The DS8980F Analytic Class offers best performance for organizations that want to expand their workload possibilities to artificial intelligence (AI), Business Intelligence (BI), and machine learning (ML). IBM DS8950F: Agility Class all-flash: The Agility Class consolidates all your mission-critical workloads for IBM Z®, IBM LinuxONE, IBM Power Systems, and distributed environments under a single all-flash storage solution.. IBM DS8910F: Flexibility Class all-flash: The Flexibility Class reduces complexity while addressing various workloads at the lowest DS8900F family entry cost. . TThe DS8900F architecture relies on powerful IBM POWER9™ processor-based servers that manage the cache to streamline disk input/output (I/O), which maximizes performance and throughput. These capabilities are further enhanced by High-Performance Flash Enclosures (HPFE) Gen2. Like its predecessors, the DS8900F supports advanced disaster recovery (DR) solutions, business continuity solutions, and thin provisioning. The IBM DS8910F Rack-Mounted model 993 is described in IBM DS8910F Model 993 Rack-Mounted Storage System Release 9.1, REDP-5566.

talk-data.com

O'Reilly Data Engineering Books

Top Topics

Top Speakers

Logging in Action

PostgreSQL 14 Administration Cookbook

IBM Power Systems Virtual Server Guide for IBM i

Grokking Streaming Systems

IBM FlashSystem Safeguarded Copy Implementation Guide

Simplify Big Data Analytics with Amazon EMR

Getting Started with Elastic Stack 8.0

Modern Data Engineering with Apache Spark: A Hands-On Guide for Building Mission-Critical Streaming Applications

IBM TS4500 R8 Tape Library Guide

Data Lakehouse in Action

IBM Spectrum Virtualize, IBM FlashSystem, and IBM SAN Volume Controller Security Feature Checklist

Data Analysis with Python and PySpark

Multimedia Security, Volume 1

Getting Started with CockroachDB

IBM Spectrum Archive Enterprise Edition V1.3.2.2: Installation and Configuration Guide

Databases Illuminated, 4th Edition

Data Mesh

Cyber Resilient Infrastructure: Detect, Protect, and Mitigate Threats Against Brocade SAN FOS with IBM QRadar

Snowflake Access Control: Mastering the Features for Data Privacy and Regulatory Compliance

Mastering Snowflake Solutions: Supporting Analytics and Data Sharing

Analytics Optimization with Columnstore Indexes in Microsoft SQL Server: Optimizing OLAP Workloads

What Is Distributed SQL?

Electronic Health Records with Epic and IBM FlashSystem 9500 Blueprint Version 2 Release 4

IBM DS8000 Easy Tier (Updated for DS8000 R9.0)

IBM DS8900F Architecture and Implementation: Updated for Release 9.2