O'Reilly Data Engineering Books

Logging in Action

2022-04-10 O'Reilly Amazon

book

Phil Wilkins

data data-engineering search elasticsearch elastic-stack-elk-stack elastic stack (elk stack)

Make log processing a real asset to your organization with powerful and free open source tools. In Logging in Action you will learn how to: Deploy Fluentd and Fluent Bit into traditional on-premises, IoT, hybrid, cloud, and multi-cloud environments, both small and hyperscaled Configure Fluentd and Fluent Bit to solve common log management problems Use Fluentd within Kubernetes and Docker services Connect a custom log source or destination with Fluentd’s extensible plugin framework Logging best practices and common pitfalls Logging in Action is a guide to optimize and organize logging using the CNCF Fluentd and Fluent Bit projects. You’ll use the powerful log management tool Fluentd to solve common log management, and learn how proper log management can improve performance and make management of software and infrastructure solutions easier. Through useful examples like sending log-driven events to Slack, you’ll get hands-on experience applying structure to your unstructured data. About the Technology Don’t fly blind! An effective logging system can help you see and correct problems before they cripple your software. With the Fluentd log management tool, it’s a snap to monitor the behavior and health of your software and infrastructure in real time. Designed to collect and process log data from multiple sources using the industry-standard JSON format, Fluentd delivers a truly unified logging layer across all your systems. About the Book Logging in Action teaches you to record and analyze application and infrastructure data using Fluentd. Using clear, relevant examples, it shows you exactly how to transform raw system data into a unified stream of actionable information. You’ll discover how logging configuration impacts the way your system functions and set up Fluentd to handle data from legacy IT environments, local data centers, and massive Kubernetes-driven distributed systems. You’ll even learn how to implement complex log parsing with RegEx and output events to MongoDB and Slack. What's Inside Capture log events from a wide range of systems and software, including Kubernetes and Docker Connect to custom log sources and destinations Employ Fluentd’s extensible plugin framework Create a custom plugin for niche problems About the Reader For developers, architects, and operations professionals familiar with the basics of monitoring and logging. About the Author Phil Wilkins has spent over 30 years in the software industry. Has worked for small startups through to international brands. Quotes I highly recommend using Logging in Action as a getting-started guide, a refresher, or as a way to optimize your logging journey. - From the Foreword by Anurag Gupta, Fluent maintainer and Cofounder, Calyptia Covers everything you need if you want to implement a logging system using open source technology such as Fluentd and Kubernetes. - Alex Saez, Naranja X A great exploration of the features and capabilities of Fluentd, along with very useful hands-on exercises. - George Thomas, Manhattan Associates A practical holistic guide to integrating logging into your enterprise architecture. - Satej Sahu, Honeywell

Data Engineering with Google Cloud Platform

2022-03-31 O'Reilly Amazon

book

Adi Wijaya

it-operations cloud-computing cloud-platforms google-cloud AI/ML Airflow

In 'Data Engineering with Google Cloud Platform', you'll explore how to construct efficient, scalable data pipelines using GCP services. This hands-on guide covers everything from building data warehouses to deploying machine learning pipelines, helping you master GCP's ecosystem. What this Book will help me do Build comprehensive data ingestion and transformation pipelines using BigQuery, Cloud Storage, and Dataflow. Design end-to-end orchestration flows with Airflow and Cloud Composer for automated data processing. Leverage Pub/Sub for building real-time event-driven systems and streaming architectures. Gain skills to design and manage secure data systems with IAM and governance strategies. Prepare for and pass the Professional Data Engineer certification exam to elevate your career. Author(s) Adi Wijaya is a seasoned data engineer with significant experience in Google Cloud Platform products and services. His expertise in building data systems has equipped him with insights into the real-world challenges data engineers face. Adi aims to demystify technical topics and deliver practical knowledge through his writing, helping tech professionals excel. Who is it for? This book is tailored for data engineers and data analysts who want to leverage GCP for building efficient and scalable data systems. Readers should have a beginner-level understanding of topics like data science, Python, and Linux to fully benefit from the material. It is also suitable for individuals preparing for the Google Professional Data Engineer exam. The book is a practical companion for enhancing cloud and data engineering skills.

PostgreSQL 14 Administration Cookbook

2022-03-31 O'Reilly Amazon

book

Simon Riggs , Gianni Ciolli

data data-engineering relational-databases postgresql Cloud Computing Cyber Security

PostgreSQL 14 Administration Cookbook provides a hands-on guide to mastering the administration of PostgreSQL 14. With over 175 recipes, this book equips you with practical techniques to manage, secure, and optimize your PostgreSQL databases, ensuring they are robust and high-performing. What this Book will help me do Master managing PostgreSQL databases both on-premises and in the cloud efficiently. Implement effective backup and recovery strategies to secure your data. Leverage the latest features of PostgreSQL 14 to enhance your database workflows. Understand and apply best practices for maintaining high availability and performance. Troubleshoot real-world challenges with guided solutions and expert insights. Author(s) Simon Riggs and Gianni Ciolli are seasoned database experts with years of experience working with PostgreSQL. Simon is a PostgreSQL core team member, contributing his technical knowledge towards building robust database solutions, while Gianni brings a wealth of expertise in database administration and support. Together, they share a passion for making complex database concepts accessible and actionable. Who is it for? This book is for database administrators, data architects, and developers who manage PostgreSQL databases and are looking to deepen their knowledge. It is suitable for professionals with some experience in PostgreSQL who aim to maximize their database's performance and security, as well as for those new to the system seeking a comprehensive start. Readers with an interest in practical, problem-solving approaches to database management will greatly benefit from this cookbook.

Simplify Big Data Analytics with Amazon EMR

2022-03-25 O'Reilly Amazon

book

Sakti Mishra

data data-engineering apache-spark Analytics AWS Amazon EMR

Simplify Big Data Analytics with Amazon EMR is a thorough guide to harnessing Amazon's EMR service for big data processing and analytics. From distributed computation pipelines to real-time streaming analytics, this book provides hands-on knowledge and actionable steps for implementing data solutions efficiently. What this Book will help me do Understand the architecture and key components of Amazon EMR and how to deploy it effectively. Learn to configure and manage distributed data processing pipelines using Amazon EMR. Implement security and data governance best practices within the Amazon EMR ecosystem. Master batch ETL and real-time analytics techniques using technologies like Apache Spark. Apply optimization and cost-saving strategies to scalable data solutions. Author(s) Sakti Mishra is a seasoned data professional with extensive expertise in deploying scalable analytics solutions on cloud platforms like AWS. With a background in big data technologies and a passion for teaching, Sakti ensures practical insights accompany every concept. Readers will find his approach thorough, hands-on, and highly informative. Who is it for? This book is perfect for data engineers, data scientists, and other professionals looking to leverage Amazon EMR for scalable analytics. If you are familiar with Python, Scala, or Java and have some exposure to Hadoop or AWS ecosystems, this book will empower you to design and implement robust data pipelines efficiently.

Data Lakehouse in Action

2022-03-17 O'Reilly Amazon

book

Pradeep Menon

data data-engineering storage-repositories data-lake Analytics Azure

"Data Lakehouse in Action" provides a comprehensive exploration of the Data Lakehouse architecture, a modern solution for scalable and effective large-scale analytics. This book guides you through understanding the principles and components of the architecture, and its implementation using cloud platforms like Azure. Learn the practical techniques for designing robust systems tailored to organizational needs and maturity. What this Book will help me do Understand the evolution and need for modern data architecture patterns like Data Lakehouse. Learn how to design systems for data ingestion, storage, processing, and serving in a Data Lakehouse. Develop best practices for data governance and security in the Data Lakehouse architecture. Discover various analytics workflows enabled by the Data Lakehouse, including real-time and batch approaches. Implement practical Data Lakehouse patterns on a cloud platform, and integrate them with macro-patterns such as Data Mesh. Author(s) Pradeep Menon is a seasoned data architect and engineer with extensive experience implementing data analytics solutions for leading companies. With a penchant for simplifying complex architectures, Pradeep has authored several technical publications and frequently shares his expertise at industry conferences. His hands-on approach and passion for teaching shine through in his practical guides. Who is it for? This book is ideal for data professionals including architects, engineers, and data strategists eager to enhance their knowledge in modern analytics platforms. If you have a basic understanding of data architecture and are curious about implementing systems governed by the Data Lakehouse paradigm, this book is for you. It bridges foundational concepts with advanced practices, making it suitable for learners aiming to contribute effectively to their organization's analytics efforts.

IBM Spectrum Virtualize, IBM FlashSystem, and IBM SAN Volume Controller Security Feature Checklist

2022-03-16 O'Reilly Amazon

book

James Whitaker , Bill Scales , Barry Whyte

data data-engineering IBM Cloud Computing Cyber Security

IBM Spectrum® Virtualize based storage systems are secure storage platforms that implement various security-related features, in terms of system-level access controls and data-level security features. This document outlines the available security features and options of IBM Spectrum Virtualize based storage systems. It is not intended as a "how to" or best practice document. Instead, it is a checklist of features that can be reviewed by a user security team to aid in the definition of a policy to be followed when implementing IBM FlashSystem®, IBM SAN Volume Controller, and IBM Spectrum Virtualize for Public Cloud. The topics that are discussed in this paper can be broadly split into two categories: System security This type of security encompasses the first three lines of defense that prevent unauthorized access to the system, protect the logical configuration of the storage system, and restrict what actions users can perform. It also ensures visibility and reporting of system level events that can be used by a Security Information and Event Management (SIEM) solution, such as IBM QRadar®. Data security This type of security encompasses the fourth line of defense. It protects the data that is stored on the system against theft, loss, or attack. These data security features include, but are not limited to, encryption of data at rest (EDAR) or IBM Safeguarded Copy (SGC). This document is correct as of IBM Spectrum Virtualize version 8.5.0.

Data Analysis with Python and PySpark

2022-03-15 O'Reilly Amazon

book

Jonathan Rioux

data data-engineering apache-spark PySpark AI/ML Analytics

Think big about your data! PySpark brings the powerful Spark big data processing engine to the Python ecosystem, letting you seamlessly scale up your data tasks and create lightning-fast pipelines. In Data Analysis with Python and PySpark you will learn how to: Manage your data as it scales across multiple machines Scale up your data programs with full confidence Read and write data to and from a variety of sources and formats Deal with messy data with PySpark’s data manipulation functionality Discover new data sets and perform exploratory data analysis Build automated data pipelines that transform, summarize, and get insights from data Troubleshoot common PySpark errors Creating reliable long-running jobs Data Analysis with Python and PySpark is your guide to delivering successful Python-driven data projects. Packed with relevant examples and essential techniques, this practical book teaches you to build pipelines for reporting, machine learning, and other data-centric tasks. Quick exercises in every chapter help you practice what you’ve learned, and rapidly start implementing PySpark into your data systems. No previous knowledge of Spark is required. About the Technology The Spark data processing engine is an amazing analytics factory: raw data comes in, insight comes out. PySpark wraps Spark’s core engine with a Python-based API. It helps simplify Spark’s steep learning curve and makes this powerful tool available to anyone working in the Python data ecosystem. About the Book Data Analysis with Python and PySpark helps you solve the daily challenges of data science with PySpark. You’ll learn how to scale your processing capabilities across multiple machines while ingesting data from any source—whether that’s Hadoop clusters, cloud data storage, or local data files. Once you’ve covered the fundamentals, you’ll explore the full versatility of PySpark by building machine learning pipelines, and blending Python, pandas, and PySpark code. What's Inside Organizing your PySpark code Managing your data, no matter the size Scale up your data programs with full confidence Troubleshooting common data pipeline problems Creating reliable long-running jobs About the Reader Written for data scientists and data engineers comfortable with Python. About the Author As a ML director for a data-driven software company, Jonathan Rioux uses PySpark daily. He teaches the software to data scientists, engineers, and data-savvy business analysts. Quotes A clear and in-depth introduction for truly tackling big data with Python. - Gustavo Patino, Oakland University William Beaumont School of Medicine The perfect way to learn how to analyze and master huge datasets. - Gary Bake, Brambles Covers both basic and more advanced topics of PySpark, with a good balance between theory and hands-on. - Philippe Van Bergenl, P² Consulting For beginner to pro, a well-written book to help understand PySpark. - Raushan Kumar Jha, Microsoft

Getting Started with CockroachDB

2022-03-11 O'Reilly Amazon

book

Kishen Das Kondabagilu Rajanna

data data-engineering relational-databases cockroachdb Cloud Computing Cyber Security

"Getting Started with CockroachDB" provides an in-depth introduction to CockroachDB, a modern, distributed SQL database designed for cloud-native applications. Through this guide, you'll learn how to deploy, manage, and optimize CockroachDB to build highly reliable, scalable database solutions tailored for demanding and distributed workloads. What this Book will help me do Understand the architecture and design principles of CockroachDB and its fault-tolerant model. Learn how to set up and manage CockroachDB clusters for high availability and automatic scaling. Discover the concepts of data distribution and geo-partitioning to achieve low-latency global interactions. Explore indexing mechanisms in CockroachDB to optimize query performance for fast data retrieval. Master operational strategies, security configuration, and troubleshooting techniques for database management. Author(s) Kishen Das Kondabagilu Rajanna is an experienced software developer and database expert with a deep interest in distributed architectures. With hands-on experience working with CockroachDB and other database technologies, Kishen is passionate about sharing actionable insights with readers. His approach focuses on equipping developers with practical skills to excel in building and managing scalable, efficient database services. Who is it for? This book is ideal for software developers, database administrators, and database engineers seeking to learn CockroachDB for building robust, scalable database systems. If you're new to CockroachDB but possess basic database knowledge, this guide will equip you with the practical skills to leverage CockroachDB's capabilities effectively.

Snowflake Access Control: Mastering the Features for Data Privacy and Regulatory Compliance

2022-03-02 O'Reilly Amazon

book

Jessica Megan Larson

data data-engineering Snowflake Cloud Computing DWH GDPR/CCPA

Understand the different access control paradigms available in the Snowflake Data Cloud and learn how to implement access control in support of data privacy and compliance with regulations such as GDPR, APPI, CCPA, and SOX. The information in this book will help you and your organization adhere to privacy requirements that are important to consumers and becoming codified in the law. You will learn to protect your valuable data from those who should not see it while making it accessible to the analysts whom you trust to mine the data and create business value for your organization. Snowflake is increasingly the choice for companies looking to move to a data warehousing solution, and security is an increasing concern due to recent high-profile attacks. This book shows how to use Snowflake's wide range of features that support access control, making it easier to protect data access from the data origination point all the way to the presentation and visualization layer.Reading this book helps you embrace the benefits of securing data and provide valuable support for data analysis while also protecting the rights and privacy of the consumers and customers with whom you do business. What You Will Learn Identify data that is sensitive and should be restricted Implement access control in the Snowflake Data Cloud Choose the right access control paradigm for your organization Comply with CCPA, GDPR, SOX, APPI, and similar privacy regulations Take advantage of recognized best practices for role-based access control Prevent upstream and downstream services from subverting your access control Benefit from access control features unique to the Snowflake Data Cloud Who This Book Is For Data engineers, database administrators, and engineering managers who wantto improve their access control model; those whose access control model is not meeting privacy and regulatory requirements; those new to Snowflake who want to benefit from access control features that are unique to the platform; technology leaders in organizations that have just gone public and are now required to conform to SOX reporting requirements

Azure Data Engineer Associate Certification Guide

2022-02-28 O'Reilly Amazon

book

Newton Alex

it-operations cloud-computing cloud-platforms microsoft-azure Azure Cloud Computing

The "Azure Data Engineer Associate Certification Guide" is a comprehensive resource tailored for professionals preparing for the DP-203 exam. This book not only equips you with the theoretical knowledge needed to pass the certification but also provides hands-on experience with Azure's data engineering services. By the end of the book, you'll feel confident in tackling the certification exam and applying these skills on the job. What this Book will help me do Understand the core concepts of Azure data engineering and their practical applications. Gain proficiency in designing and deploying data storage and processing solutions using Azure services. Develop expertise in securing, monitoring, and optimizing Azure data solutions. Prepare effectively for the DP-203 certification exam with sample questions and practical exercises. Acquire skills to contribute to and excel in real-world Azure Data Engineering projects. Author(s) None Alex is a seasoned data engineer and cloud computing expert with years of experience designing, implementing, and optimizing data solutions. They have spent significant time working with Azure's ecosystem and have crafted this guide to share their insights and best practices. With a passion for teaching and mentoring, they aim to make complex technical concepts accessible to learners. Who is it for? This book caters to data engineering professionals aiming to achieve the DP-203 Azure Data Engineer Associate certification and advance their careers. It's ideal for individuals with fundamental knowledge of cloud-based data solutions and databases, seeking specialized expertise in Azure's data engineering tools. Whether you're upskilling or transitioning to a cloud-native environment, this guide serves as the roadmap to success.

What Is Distributed SQL?

2022-02-25 O'Reilly Amazon

book

Charles Custer , Jim Walker , Paul Modderman

data data-engineering nosql-databases Cloud Computing ELK NoSQL

Globally available resources have become the status quo. They're accessible, distributed, and resilient. Our traditional SQL database options haven't kept up. Centralized SQL databases, even those with read replicas in the cloud, put all the transactional load on a central system. The further away that a transaction happens from the user, the more the user experience suffers. If the transactional data powering the application is greatly slowed down, fast-loading web pages mean nothing. In this report, Paul Modderman, Jim Walker, and Charles Custer explain how distributed SQL fits all applications and eliminates complex challenges like sharding from traditional RDBMS systems. You'll learn how distributed SQL databases can reach global scale without introducing the consistency trade-offs found in NoSQL solutions. These databases come to life through cloud computing, while legacy databases simply can't rise to meet the elastic and ubiquitous new paradigm. You'll learn: Key concepts driving this new technology, including the CAP theorem, the Raft consensus algorithm, multiversion concurrency control, and Google Spanner How distributed SQL databases meet enterprise requirements, including management, security, integration, and Everything as a Service (XaaS) The impact that distributed SQL has already made in the telecom, retail, and gaming industries Why serverless computing is an ideal fit for distributed SQL How distributed SQL can help you expand your company's strategic plan

Cassandra: The Definitive Guide, (Revised) Third Edition, 3rd Edition

2022-01-23 O'Reilly Amazon

book

Eben Hewitt , Jeff Carpenter

data data-engineering nosql-databases Cassandra Cloud Computing Data Modelling

Imagine what you could do if scalability wasn't a problem. With this hands-on guide, you'll learn how the Cassandra database management system handles hundreds of terabytes of data while remaining highly available across multiple data centers. This revised third edition--updated for Cassandra 4.0 and new developments in the Cassandra ecosystem, including deployments in Kubernetes with K8ssandra--provides technical details and practical examples to help you put this database to work in a production environment. Authors Jeff Carpenter and Eben Hewitt demonstrate the advantages of Cassandra's nonrelational design, with special attention to data modeling. Developers, DBAs, and application architects looking to solve a database scaling issue or future-proof an application will learn how to harness Cassandra's speed and flexibility. Understand Cassandra's distributed and decentralized structure Use the Cassandra Query Language (CQL) and cqlsh (the CQL shell) Create a working data model and compare it with an equivalent relational model Design and develop applications using client drivers Explore cluster topology and learn how nodes exchange data Maintain a high level of performance in your cluster Deploy Cassandra onsite, in the cloud, or with Docker and Kubernetes Integrate Cassandra with Spark, Kafka, Elasticsearch, Solr, and Lucene

Installing and Configuring IBM Db2 AI for IBM z/OS v1.4.0

2022-01-04 O'Reilly Amazon

book

Tim Hogan , Janet Figone , Lydia Parziale , Guanjun Cai

data data-engineering relational-databases ibm-db2 AI/ML Cloud Computing

Artificial intelligence (AI) enables computers and machines to mimic the perception, learning, problem-solving, and decision-making capabilities of the human mind. AI development is made possible by the availability of large amounts of data and the corresponding development and wide availability of computer systems that can process all that data faster and more accurately than humans can. What happens if you infuse AI with a world-class database management system, such as IBM Db2®? IBM® has done just that with Db2 AI for z/OS (Db2ZAI). Db2ZAI is built to infuse AI and data science to assist businesses in the use of AI to develop applications more easily. With Db2ZAI, the following benefits are realized: Data science functionality Better built applications Improved database performance (and DBA's time and efforts are saved) through simplification and automation of error reporting and routine tasks Machine learning (ML) optimizer to improve query access paths and reduce the need for manual tuning and query optimization Integrated data access that makes data available from various vendors including private cloud providers. This IBM Redpaper® publication helps to simplify your installation by tailoring and configuration of Db2 AI for z/OS®. It was written for system programmers, system administrators, and database administrators.

Data Engineering with AWS

2021-12-29 O'Reilly Amazon

book

Gareth Eagar

data data-engineering AI/ML Athena AWS Big Data

Discover how to effectively build and manage data engineering pipelines using AWS with "Data Engineering with AWS". In this hands-on book, you'll explore the foundational principles of data engineering, learn to architect data pipelines, and work with essential AWS services to process, transform, and analyze data. What this Book will help me do Understand and implement modern data engineering pipelines with AWS services. Gain proficiency in automating data ingestion and transformation using Amazon tools. Perform efficient data queries and analysis leveraging Amazon Athena and Redshift. Create insightful data visualizations using Amazon QuickSight. Apply machine learning techniques to enhance data engineering processes. Author(s) None Eagar, a Senior Data Architect with over twenty-five years of experience, specializes in modern data architectures and cloud solutions. With a rich background in applying data engineering to real-world problems, None Eagar shares expertise in a clear and approachable way for readers. Who is it for? This book is perfect for data engineers and data architects aiming to grow their expertise in AWS-based solutions. It's also geared towards beginners in data engineering wanting to adopt the best practices. Those with a basic understanding of big data and cloud platforms will find it particularly valuable, but prior AWS experience is not required.

Optimizing Databricks Workloads

2021-12-24 O'Reilly Amazon

book

Sarthak Sarbahi , Anirudh Kala , Anshul Bhatnagar

data data-engineering apache-spark Analytics Big Data Cloud Computing

Unlock the full potential of Apache Spark on the Databricks platform with "Optimizing Databricks Workloads". This book equips you with must-know techniques to effectively configure, manage, and optimize big data processing pipelines. Dive into real-world scenarios and learn practical approaches to reduce costs and improve performance in your data engineering processes. What this Book will help me do Understand and apply optimization techniques for Databricks workloads. Choose the right cluster configurations to maximize efficiency and minimize costs. Leverage Delta Lake for performance-boosted data processing and optimization. Develop skills for managing Spark DataFrames and core functionalities in Databricks. Gain insights into real-world scenarios to effectively improve workload performance. Author(s) Anirudh Kala and the co-authors are experienced practitioners in the fields of data engineering and analytics. With years of professional expertise in leveraging Apache Spark and Databricks, they bring real-world insight into performance optimization. Their approach blends practical instruction with actionable strategies, making this book an essential guide for data engineers aiming to excel in this domain. Who is it for? This book is tailored for data engineers, data scientists, and cloud architects looking to elevate their skills in managing Databricks workloads. Ideal for readers with basic knowledge of Spark and Databricks, it helps them get hands-on with optimization techniques. If you are aiming to enhance your Spark-based data processing systems, this book offers the guidance you need.

Securing IBM Spectrum Scale with QRadar and IBM Cloud Pak for Security

2021-12-20 O'Reilly Amazon

book

IBM

data data-engineering IBM Cloud Computing Cyber Security

Cyberattacks are likely to remain a significant risk for the foreseeable future. Attacks on organizations can be external and internal. Investing in technology and processes to prevent these cyberattacks is the highest priority for these organizations. Organizations need well-designed procedures and processes to recover from attacks. The focus of this document is to demonstrate how the IBM® Unified Data Foundation (UDF) infrastructure plays an important role in delivering the persistence storage (PV) to containerized applications, such as IBM Cloud® Pak for Security (CP4S), with IBM Spectrum® Scale Container Native Storage Access (CNSA) that is deployed with IBM Spectrum scale CSI driver and IBM FlashSystem® storage with IBM Block storage driver with CSI driver. Also demonstrated is how this UDF infrastructure can be used as a preferred storage class to create back-end persistent storage for CP4S deployments. We also highlight how the file I/O events are captured in IBM QRadar® and offenses are generated based on predefined rules. After the offenses are generated, we show how the cases are automatically generated in IBM Cloud Pak® for Security by using the IBM QRadar SOAR Plugin, with a manually automated method to log a case in IBM Cloud Pak for Security. This document also describes the processes that are required for the configuration and integration of the components in this solution, such as: Integration of IBM Spectrum Scale with QRadar QRadar integration with IBM Cloud Pak for Security Integration of the IBM QRadar SOAR Plugin to generate automated cases in CP4S. Finally, this document shows the use of IBM Spectrum Scale CNSA and IBM FlashSystem storage that uses IBM block CSI driver to provision persistent volumes for CP4S deployment. All models of IBM FlashSystem family are supported by this document, including: FlashSystem 9100 and 9200 FlashSystem 7200 and FlashSystem 5000 models FlashSystem 5200 IBM SAN Volume Controller All storage that is running IBM Spectrum Virtualize software

Snowflake Essentials: Getting Started with Big Data in the Cloud

2021-12-14 O'Reilly Amazon

book

Bjorn Lindstrom , Frank Bell , Ruchi Soni , Sameer Videkar , Bhaskar B. Joshi , Raj Chirumamilla

data data-engineering Snowflake Analytics Big Data Cloud Computing

Understand the essentials of the Snowflake Database and the overall Snowflake Data Cloud. This book covers how Snowflake’s architecture is different from prior on-premises and cloud databases. The authors also discuss, from an insider perspective, how Snowflake grew so fast to become the largest software IPO of all time. Snowflake was the first database made specifically to be optimized with a cloud architecture. This book helps you get started using Snowflake by first understanding its architecture and what separates it from other database platforms you may have used. You will learn about setting up users and accounts, and then creating database objects. You will know how to load data into Snowflake and query and analyze that data, including unstructured data such as data in XML and JSON formats. You will also learn about Snowflake’s compute platform and the different data sharing options that are available. What YouWill Learn Run analytics in the Snowflake Data Cloud Create users and roles in Snowflake Set up security in Snowflake Set up resource monitors in Snowflake Set up and optimize Snowflake Compute Load, unload, and query structured and unstructured data (JSON, XML) within Snowflake Use Snowflake Data Sharing to share data Set up a Snowflake Data Exchange Use the Snowflake Data Marketplace Who This Book Is For Database professionals or information technology professionals who want to move beyond traditional database technologies by learning Snowflake, a new and massively scalable cloud-based database solution

Apache Pulsar in Action

2021-12-13 O'Reilly Amazon

book

David Kjerrumgaard

data data-engineering apache-pulsar Analytics Cloud Computing IoT

Deliver lightning fast and reliable messaging for your distributed applications with the flexible and resilient Apache Pulsar platform. In Apache Pulsar in Action you will learn how to: Publish from Apache Pulsar into third-party data repositories and platforms Design and develop Apache Pulsar functions Perform interactive SQL queries against data stored in Apache Pulsar Apache Pulsar in Action is a comprehensive and practical guide to building high-traffic applications with Pulsar. You’ll learn to use this mature and battle-tested platform to deliver extreme levels of speed and durability to your messaging. Apache Pulsar committer David Kjerrumgaard teaches you to apply Pulsar’s seamless scalability through hands-on case studies, including IOT analytics applications and a microservices app based on Pulsar functions. About the Technology Reliable server-to-server messaging is the heart of a distributed application. Apache Pulsar is a flexible real-time messaging platform built to run on Kubernetes and deliver the scalability and resilience required for cloud-based systems. Pulsar supports both streaming and message queuing, and unlike other solutions, it can communicate over multiple protocols including MQTT, AMQP, and Kafka’s binary protocol. About the Book Apache Pulsar in Action teaches you to build scalable streaming messaging systems using Pulsar. You’ll start with a rapid introduction to enterprise messaging and discover the unique benefits of Pulsar. Following crystal-clear explanations and engaging examples, you’ll use the Pulsar Functions framework to develop a microservices-based application. Real-world case studies illustrate how to implement the most important messaging design patterns. What's Inside Publish from Pulsar into third-party data repositories and platforms Design and develop Apache Pulsar functions Create an event-driven food delivery application About the Reader Written for experienced Java developers. No prior knowledge of Pulsar required. About the Author David Kjerrumgaard is a committer on the Apache Pulsar project. He currently serves as a Developer Advocate for StreamNative, where he develops Pulsar best practices and solutions. Quotes Apache Pulsar in Action is able to seamlessly mix the theory and abstract concepts with the clarity of practical step-by-step examples. I’d recommend to anyone! - Matteo Merli, co-creator of Apache Pulsar Gives readers insights into how the ‘magic’ works… Definitely recommended. - Henry Saputra, Splunk A complete, practical, fun-filled book. - Satej Kumar Sahu, Honeywell A definitive guide that will help you scale your applications. - Alessandro Campeis, Vimar The best book to start working with Pulsar. - Emanuele Piccinelli, Empirix

Cloud-Native Microservices with Apache Pulsar: Build Distributed Messaging Microservices

2021-12-04 O'Reilly Amazon

book

Rahul Sharma , Mohammad Atyab

data data-engineering apache-pulsar AWS AWS Lambda Cloud Computing

Apply different enterprise integration and processing strategies available with Pulsar, Apache's multi-tenant, high-performance, cloud-native messaging and streaming platform. This book is a comprehensive guide that examines using Pulsar Java libraries to build distributed applications with message-driven architecture. You'll begin with an introduction to Apache Pulsar architecture. The first few chapters build a foundation of message-driven architecture. Next, you'll perform a setup of all the required Pulsar components. The book also covers work with Apache Pulsar client library to build producers and consumers for the discussed patterns. You'll then explore the transformation, filter, resiliency, and tracing capabilities available with Pulsar. Moving forward, the book will discuss best practices when building message schemas and demonstrate integration patterns using microservices. Security is an important aspect of any application;the book will cover authentication and authorization in Apache Pulsar such as Transport Layer Security (TLS), OAuth 2.0, and JSON Web Token (JWT). The final chapters will cover Apache Pulsar deployment in Kubernetes. You'll build microservices and serverless components such as AWS Lambda integrated with Apache Pulsar on Kubernetes. After completing the book, you'll be able to comfortably work with the large set of out-of-the-box integration options offered by Apache Pulsar. What You'll Learn Examine the important Apache Pulsar components Build applications using Apache Pulsar client libraries Use Apache Pulsar effectively with microservices Deploy Apache Pulsar to the cloud Who This Book Is For Cloud architects and software developers who build systems in the cloud-native technologies.

Efficient MySQL Performance

2021-11-30 O'Reilly Amazon

book

Daniel Nichter

data data-engineering relational-databases MySQL Cloud Computing SQL

You'll find several books on basic or advanced MySQL performance, but nothing in between. That's because explaining MySQL performance without addressing its complexity is difficult. This practical book bridges the gap by teaching software engineers mid-level MySQL knowledge beyond the fundamentals, but well shy of deep-level internals required by database administrators (DBAs). Daniel Nichter shows you how to apply the best practices and techniques that directly affect MySQL performance. You'll learn how to improve performance by analyzing query execution, indexing for common SQL clauses and table joins, optimizing data access, and understanding the most important MySQL metrics. You'll also discover how replication, transactions, row locking, and the cloud influenceMySQL performance. Understand why query response time is the North Star of MySQL performance Learn query metrics in detail, including aggregation, reporting, and analysis See how to index effectively for common SQL clauses and table joins Explore the most important server metrics and what they reveal about performance Dive into transactions and row locking to gain deep, actionable insight Achieve remarkable MySQL performance at any scale

High Performance MySQL, 4th Edition

2021-11-18 O'Reilly Amazon

book

Jeremy Tinley , Silvia Botros

data data-engineering relational-databases MySQL Cloud Computing Cyber Security

How can you realize MySQL's full power? With High Performance MySQL, you'll learn advanced techniques for everything from setting service-level objectives to designing schemas, indexes, and queries to tuning your server, operating system, and hardware to achieve your platform's full potential. This guide also teaches database administrators safe and practical ways to scale applications through replication, load balancing, high availability, and failover. Updated to reflect recent advances in cloud- and self-hosted MySQL, InnoDB performance, and new features and tools, this revised edition helps you design a relational data platform that will scale with your business. You'll learn best practices for database security along with hard-earned lessons in both performance and database stability. Dive into MySQL's architecture, including key facts about its storage engines Learn how server configuration works with your hardware and deployment choices Make query performance part of your software delivery process Examine enhancements to MySQL's replication and high availability Compare different MySQL offerings in managed cloud environments Explore MySQL's full stack optimization from application-side configuration to server tuning Turn traditional database management tasks into automated processes

Storage as a Service Offering Guide

2021-11-16 O'Reilly Amazon

book

Vasfi Gucer , Hartmut Lonzer , Markus Standau

data data-engineering storage-repositories cloud-storage Cloud Computing Cloud Storage

IBM® Storage as a Service (STaaS) extends your hybrid cloud experience with a new flexible consumption model enabled for both your on-premises and hybrid cloud infrastructure needs, giving you the agility, cash flow efficiency, and services of cloud storage with the flexibility to dynamically scale up or down and only pay for what you use beyond the minimal capacity. This IBM Redpaper provides a detailed introduction to the IBM STaaS service. The paper is targeted for data center managers and storage administrators.

Storage Systems

2021-10-13 O'Reilly Amazon

book

Alexander Thomasian

data data-engineering storage-repositories networked-storage-file-systems networked storage & file systems Analytics

Storage Systems: Organization, Performance, Coding, Reliability and Their Data Processing was motivated by the 1988 Redundant Array of Inexpensive/Independent Disks proposal to replace large form factor mainframe disks with an array of commodity disks. Disk loads are balanced by striping data into strips—with one strip per disk— and storage reliability is enhanced via replication or erasure coding, which at best dedicates k strips per stripe to tolerate k disk failures. Flash memories have resulted in a paradigm shift with Solid State Drives (SSDs) replacing Hard Disk Drives (HDDs) for high performance applications. RAID and Flash have resulted in the emergence of new storage companies, namely EMC, NetApp, SanDisk, and Purestorage, and a multibillion-dollar storage market. Key new conferences and publications are reviewed in this book.The goal of the book is to expose students, researchers, and IT professionals to the more important developments in storage systems, while covering the evolution of storage technologies, traditional and novel databases, and novel sources of data. We describe several prototypes: FAWN at CMU, RAMCloud at Stanford, and Lightstore at MIT; Oracle's Exadata, AWS' Aurora, Alibaba's PolarDB, Fungible Data Center; and author's paper designs for cloud storage, namely heterogeneous disk arrays and hierarchical RAID. Surveys storage technologies and lists sources of data: measurements, text, audio, images, and video Familiarizes with paradigms to improve performance: caching, prefetching, log-structured file systems, and merge-trees (LSMs) Describes RAID organizations and analyzes their performance and reliability Conserves storage via data compression, deduplication, compaction, and secures data via encryption Specifies implications of storage technologies on performance and power consumption Exemplifies database parallelism for big data, analytics, deep learning via multicore CPUs, GPUs, FPGAs, and ASICs, e.g., Google's Tensor Processing Units

Snowflake Security: Securing Your Snowflake Data Cloud

2021-10-05 O'Reilly Amazon

book

Ben Herzberg , Yoav Cohen

data data-engineering Snowflake Cloud Computing Data Engineering DWH

This book is your complete guide to Snowflake security, covering account security, authentication, data access control, logging and monitoring, and more. It will help you make sure that you are using the security controls in a right way, are on top of access control, and making the most of the security features in Snowflake. Snowflake is the fastest growing cloud data warehouse in the world, and having the right methodology to protect the data is important both to data engineers and security teams. It allows for faster data enablement for organizations, as well as reducing security risks, meeting compliance requirements, and solving data privacy challenges. There are currently tens of thousands of people who are either data engineers/data ops in Snowflake-using organizations, or security people in such organizations. This book provides guidance when you want to apply certain capabilities, such as data masking, row-level security, column-level security, tackling rolehierarchy, building monitoring dashboards, etc., to your organizations. What You Will Learn Implement security best practices for Snowflake Set up user provisioning, MFA, OAuth, and SSO Set up a Snowflake security model Design roles architecture Use advanced access control such as row-based security and dynamic masking Audit and monitor your Snowflake Data Cloud Who This Book Is For Data engineers, data privacy professionals, and security teams either with security knowledge (preferably some data security knowledge) or with data engineering knowledge; in other words, either “Snowflake people” or “data people” who want to get security right, or “security people” who want to make sure that Snowflake gets handled right in terms of security

Learning MySQL, 2nd Edition

2021-09-09 O'Reilly Amazon

book

Vinicius M. Grippa , Sergey Kuzmichev

data data-engineering relational-databases MySQL Cloud Computing RDBMS

Get a comprehensive overview on how to set up and design an effective database with MySQL. This thoroughly updated edition covers MySQL's latest version, including its most important aspects. Whether you're deploying an environment, troubleshooting an issue, or engaging in disaster recovery, this practical guide provides the insights and tools necessary to take full advantage of this powerful RDBMS. Authors Vinicius Grippa and Sergey Kuzmichev from Percona show developers and DBAs methods for minimizing costs and maximizing availability and performance. You'll learn how to perform basic and advanced querying, monitoring and troubleshooting, database management and security, backup and recovery, and tuning for improved efficiency. This edition includes new chapters on high availability, load balancing, and using MySQL in the cloud. Get started with MySQL and learn how to use it in production Deploy MySQL databases on bare metal, on virtual machines, and in the cloud Design database infrastructures Code highly efficient queries Monitor and troubleshoot MySQL databases Execute efficient backup and restore operations Optimize database costs in the cloud Understand database concepts, especially those pertaining to MySQL

talk-data.com

O'Reilly Data Engineering Books

Top Topics

Top Speakers

Logging in Action

Data Engineering with Google Cloud Platform

PostgreSQL 14 Administration Cookbook

Simplify Big Data Analytics with Amazon EMR

Data Lakehouse in Action

IBM Spectrum Virtualize, IBM FlashSystem, and IBM SAN Volume Controller Security Feature Checklist

Data Analysis with Python and PySpark

Getting Started with CockroachDB

Snowflake Access Control: Mastering the Features for Data Privacy and Regulatory Compliance

Azure Data Engineer Associate Certification Guide

What Is Distributed SQL?

Cassandra: The Definitive Guide, (Revised) Third Edition, 3rd Edition

Installing and Configuring IBM Db2 AI for IBM z/OS v1.4.0

Data Engineering with AWS

Optimizing Databricks Workloads

Securing IBM Spectrum Scale with QRadar and IBM Cloud Pak for Security

Snowflake Essentials: Getting Started with Big Data in the Cloud

Apache Pulsar in Action

Cloud-Native Microservices with Apache Pulsar: Build Distributed Messaging Microservices

Efficient MySQL Performance

High Performance MySQL, 4th Edition

Storage as a Service Offering Guide

Storage Systems

Snowflake Security: Securing Your Snowflake Data Cloud

Learning MySQL, 2nd Edition