talk-data.com

Topic: S3 (Amazon S3)

Tags: object_storage, cloud_storage, aws

Tagged activities: 22

Activity Trend: peak of 11 activities per quarter (2020-Q1 to 2026-Q1)

Activities

Showing filtered results (filtering by: O'Reilly Data Engineering Books)

PostgreSQL Skills Development on Cloud: A Practical Guide to Database Management with AWS and Azure

This book provides a comprehensive approach to managing PostgreSQL cluster databases on Amazon Web Services and Microsoft Azure in the cloud, as well as in Docker and container environments on a Red Hat operating system. Furthermore, detailed references for managing PostgreSQL on both Windows and Mac are provided. This book condenses all the fundamental and essential concepts you need to manage a PostgreSQL cluster into a one-stop guide that is perfect for newcomers to Postgres database administration. Each chapter of the book provides historical context and documents version changes of the PostgreSQL cluster, elucidates practical "how-to" methods, and includes illustrations and keyword definitions, practices for application, a summary of key learnings, and questions to reinforce understanding. The book also outlines a clear study objective with a weekly learning schedule and hundreds of practice exercises, along with questions and answers. With its comprehensive and practical approach, this book will help you gain the confidence to manage all aspects of a PostgreSQL cluster in critical production environments so you can better support your organization's database infrastructure in the cloud and in containers. What You Will Learn Install and configure Postgres clusters in the cloud and in containers, monitor database logs, start and stop databases, troubleshoot, tune performance, back up and recover, and integrate with Amazon S3 and Azure Blob Storage Manage Postgres databases on Amazon Web Services and Microsoft Azure in the cloud, as well as in Docker and container environments on a Red Hat operating system Access sample references to scripting solutions and database management tools for working with Postgres, Redshift (based on Postgres 8.2), and Docker Create Amazon Machine Images (AMIs) and Azure Images for managing a fleet of Postgres clusters in the cloud Reinforce knowledge with a weekly learning schedule and hundreds of practice exercises, along with questions and answers Progress from simple concepts, such as how to choose the correct instance type, to creating complex machine images Gain access to an Amazon AMI with a DBA admin tool, allowing you to learn Postgres, Redshift, and Docker in a cloud environment Refer to a comprehensive summary of the documentation for Postgres, Amazon Web Services, Microsoft Azure, and Red Hat Linux for managing all aspects of Postgres cluster management in the cloud Who This Book Is For Newcomers to PostgreSQL database administration and cross-platform support, as well as DBAs looking to master PostgreSQL in the cloud.
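
To make the S3 integration concrete, here is a minimal Python sketch (not taken from the book; the database name, local path, bucket, and key are illustrative placeholders) that dumps a Postgres database with pg_dump and copies the archive to S3 with boto3, assuming pg_dump is on the PATH and AWS credentials are available through the standard credential chain:

    import subprocess
    import boto3

    DB_NAME = "appdb"                # placeholder database name
    BUCKET = "example-pg-backups"    # placeholder S3 bucket
    KEY = "nightly/appdb.dump"       # placeholder object key

    # Create a custom-format dump; authentication comes from PG* env vars or ~/.pgpass.
    subprocess.run(
        ["pg_dump", "--format=custom", "--file=/tmp/appdb.dump", DB_NAME],
        check=True,
    )

    # Upload the dump to S3 using the default AWS credential chain.
    boto3.client("s3").upload_file("/tmp/appdb.dump", BUCKET, KEY)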

IBM TS7700 Release 5.3 Guide

This IBM Redbooks® publication covers IBM TS7700 R5.3. The IBM TS7700 is part of a family of IBM Enterprise tape products. This book is intended for system architects and storage administrators who want to integrate their storage systems for optimal operation. Building on over 25 years of experience, the R5.3 release includes many features that enable improved performance, usability, and security. Highlights include the IBM TS7700 Advanced Object Store, an all-flash TS7770, grid resiliency enhancements, and Logical WORM retention. By using the same hierarchical storage techniques, the TS7700 (TS7770 and TS7760) can also off load to object storage. Because object storage is cloud-based and accessible from different regions, the TS7700 Cloud Storage Tier support essentially allows the cloud to be an extension of the grid. As of this writing, the TS7700C supports the ability to off load to IBM Cloud Object Storage, Amazon S3, and RSTOR. This publication explains features and concepts that are specific to the IBM TS7700 as of release R5.3. The R5.3 microcode level provides IBM TS7700 Cloud Storage Tier enhancements, IBM DS8000 Object Storage enhancements, Management Interface dual control security, and other smaller enhancements. The R5.3 microcode level can be installed on the IBM TS7770 and IBM TS7760 models only. TS7700 provides tape virtualization for the IBM Z® environment. Off loading to physical tape behind a TS7700 is used by hundreds of organizations around the world. New and existing capabilities of the TS7700 R5.3 release include the following highlights: Support for IBM TS1160 Tape Drives and JE/JM media Eight-way Grid Cloud, which consists of up to three generations of TS7700 Synchronous and asynchronous replication of virtual tape and TCT objects Grid access to all logical volume and object data independent of where it resides An all-flash TS7770 option for improved performance Full Advanced Object Store Grid Cloud support of DS8000 Transparent Cloud Tier Full AES256 encryption for data that is in-flight and at-rest Tight integration with IBM Z and DFSMS policy management DS8000 Object Store with AES256 in-flight encryption and compression Regulatory compliance through Logical WORM and LWORM Retention support Cloud Storage Tier support for archive, logical volume versions, and disaster recovery Optional integration with physical tape 16 Gb IBM FICON® throughput that exceeds 4 GBps per TS7700 cluster Grid Resiliency Support with Control Unit Initiated Reconfiguration (CUIR) support IBM Z hosts view up to 3,968 3490-type devices per TS7700 grid TS7770 Cache On Demand feature that uses capacity-based licensing TS7770 support of SSD within the VED server The TS7700T writes data by policy to physical tape through attachment to high-capacity, high-performance IBM TS1160, IBM TS1150, and IBM TS1140 tape drives that are installed in an IBM TS4500 or TS3500 tape library. The TS7770 models are based on high-performance and redundant IBM Power9® technology. They provide improved performance for most IBM Z tape workloads when compared to the previous generations of IBM TS7700.

Data Engineering with AWS - Second Edition

Learn data engineering and modern data pipeline design with AWS in this comprehensive guide! You will explore key AWS services like S3, Glue, Redshift, and QuickSight to ingest, transform, and analyze data, and you'll gain hands-on experience creating robust, scalable solutions. What this Book will help me do Understand and implement data ingestion and transformation processes using AWS tools. Optimize data for analytics with advanced AWS-powered workflows. Build end-to-end modern data pipelines leveraging cutting-edge AWS technologies. Design data governance strategies using AWS services for security and compliance. Visualize data and extract insights using Amazon QuickSight and other tools. Author(s) Gareth Eagar is a Senior Data Architect with over 25 years of experience in designing and implementing data solutions across various industries. He combines his deep technical expertise with a passion for teaching, aiming to make complex concepts approachable for learners at all levels. Who is it for? This book is intended for current or aspiring data engineers, data architects, and analysts seeking to leverage AWS for data engineering. It suits beginners with a basic understanding of data concepts who want to gain practical experience as well as intermediate professionals aiming to expand into AWS-based systems.
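
As a flavor of the ingestion-and-catalog pattern the book works through with S3 and Glue, here is a hedged boto3 sketch (the bucket, key, and crawler name are assumptions for illustration, and the Glue crawler is assumed to exist already):

    import boto3

    s3 = boto3.client("s3")
    glue = boto3.client("glue")

    # Land a raw file in the S3 "raw zone" bucket (names are placeholders).
    s3.upload_file("orders.csv", "example-raw-zone", "sales/orders/orders.csv")

    # Kick off an existing Glue crawler so the new data is added to the Data Catalog.
    glue.start_crawler(Name="sales-raw-crawler")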

Delta Lake: Up and Running

With the surge in big data and AI, organizations can rapidly create data products. However, the effectiveness of their analytics and machine learning models depends on the data's quality. Delta Lake's open source format offers a robust lakehouse framework over platforms like Amazon S3, ADLS, and GCS. This practical book shows data engineers, data scientists, and data analysts how to get Delta Lake and its features up and running. The ultimate goal of building data pipelines and applications is to gain insights from data. You'll understand how your storage solution choice determines the robustness and performance of the data pipeline, from raw data to insights. You'll learn how to: Use modern data management and data engineering techniques Understand how ACID transactions bring reliability to data lakes at scale Run streaming and batch jobs against your data lake concurrently Execute update, delete, and merge commands against your data lake Use time travel to roll back and examine previous data versions Build a streaming data quality pipeline following the medallion architecture
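
A minimal PySpark sketch of the time-travel feature mentioned above (assuming the delta-spark package and its jars are installed; the local path stands in for an S3, ADLS, or GCS location):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("delta-demo")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    path = "/tmp/events_delta"  # in practice this could be s3a://bucket/events_delta

    spark.range(5).write.format("delta").mode("overwrite").save(path)   # version 0
    spark.range(10).write.format("delta").mode("overwrite").save(path)  # version 1

    # Time travel: read the table as it looked at version 0.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
    print(v0.count())  # 5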

Geospatial Data Analytics on AWS

In "Geospatial Data Analytics on AWS," you will learn how to store, manage, and analyze geospatial data effectively using various AWS services. This book provides insight into building geospatial data lakes, leveraging AWS databases, and applying best practices to derive insights from spatial data in the cloud. What this Book will help me do Design and manage geospatial data lakes on AWS leveraging S3 and other storage solutions. Analyze geospatial data using AWS services such as Athena and Redshift. Utilize machine learning models for geospatial data processing and analytics using SageMaker. Visualize geospatial data through services like Amazon QuickSight and OpenStreetMap integration. Avoid common pitfalls when managing geospatial data in the cloud. Author(s) Scott Bateman, Janahan Gnanachandran, and Jeff DeMuth bring their extensive experience in cloud computing and geospatial analytics to this book. With backgrounds in cloud architecture, data science, and geospatial applications, they aim to make complex topics accessible. Their collaborative approach ensures readers can practically apply concepts to real-world challenges. Who is it for? This book is ideal for GIS and data professionals, including developers, analysts, and scientists. It suits readers with a basic understanding of geographical concepts but no prior AWS experience. If you're aiming to enhance your cloud-based geospatial data management and analytics skills, this is the guide for you.

Offloading storage volumes from Safeguarded Copy to AWS S3 Object Storage with IBM FlashSystem Transparent Cloud Tiering

The focus of this IBM® Blueprint is to showcase a method for storing volumes that are created by Safeguarded Copy off-premises in Amazon S3 object storage by using the IBM FlashSystem Transparent Cloud Tiering (TCT) feature. TCT enables volume data to be copied and transferred to object storage. The TCT feature supports creating connections to cloud service providers to store copies of volume data in private or public clouds. This feature is useful for organizations of all sizes when planning for disaster recovery operations or storing a copy of data as an extra backup. TCT provides seamless integration between the storage system and public or private clouds for both Safeguarded Copy and non-Safeguarded Copy volumes.

SQL Server 2022 Revealed: A Hybrid Data Platform Powered by Security, Performance, and Availability

Know how to use the new capabilities and cloud integrations in SQL Server 2022. This book covers the many innovative integrations with the Azure Cloud that make SQL Server 2022 the most cloud-connected edition ever. The book covers cutting-edge features such as the blockchain-based Ledger for creating a tamper-evident record of changes to data over time that you can rely on to be correct and reliable. You'll learn about built-in Query Intelligence capabilities to help you to upgrade with confidence that your applications will perform at least as fast after the upgrade as before. In fact, you'll probably see an increase in performance from the upgrade, with no code changes needed. Also covered are innovations such as contained availability groups and data virtualization with S3 object storage. New cloud integrations covered in this book include Microsoft Azure Purview and the use of Azure SQL for high availability and disaster recovery. The book also covers Azure Synapse Link with its built-in capabilities to take changes and put them into Synapse automatically. Anyone building their career around SQL Server will want this book for the valuable information it provides on building SQL skills from the edge to the cloud. What You Will Learn Know how to use all of the new capabilities and cloud integrations in SQL Server 2022 Connect to Azure for disaster recovery, near real-time analytics, and security Leverage the Ledger to create a tamper-evident record of data changes over time Upgrade from prior releases and achieve faster and more consistent performance with no code changes Access data and storage in different and new formats, such as Parquet and S3, without moving the data, using your existing T-SQL skills Explore new application scenarios using innovations with T-SQL in areas such as JSON and time series Who This Book Is For SQL Server professionals who want to upgrade their skills to the latest edition of SQL Server; those wishing to take advantage of new integrations with Microsoft Azure Purview (governance), Azure Synapse (analytics), and Azure SQL (HA and DR); and those in need of the increased performance and security offered by Query Intelligence and the new Ledger
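
As a rough sketch of the S3 data-virtualization scenario mentioned above (assuming a PolyBase-enabled SQL Server 2022 instance where an external data source named s3_src pointing at an S3 bucket has already been created; the connection details, file path, and names are placeholders), a Parquet file in S3 can be queried in place from Python via pyodbc:

    import pyodbc

    # Connection string values are placeholders for this sketch.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 18 for SQL Server};SERVER=sql2022.example.com;"
        "DATABASE=demo;UID=demo_user;PWD=example;TrustServerCertificate=yes"
    )

    # OPENROWSET reads the Parquet file directly from the S3-backed external data
    # source, so no data is copied into SQL Server first.
    cursor = conn.cursor()
    cursor.execute("""
        SELECT TOP 10 *
        FROM OPENROWSET(
            BULK 'sales/2022/part-0000.parquet',
            DATA_SOURCE = 's3_src',
            FORMAT = 'PARQUET') AS r;
    """)
    for row in cursor.fetchall():
        print(row)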

IBM TS7700 Release 5.2.2 Guide

This IBM® Redbooks® publication covers IBM TS7700 R5.2. The IBM TS7700 is part of a family of IBM Enterprise tape products. This book is intended for system architects and storage administrators who want to integrate their storage systems for optimal operation. Building on 25 years of experience, the R5.2 release includes many features that enable improved performance, usability, and security. Highlights include IBM TS7700 Advanced Object Store, an all-flash TS7770, grid resiliency enhancements, and Logical WORM retention. By using the same hierarchical storage techniques, the TS7700 (TS7770 and TS7760) can also off load to object storage. Because object storage is cloud-based and accessible from different regions, the TS7700 Cloud Storage Tier support essentially allows the cloud to be an extension of the grid. As of this writing, the TS7700C supports the ability to off load to IBM Cloud® Object Storage, Amazon S3, and RSTOR. This publication explains features and concepts that are specific to the IBM TS7700 as of release R5.2. The R5.2 microcode level provides IBM TS7700 Cloud Storage Tier enhancements, IBM DS8000® Object Storage enhancements, Management Interface dual control security, and other smaller enhancements. The R5.2 microcode level can be installed on the IBM TS7770 and IBM TS7760 models only. Note: Release 5.2 was split into two phases: R5.2 Phase 1 and R5.2 Phase 2. TS7700 provides tape virtualization for the IBM Z environment. Off loading to physical tape behind a TS7700 is used by hundreds of organizations around the world. Tape virtualization can help satisfy many requirements in a data processing environment. New and existing capabilities of the TS7700 R5.2.2 release include the following highlights: Eight-way Grid Cloud, which consists of up to three generations of TS7700 Synchronous and asynchronous replication of virtual tape and TCT objects Grid access to all logical volume and object data that is independent of where it exists An all-flash TS7770 option for improved performance Full Advanced Object Store Grid Cloud support of DS8000 Transparent Cloud Tier Full AES256 encryption for data that is in-flight and at-rest Tight integration with IBM Z® and DFSMS policy management DS8000 Object Store with AES256 in-flight encryption and compression Regulatory compliance through Logical WORM and LWORM Retention support Cloud Storage Tier support for archive, logical volume version, and disaster recovery Optional integration with physical tape 16 Gb IBM FICON® throughput that exceeds 5 GBps per TS7700 cluster Grid Resiliency Support with Control Unit Initiated Reconfiguration (CUIR) support IBM Z hosts view up to 3,968 common devices per TS7700 grid TS7770 Cache On Demand feature that uses capacity-based licensing TS7770 support of SSD within the VED server The TS7700T writes data by policy to physical tape through attachment to high-capacity, high-performance IBM TS1160, IBM TS1150, and IBM TS1140 tape drives that are installed in an IBM TS4500 or TS3500 tape library. The TS7770 models are based on high-performance and redundant IBM POWER9™ technology. They provide improved performance for most IBM Z tape workloads when compared to the previous generations of IBM TS7700.

Modern Data Engineering with Apache Spark: A Hands-On Guide for Building Mission-Critical Streaming Applications

Leverage Apache Spark within a modern data engineering ecosystem. This hands-on guide will teach you how to write fully functional applications, follow industry best practices, and learn the rationale behind these decisions. With Apache Spark as the foundation, you will follow a step-by-step journey beginning with the basics of data ingestion, processing, and transformation, and ending up with an entire local data platform running Apache Spark, Apache Zeppelin, Apache Kafka, Redis, MySQL, MinIO (S3), and Apache Airflow. Apache Spark applications solve a wide range of data problems from traditional data loading and processing to rich SQL-based analysis as well as complex machine learning workloads and even near real-time processing of streaming data. Spark fits well as a central foundation for any data engineering workload. This book will teach you to write interactive Spark applications using Apache Zeppelin notebooks, write and compile reusable applications and modules, and fully test both batch and streaming applications. You will also learn to containerize your applications using Docker and run and deploy your Spark applications using a variety of tools such as Apache Airflow, Docker, and Kubernetes. Reading this book will empower you to take advantage of Apache Spark to optimize your data pipelines and teach you to craft modular and testable Spark applications. You will create and deploy mission-critical streaming Spark applications in a low-stress environment that paves the way for your own path to production. What You Will Learn Simplify data transformation with Spark Pipelines and Spark SQL Bridge data engineering with machine learning Architect modular data pipeline applications Build reusable application components and libraries Containerize your Spark applications for consistency and reliability Use Docker and Kubernetes to deploy your Spark applications Speed up application experimentation using Apache Zeppelin and Docker Understand serializable structured data and data contracts Harness effective strategies for optimizing data in your data lakes Build end-to-end Spark structured streaming applications using Redis and Apache Kafka Embrace testing for your batch and streaming applications Deploy and monitor your Spark applications Who This Book Is For Professional software engineers who want to take their current skills and apply them to new and exciting opportunities within the data ecosystem, practicing data engineers who are looking for a guiding light while traversing the many challenges of moving from batch to streaming modes, data architects who wish to provide clear and concise direction for how best to harness and use Apache Spark within their organization, and those interested in the ins and outs of becoming a modern data engineer in today's fast-paced and data-hungry world
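
For readers curious what the MinIO (S3) piece of that local platform looks like from Spark, here is a hedged PySpark sketch (the endpoint, credentials, bucket, and file path are local-development placeholders, and the hadoop-aws/S3A connector is assumed to be on the classpath):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("minio-demo")
        # Point the S3A connector at a local MinIO endpoint instead of AWS.
        .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000")
        .config("spark.hadoop.fs.s3a.access.key", "minio")
        .config("spark.hadoop.fs.s3a.secret.key", "minio123")
        .config("spark.hadoop.fs.s3a.path.style.access", "true")
        .getOrCreate()
    )

    # Read a CSV file that was previously landed in the MinIO bucket.
    df = spark.read.option("header", "true").csv("s3a://example-bucket/raw/events.csv")
    df.show(5)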

IBM TS7700 Release 5.1 Guide

This IBM® Redbooks® publication covers IBM TS7700 R5.1. The IBM TS7700 is part of a family of IBM Enterprise tape products. This book is intended for system architects and storage administrators who want to integrate their storage systems for optimal operation. Building on over 20 years of virtual tape experience, the TS7770 supports the ability to store virtual tape volumes in an object store. The TS7700 has supported off loading to physical tape for over two decades. Off loading to physical tape behind a TS7700 is utilized by hundreds of organizations around the world. By using the same hierarchical storage techniques, the TS7700 (TS7770 and TS7760) can also off load to object storage. Because object storage is cloud-based and accessible from different regions, the TS7700 Cloud Storage Tier support essentially allows the cloud to be an extension of the grid. As of this writing, the TS7700C supports the ability to off load to IBM Cloud® Object Storage and Amazon S3. This publication explains features and concepts that are specific to the IBM TS7700 as of release R5.1. The R5.1 microcode level provides IBM TS7700 Cloud Storage Tier enhancements, IBM DS8000® Object Storage enhancements, Management Interface dual control security, and other smaller enhancements. The R5.1 microcode level can be installed on the IBM TS7770 and IBM TS7760 models only. TS7700 provides tape virtualization for the IBM Z environment. Tape virtualization can help satisfy the following requirements in a data processing environment: Improved reliability and resiliency Reduction in the time that is needed for the backup and restore process Reduction of services downtime that is caused by physical tape drive and library outages Reduction in cost, time, and complexity by moving primary workloads to virtual tape More efficient procedures for managing daily batch, backup, recall, and restore processing On-premises and off-premises object store cloud storage support as an alternative to physical tape for archive and disaster recovery New and existing capabilities of the TS7700 R5.1 include the following highlights: Eight-way Grid Cloud, which consists of up to three generations of TS7700 Synchronous and asynchronous replication Full AES256 encryption for replication data that is in-flight and at-rest Tight integration with IBM Z and DFSMS policy management Optional target for DS8000 Transparent Cloud Tier using DFSMS DS8000 Object Store with AES256 in-flight encryption and compression Optional Cloud Storage Tier support for archive and disaster recovery 16 Gb IBM FICON® throughput up to 5 GBps per TS7700 cluster IBM Z hosts view up to 3,968 common devices per TS7700 grid Grid access to all data independent of where it exists TS7770 Cache On Demand feature that uses capacity-based licensing TS7770 support of SSD within the VED server The TS7700T writes data by policy to physical tape through attachment to high-capacity, high-performance IBM TS1150 and IBM TS1140 tape drives that are installed in an IBM TS4500 or TS3500 tape library. The TS7770 models are based on high-performance and redundant IBM POWER9™ technology. They provide improved performance for most IBM Z tape workloads when compared to the previous generations of IBM TS7700.

Learning Spark, 2nd Edition

Data is bigger, arrives faster, and comes in a variety of formats, and it all needs to be processed at scale for analytics or machine learning. But how can you process such varied workloads efficiently? Enter Apache Spark. Updated to include Spark 3.0, this second edition shows data engineers and data scientists why structure and unification in Spark matters. Specifically, this book explains how to perform simple and complex data analytics and employ machine learning algorithms. Through step-by-step walk-throughs, code snippets, and notebooks, you'll be able to: Learn Python, SQL, Scala, or Java high-level Structured APIs Understand Spark operations and SQL Engine Inspect, tune, and debug Spark operations with Spark configurations and Spark UI Connect to data sources: JSON, Parquet, CSV, Avro, ORC, Hive, S3, or Kafka Perform analytics on batch and streaming data using Structured Streaming Build reliable data pipelines with open source Delta Lake and Spark Develop machine learning pipelines with MLlib and productionize models using MLflow
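
In the spirit of the book's Structured API walk-throughs, here is a small PySpark example (the S3 path and column names are illustrative placeholders) that reads Parquet data and queries it with Spark SQL:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("structured-apis").getOrCreate()

    # Read Parquet files from an (illustrative) S3 location into a DataFrame.
    flights = spark.read.parquet("s3a://example-bucket/flights/")
    flights.createOrReplaceTempView("flights")

    # The same data can then be queried with the Spark SQL engine.
    spark.sql("""
        SELECT origin, count(*) AS departures
        FROM flights
        GROUP BY origin
        ORDER BY departures DESC
        LIMIT 5
    """).show()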

IBM TS7700 R5.0 Cloud Storage Tier Guide

Building on over 20 years of virtual tape experience, the TS7700 (TS7760, TS7770) now supports the ability to store virtual tape volumes in an object store. This IBM® Redpaper publication helps you set up and configure the cloud object storage support for IBM Cloud™ Object Storage (COS) or Amazon Simple Storage Service (Amazon S3). The TS7700 has supported off loading to physical tape for over two decades. Off loading to physical tape behind a TS7700 is used by hundreds of organizations around the world. By using the same hierarchical storage techniques, the TS7700 can also off load to object storage. Because object storage is cloud-based and accessible from different regions, the TS7700 Cloud Storage Tier support essentially allows the cloud to be an extension of the grid. In this IBM Redpaper publication, we provide a brief overview of cloud technology with an emphasis on Object Storage. Object Storage is used by a broad set of technologies, including those technologies that are exclusive to IBM Z®. The aim of this publication is to provide a basic understanding of cloud, Object Storage, and the different ways it can be integrated into your environment. This Redpaper is intended for system architects and storage administrators with TS7700 experience who want to add the support of a Cloud Storage Tier to their TS7700 solution. Note: As of this writing, the TS7700C supports the ability to offload to on-premises cloud with IBM Cloud Object Storage and public cloud with Amazon S3.

Achieving Hybrid Cloud Cyber Resiliency with IBM Spectrum Virtualize for Public Cloud

This document describes how to achieve a Cyber Resiliency solution with IBM® Spectrum Virtualize for Public Cloud. The solution is designed to protect the data on IBM Spectrum™ Virtualize storage in a hybrid multicloud environment by deploying cloud backup to Amazon S3 using the Transparent Cloud Tiering function.

Mastering Large Datasets with Python

Modern data science solutions need to be clean, easy to read, and scalable. In Mastering Large Datasets with Python, author J.T. Wolohan teaches you how to take a small project and scale it up using a functionally influenced approach to Python coding. You’ll explore methods and built-in Python tools that lend themselves to clarity and scalability, like the high-performing parallelism method, as well as distributed technologies that allow for high data throughput. The abundant hands-on exercises in this practical tutorial will lock in these essential skills for any large-scale data science project. About the Technology Programming techniques that work well on laptop-sized data can slow to a crawl—or fail altogether—when applied to massive files or distributed datasets. By mastering the powerful map and reduce paradigm, along with the Python-based tools that support it, you can write data-centric applications that scale efficiently without requiring codebase rewrites as your requirements change. About the Book Mastering Large Datasets with Python teaches you to write code that can handle datasets of any size. You’ll start with laptop-sized datasets that teach you to parallelize data analysis by breaking large tasks into smaller ones that can run simultaneously. You’ll then scale those same programs to industrial-sized datasets on a cluster of cloud servers. With the map and reduce paradigm firmly in place, you’ll explore tools like Hadoop and PySpark to efficiently process massive distributed datasets, speed up decision-making with machine learning, and simplify your data storage with AWS S3. What's Inside An introduction to the map and reduce paradigm Parallelization with the multiprocessing module and pathos framework Hadoop and Spark for distributed computing Running AWS jobs to process large datasets About the Reader For Python programmers who need to work faster with more data. About the Author J. T. Wolohan is a lead data scientist at Booz Allen Hamilton, and a PhD researcher at Indiana University, Bloomington. Quotes A clear and efficient path to mastery of the map and reduce paradigm for developers of all levels. - Justin Fister, GrammarBot An amazing book for anybody looking to add parallel processing and the map/reduce pattern to their toolkit. - Gary Bake, Radius Payment Solutions Learn fundamentals of MapReduce and other core concepts and save money on expensive hardware! - Al Krinker, USPTO A comprehensive guide to the fundamentals of efficient Python data processing. - Craig Pfeifer, MITRE Corporation
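
A tiny standard-library sketch of the map and reduce paradigm the book builds on (squaring numbers across four worker processes, then summing the results); it is only meant to show the shape of the pattern, not code from the book:

    from functools import reduce
    from multiprocessing import Pool

    def square(n):
        return n * n

    if __name__ == "__main__":
        with Pool(processes=4) as pool:
            squared = pool.map(square, range(1_000))    # map step, run in parallel
        total = reduce(lambda a, b: a + b, squared, 0)  # reduce step
        print(total)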

IBM TS7700 Release 4.2 Guide

This IBM® Redbooks® publication covers IBM TS7700 R4.2. The IBM TS7700 is part of a family of IBM Enterprise tape products. This book is intended for system architects and storage administrators who want to integrate their storage systems for optimal operation. Building on over 20 years of virtual tape experience, the TS7760 now supports the ability to store virtual tape volumes in an object store. The TS7700 has supported off loading to physical tape for over two decades. Off loading to physical tape behind a TS7700 is utilized by hundreds of organizations around the world. Using the same hierarchical storage techniques, the TS7700 can also off load to object storage. Because object storage is cloud-based and accessible from different regions, the TS7760 Cloud Storage Tier support essentially allows the cloud to be an extension of the grid. As of the release of this document, the TS7760C supports the ability to off load to IBM Cloud Object Storage as well as Amazon S3. To learn about the TS7760 cloud storage tier function, planning, implementation, best practices, and support, see the IBM Redpaper IBM TS7760 R4.2 Cloud Storage Tier Guide, redp-5514, at: http://www.redbooks.ibm.com/abstracts/redp5514.html The IBM TS7700 offers a modular, scalable, and high-performance architecture for mainframe tape virtualization for the IBM Z® environment. It is a fully integrated, tiered storage hierarchy of disk and tape. This storage hierarchy is managed by robust storage management microcode with extensive self-management capability. It includes the following advanced functions: Improved reliability and resiliency Reduction in the time that is needed for the backup and restore process Reduction of services downtime that is caused by physical tape drive and library outages Reduction in cost, time, and complexity by moving primary workloads to virtual tape More efficient procedures for managing daily backup and restore processing Infrastructure simplification through reduction of the number of physical tape libraries, drives, and media TS7700 delivers the following new capabilities: TS7760C supports the ability to off load to IBM Cloud Object Storage as well as Amazon S3 8-way Grid Cloud consisting of any generation of TS7700 Synchronous and asynchronous replication Tight integration with IBM Z and DFSMS policy management Optional Transparent Cloud Tiering Optional integration with physical tape Cumulative 16 Gb FICON throughput up to 4.8 GBps IBM Z hosts view up to 496 equivalent devices Grid access to all data independent of where it exists The TS7760T writes data by policy to physical tape through attachment to high-capacity, high-performance IBM TS1150 and IBM TS1140 tape drives installed in an IBM TS4500 or TS3500 tape library. The TS7760 models are based on high-performance and redundant IBM POWER8® technology. They provide improved performance for most IBM Z tape workloads when compared to the previous generations of IBM TS7700.

Hands-On Big Data Analytics with PySpark

Dive into the exciting world of big data analytics with 'Hands-On Big Data Analytics with PySpark'. This practical guide offers you the tools and knowledge to tackle massive datasets using PySpark. By exploring real-world examples, you'll learn to unleash the power of distributed systems to analyze and manipulate data at scale. What this Book will help me do Master using PySpark to handle large and complex datasets efficiently and effectively. Develop skills to optimize Spark programs using best practices like reducing shuffle operations. Learn to set up a PySpark environment, process data from platforms like HDFS, Hive, and S3. Enhance your data analytics capabilities by implementing powerful SQL queries and data visualizations. Understand testing and debugging techniques to build reliable, production-quality data pipelines. Author(s) Authored by Rudy Lai and Bartłomiej Potaczek, both seasoned data engineers and authors in the big data field. Rudy and Bartłomiej bring their extensive experience working with distributed systems and scalable data architectures into this book. Their approach is hands-on, focusing on real-world applications and best practices. Who is it for? This book is tailored for data scientists, engineers, and developers eager to advance their big data analytics capabilities. Whether you're new to big data or experienced with other analytics frameworks, this book will equip you with practical knowledge to utilize PySpark for scalable data solutions.
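
One of the shuffle-reducing practices mentioned above, sketched in PySpark with toy data (not code from the book): reduceByKey pre-aggregates values on each partition before the shuffle, whereas groupByKey ships every value across the network first.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()
    sc = spark.sparkContext

    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

    # Heavier on the network: all values for a key are shuffled, then summed.
    grouped_sum = pairs.groupByKey().mapValues(sum)

    # Lighter: partial sums are computed per partition before the shuffle.
    reduced_sum = pairs.reduceByKey(lambda x, y: x + y)

    print(sorted(reduced_sum.collect()))  # [('a', 4), ('b', 6)]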

Integration of IBM Aspera Sync with IBM Spectrum Scale: Protecting and Sharing Files Globally

Economic globalization requires data to be available globally. With most data stored in file systems, solutions to make this data globally available become more important. Files that are in file systems can be protected or shared by replicating these files to another file system that is in a remote location. The remote location might be just around the corner or in a different country. Therefore, the techniques that are used to protect and share files must account for long distances and slow and unreliable wide area network (WAN) connections. IBM® Spectrum Scale is a scalable clustered file system that can be used to store all kinds of unstructured data. It provides open data access by way of Network File System (NFS); Server Message Block (SMB); POSIX; Object Storage APIs, such as S3 and OpenStack Swift; and the Hadoop Distributed File System (HDFS) for accessing and sharing data. The IBM Aspera® file transfer solution (IBM Aspera Sync) provides predictable and reliable data transfer across large distances for small and large files. The combination of both can be used for global sharing and protection of data. This IBM Redpaper™ publication describes how IBM Aspera Sync can be used to protect and share data that is stored in IBM Spectrum™ Scale file systems across large distances of several hundred to thousands of miles. We also explain the integration of IBM Aspera Sync with IBM Spectrum Scale™ and differentiate it from solutions that are built into IBM Spectrum Scale for protection and sharing. We also describe different use cases for IBM Aspera Sync with IBM Spectrum Scale.

Next-Generation Big Data: A Practical Guide to Apache Kudu, Impala, and Spark

Utilize this practical and easy-to-follow guide to modernize traditional enterprise data warehouse and business intelligence environments with next-generation big data technologies. Next-Generation Big Data takes a holistic approach, covering the most important aspects of modern enterprise big data. The book covers not only the main technology stack but also the next-generation tools and applications used for big data warehousing, data warehouse optimization, real-time and batch data ingestion and processing, real-time data visualization, big data governance, data wrangling, big data cloud deployments, and distributed in-memory big data computing. Finally, the book has an extensive and detailed coverage of big data case studies from Navistar, Cerner, British Telecom, Shopzilla, Thomson Reuters, and Mastercard. What You’ll Learn Install Apache Kudu, Impala, and Spark to modernize enterprise data warehouse and business intelligence environments, complete with real-world, easy-to-follow examples, and practical advice Integrate HBase, Solr, Oracle, SQL Server, MySQL, Flume, Kafka, HDFS, and Amazon S3 with Apache Kudu, Impala, and Spark Use StreamSets, Talend, Pentaho, and CDAP for real-time and batch data ingestion and processing Utilize Trifacta, Alteryx, and Datameer for data wrangling and interactive data processing Turbocharge Spark with Alluxio, a distributed in-memory storage platform Deploy big data in the cloud using Cloudera Director Perform real-time data visualization and time series analysis using Zoomdata, Apache Kudu, Impala, and Spark Understand enterprise big data topics such as big data governance, metadata management, data lineage, impact analysis, and policy enforcement, and how to use Cloudera Navigator to perform common data governance tasks Implement big data use cases such as big data warehousing, data warehouse optimization, Internet of Things, real-time data ingestion and analytics, complex event processing, and scalable predictive modeling Study real-world big data case studies from innovative companies, including Navistar, Cerner, British Telecom, Shopzilla, Thomson Reuters, and Mastercard Who This Book Is For BI and big data warehouse professionals interested in gaining practical and real-world insight into next-generation big data processing and analytics using Apache Kudu, Impala, and Spark; and those who want to learn more about other advanced enterprise topics
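
As a flavor of the Impala-side access covered in the book, here is a hedged sketch using the impyla client (the host, port, and table name are placeholders; the table is assumed to be Kudu-backed and already defined in Impala):

    from impala.dbapi import connect

    conn = connect(host="impala-coordinator.example.com", port=21050)
    cursor = conn.cursor()

    # Run an aggregate query against a (hypothetical) Kudu-backed Impala table.
    cursor.execute(
        "SELECT sensor_id, AVG(reading) AS avg_reading "
        "FROM telemetry GROUP BY sensor_id"
    )
    for row in cursor.fetchall():
        print(row)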

Enabling Hybrid Cloud Storage for IBM Spectrum Scale Using Transparent Cloud Tiering

This IBM® Redbooks® publication provides information to help you with the sizing, configuration, and monitoring of hybrid cloud solutions using the transparent cloud tiering (TCT) functionality of IBM Spectrum™ Scale. IBM Spectrum Scale™ is a scalable data, file, and object management solution that provides a global namespace for large data sets and several enterprise features. The IBM Spectrum Scale feature called transparent cloud tiering allows cloud object storage providers, such as IBM Cloud™ Object Storage, IBM Cloud, and Amazon S3, to be used as a storage tier for IBM Spectrum Scale. Transparent cloud tiering can help cut storage capital and operating costs by moving data that does not require local performance to an on-premise or off-premise cloud object storage provider. Transparent cloud tiering reduces the complexity of cloud object storage by making data transfers transparent to the user or application. This capability can help you adapt to a hybrid cloud deployment model where active data remains directly accessible to your applications and inactive data is placed in the correct cloud (private or public) automatically through IBM Spectrum Scale policies. This publication is intended for IT architects, IT administrators, storage administrators, and those wanting to learn more about sizing, configuration, and monitoring of hybrid cloud solutions using IBM Spectrum Scale and transparent cloud tiering.

IBM Spectrum Scale Best Practices for Genomics Medicine Workloads

Advancing the science of medicine by targeting a disease more precisely with treatment specific to each patient relies on access to that patient's genomics information and the ability to process massive amounts of genomics data quickly. Although genomics data is becoming a critical source for precision medicine, it is expected to create an expanding data ecosystem. Therefore, hospitals, genome centers, medical research centers, and other clinical institutes need to explore new methods of storing, accessing, securing, managing, sharing, and analyzing significant amounts of data. Healthcare and life sciences organizations that are running data-intensive genomics workloads on an IT infrastructure that lacks scalability, flexibility, performance, management, and cognitive capabilities also need to modernize and transform their infrastructure to support current and future requirements. IBM® offers an integrated solution for genomics that is based on composable infrastructure. This solution enables administrators to build an IT environment in a way that disaggregates the underlying compute, storage, and network resources. Such a composable, building-block-based solution for genomics addresses the most complex data management aspect and allows organizations to store, access, manage, and share huge volumes of genome sequencing data. IBM Spectrum™ Scale is software-defined storage that is used to manage storage and provide massive scale, a global namespace, and high-performance data access with many enterprise features. IBM Spectrum Scale™ is used in clustered environments, provides unified access to data via file protocols (POSIX, NFS, and SMB) and object protocols (Swift and S3), and supports analytic workloads via HDFS connectors. Deploying IBM Spectrum Scale and IBM Elastic Storage™ Server (IBM ESS) as a composable storage building block in a Genomics Next Generation Sequencing deployment offers key benefits of performance, scalability, analytics, and collaboration via multiple protocols. This IBM Redpaper™ publication describes a composable solution with detailed architecture definitions for storage, compute, and networking services for genomics next generation sequencing that enable solution architects to benefit from tried-and-tested deployments and to quickly plan and design an end-to-end infrastructure deployment. The preferred practices and fully tested recommendations described in this paper are derived from running the GATK Best Practices workflow from the Broad Institute. The scenarios provide all that is required, including ready-to-use configuration and tuning templates for the different building blocks (compute, network, and storage), that can enable simpler deployment and increase the level of assurance about performance for genomics workloads. The solution is designed to be elastic in nature, and the disaggregation of the building blocks allows IT administrators to easily and optimally configure the solution with maximum flexibility. The intended audience for this paper is technical decision makers, IT architects, deployment engineers, and administrators who are working in the healthcare domain and who are working on genomics-based workloads.