talk-data.com

Topic

storage-repositories

100 tagged

Activity Trend (2020-Q1 to 2026-Q1): peak of 1 per quarter

Activities

100 activities · Newest first

Storage Systems

Storage Systems: Organization, Performance, Coding, Reliability and Their Data Processing was motivated by the 1988 Redundant Array of Inexpensive/Independent Disks (RAID) proposal to replace large form factor mainframe disks with an array of commodity disks. Disk loads are balanced by striping data into strips, with one strip per disk, and storage reliability is enhanced via replication or erasure coding, which at best dedicates k strips per stripe to tolerate k disk failures. Flash memories have resulted in a paradigm shift with Solid State Drives (SSDs) replacing Hard Disk Drives (HDDs) for high performance applications. RAID and Flash have resulted in the emergence of new storage companies, namely EMC, NetApp, SanDisk, and Pure Storage, and a multibillion-dollar storage market. Key new conferences and publications are reviewed in this book.
The goal of the book is to expose students, researchers, and IT professionals to the more important developments in storage systems, while covering the evolution of storage technologies, traditional and novel databases, and novel sources of data. We describe several prototypes: FAWN at CMU, RAMCloud at Stanford, and Lightstore at MIT; Oracle's Exadata, AWS' Aurora, Alibaba's PolarDB, Fungible Data Center; and the author's paper designs for cloud storage, namely heterogeneous disk arrays and hierarchical RAID.
Surveys storage technologies and lists sources of data: measurements, text, audio, images, and video
Familiarizes with paradigms to improve performance: caching, prefetching, log-structured file systems, and merge-trees (LSMs)
Describes RAID organizations and analyzes their performance and reliability
Conserves storage via data compression, deduplication, compaction, and secures data via encryption
Specifies implications of storage technologies on performance and power consumption
Exemplifies database parallelism for big data, analytics, deep learning via multicore CPUs, GPUs, FPGAs, and ASICs, e.g., Google's Tensor Processing Units
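As a rough illustration of the striping and erasure-coding idea this blurb describes, here is a minimal Python sketch of the simplest case, k = 1, where one XOR parity strip per stripe tolerates one disk failure. It is not taken from the book; the function names (xor_strips, make_stripe, rebuild) and the zero-byte padding are assumptions made for illustration only.

```python
from functools import reduce


def xor_strips(strips):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))


def make_stripe(data, n_data_disks):
    """Split data into one strip per data disk and append one parity strip."""
    strip_len = -(-len(data) // n_data_disks)  # ceiling division
    # Assumption: pad with zero bytes so every strip has the same length.
    padded = data.ljust(strip_len * n_data_disks, b"\0")
    strips = [padded[i * strip_len:(i + 1) * strip_len] for i in range(n_data_disks)]
    return strips + [xor_strips(strips)]


def rebuild(stripe, failed_disk):
    """Recover the strip on the failed disk by XOR-ing the surviving strips."""
    survivors = [s for i, s in enumerate(stripe) if i != failed_disk]
    return xor_strips(survivors)


if __name__ == "__main__":
    stripe = make_stripe(b"storage systems", 3)  # 3 data strips + 1 parity strip
    lost = 1                                     # pretend disk 1 failed
    print(rebuild(stripe, lost) == stripe[lost])
```

Tolerating k > 1 failures needs a more general code (for example Reed-Solomon), which this sketch does not attempt.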

Data Lakes For Dummies

Take a dive into data lakes
“Data lakes” is the latest buzzword in the world of data storage, management, and analysis. Data Lakes For Dummies decodes and demystifies the concept and helps you get a straightforward answer to the question: “What exactly is a data lake and do I need one for my business?” Written for an audience of technology decision makers tasked with keeping up with the latest and greatest data options, this book provides the perfect introductory survey of these novel and growing features of the information landscape. It explains how they can help your business, what they can (and can’t) achieve, and what you need to do to create the lake that best suits your particular needs. With a minimum of jargon, prolific tech author and business intelligence consultant Alan Simon explains how data lakes differ from other data storage paradigms. Once you’ve got the background picture, he maps out ways you can add a data lake to your business systems; migrate existing information and switch on the fresh data supply; clean up the product; and open channels to the best intelligence software for interpreting what you’ve stored.
Understand and build data lake architecture
Store, clean, and synchronize new and existing data
Compare the best data lake vendors
Structure raw data and produce usable analytics
Whatever your business, data lakes are going to form ever more prominent parts of the information universe every business should have access to. Dive into this book to start exploring the deep competitive advantage they make possible, and make sure your business isn’t left standing on the shore.

Distributed Data Systems with Azure Databricks

In 'Distributed Data Systems with Azure Databricks', you will explore the capabilities of Microsoft Azure Databricks as a platform for building and managing big data pipelines. Learn how to process, transform, and analyze data at scale while developing expertise in training distributed machine learning models and integrating them into enterprise workflows.
What this book will help me do
Design and implement Extract, Transform, Load (ETL) pipelines using Azure Databricks.
Conduct distributed training of machine learning models using TensorFlow and Horovod.
Integrate Azure Databricks with Azure Data Factory for optimized data pipeline orchestration.
Utilize Delta Engine for efficient querying and analysis of data within Delta Lake.
Employ Databricks Structured Streaming to manage real-time production-grade data flows.
Author(s)
Palacio is an experienced data engineer and cloud computing specialist, with extensive knowledge of the Microsoft Azure platform. With years of practical application of Databricks in enterprise settings, Palacio provides clear, actionable insights through relatable examples. They bring a passion for innovative solutions to the field of big data automation.
Who is it for?
This book is ideal for data engineers, machine learning engineers, and software developers looking to master Azure Databricks for large-scale data processing and analysis. Readers should have basic familiarity with cloud platforms, understanding of data pipelines, and a foundational grasp of Python and machine learning concepts. It is perfect for those wanting to create scalable and manageable data workflows.

Automating the Modern Data Warehouse

The opportunity to modernize and improve the enterprise data warehouse is one of the best reasons for moving your application to the cloud. A data warehouse can access a greater diversity of use cases and practices than is possible in an existing environment. In this report, researcher and analyst Stephen Swoyer offers a comprehensive overview of the benefits and challenges of implementing a cloud-based data warehouse. Senior IT decision makers, chief data officers, and data professionals will learn about the shifts and new trends in the data management landscape. Explore ways to improve data management, build a data warehouse strategy, and learn how to modernize a data warehouse effectively.
Understand how AI, machine learning, self-service data integration, and built-in developer-oriented services have transformed the data warehouse role
Use data warehouses to work with cloud-based data lakes for end-to-end data management and data governance
Explore how data warehouse platforms as a service (PaaS) pave the way to automation
Migrate, manage, and secure a data warehouse in a hybrid or multicloud environment

What Is a Data Lake?

A revolution is occurring in data management regarding how data is collected, stored, processed, governed, managed, and provided to decision makers. The data lake is a popular approach that harnesses the power of big data and marries it with the agility of self-service. With this report, IT executives and data architects will focus on the technical aspects of building a data lake for your organization. Alex Gorelik from Facebook explains the requirements for building a successful data lake that business users can easily access whenever they have a need. You'll learn the phases of data lake maturity, common mistakes that lead to data swamps, and the importance of aligning data with your company's business strategy and gaining executive sponsorship. You'll explore:
The ingredients of modern data lakes, such as the use of different ingestion methods for different data formats, and the importance of the three Vs: volume, variety, and velocity
Building blocks of successful data lakes, including data ingestion, integration, persistence, data governance, and business intelligence and self-service analytics
State-of-the-art data lake architectures offered by Amazon Web Services, Microsoft Azure, and Google Cloud

Data Lake Analytics on Microsoft Azure: A Practitioner's Guide to Big Data Engineering

Get a 360-degree view of how the journey of data analytics solutions has evolved from monolithic data stores and enterprise data warehouses to data lakes and modern data warehouses. This book includes comprehensive coverage of how:
To architect data lake analytics solutions by choosing suitable technologies available on Microsoft Azure
The advent of microservices applications covering ecommerce or modern solutions built on IoT and how real-time streaming data has completely disrupted this ecosystem
These data analytics solutions have been transformed from solely understanding the trends from historical data to building predictions by infusing machine learning technologies into the solutions
Data platform professionals who have been working on relational data stores, non-relational data stores, and big data technologies will find the content in this book useful. The book also can help you start your journey into the data engineer world as it provides an overview of advanced data analytics and touches on data science concepts and various artificial intelligence and machine learning technologies available on Microsoft Azure.
What Will You Learn
You will understand the:
Concepts of data lake analytics, the modern data warehouse, and advanced data analytics
Architecture patterns of the modern data warehouse and advanced data analytics solutions
Phases (such as Data Ingestion, Store, Prep and Train, and Model and Serve) of data analytics solutions and technology choices available on Azure under each phase
In-depth coverage of real-time and batch mode data analytics solutions architecture
Various managed services available on Azure such as Synapse Analytics, Event Hubs, Stream Analytics, Cosmos DB, and managed Hadoop services such as Databricks and HDInsight
Who This Book Is For
Data platform professionals, database architects, engineers, and solution architects

Red Hat OpenShift on Public Cloud with IBM Block Storage

The purpose of this document is to show how to install Red Hat OpenShift Container Platform (OCP) on Amazon Web Services (AWS) public cloud with OpenShift installer, a method that is known as Installer-provisioned infrastructure (IPI). We also describe how to validate the installation of IBM container storage interface (CSI) driver on OCP 4.2 that is installed on AWS. This document also describes the installation of OCP 4.x on AWS with customization and OCP 4.x installation on IBM Cloud. This document discusses how to provision internet small computer system interface (iSCSI) storage that is made available by IBM Spectrum® Virtualize for Public Cloud (SVPC) that is deployed on AWS. Finally, the document discusses the use of Red Hat OpenShift command line interface (CLI), OCP web console graphical user interface (GUI), and AWS console.

Data Management at Scale

As data management and integration continue to evolve rapidly, storing all your data in one place, such as a data warehouse, is no longer scalable. In the very near future, data will need to be distributed and available for several technological solutions. With this practical book, you’ll learn how to migrate your enterprise from a complex and tightly coupled data landscape to a more flexible architecture ready for the modern world of data consumption. Executives, data architects, analytics teams, and compliance and governance staff will learn how to build a modern scalable data landscape using the Scaled Architecture, which you can introduce incrementally without a large upfront investment. Author Piethein Strengholt provides blueprints, principles, observations, best practices, and patterns to get you up to speed.
Examine data management trends, including technological developments, regulatory requirements, and privacy concerns
Go deep into the Scaled Architecture and learn how the pieces fit together
Explore data governance and data security, master data management, self-service data marketplaces, and the importance of metadata

Data Lakes

The concept of a data lake is less than 10 years old, but data lakes are already widely implemented within large companies. Their goal is to efficiently deal with ever-growing volumes of heterogeneous data, while also facing various sophisticated user needs. However, defining and building a data lake is still a challenge, as no consensus has been reached so far. Data Lakes presents recent outcomes and trends in the field of data repositories. The main topics discussed are the data-driven architecture of a data lake; the management of metadata – supplying key information about the stored data, master data and reference data; the roles of linked data and fog computing in a data lake ecosystem; and how gravity principles apply in the context of data lakes. A variety of case studies are also presented, thus providing the reader with practical examples of data lake management.

SQL Server Big Data Clusters: Data Virtualization, Data Lake, and AI Platform

Use this guide to one of SQL Server 2019’s most impactful features—Big Data Clusters. You will learn about data virtualization and data lakes for this complete artificial intelligence (AI) and machine learning (ML) platform within the SQL Server database engine. You will know how to use Big Data Clusters to combine large volumes of streaming data for analysis along with data stored in a traditional database. For example, you can stream large volumes of data from Apache Spark in real time while executing Transact-SQL queries to bring in relevant additional data from your corporate, SQL Server database. Filled with clear examples and use cases, this book provides everything necessary to get started working with Big Data Clusters in SQL Server 2019. You will learn about the architectural foundations that are made up from Kubernetes, Spark, HDFS, and SQL Server on Linux. You then are shown how to configure and deploy Big Data Clusters in on-premises environments or in the cloud. Next, you are taught about querying. You will learn to write queries in Transact-SQL—taking advantage of skills you have honed for years—and with those queries you will be able to examine and analyze data from a wide variety of sources such as Apache Spark. Through the theoretical foundation provided in this book and easy-to-follow example scripts and notebooks, you will be ready to use and unveil the full potential of SQL Server 2019: combining different types of data spread across widely disparate sources into a single view that is useful for business intelligence and machine learning analysis. 
What You Will Learn
Install, manage, and troubleshoot Big Data Clusters in cloud or on-premise environments
Analyze large volumes of data directly from SQL Server and/or Apache Spark
Manage data stored in HDFS from SQL Server as if it were relational data
Implement advanced analytics solutions through machine learning and AI
Expose different data sources as a single logical source using data virtualization
Who This Book Is For
Data engineers, data scientists, data architects, and database administrators who want to employ data virtualization and big data analytics in their environments

IBM TS7700 R5.0 Cloud Storage Tier Guide

Building on over 20 years of virtual tape experience, the TS7700 (TS7760, TS7770) now supports the ability to store virtual tape volumes in an object store. This IBM® Redpaper publication helps you set up and configure the cloud object storage support for IBM Cloud™ Object Storage (COS) or Amazon Simple Storage Service (Amazon S3). The TS7700 has supported offloading to physical tape for over two decades. Offloading to physical tape behind a TS7700 is used by hundreds of organizations around the world. By using the same hierarchical storage techniques, the TS7700 can also offload to object storage. Because object storage is cloud-based and accessible from different regions, the TS7700 Cloud Storage Tier support essentially allows the cloud to be an extension of the grid. In this IBM Redpaper publication, we provide a brief overview of cloud technology with an emphasis on Object Storage. Object Storage is used by a broad set of technologies, including those technologies that are exclusive to IBM Z®. The aim of this publication is to provide a basic understanding of cloud, Object Storage, and different ways it can be integrated into your environment. This Redpaper is intended for system architects and storage administrators with TS7700 experience who want to add the support of a Cloud Storage Tier to their TS7700 solution. Note: As of this writing, the TS7700C supports the ability to offload to on-premise cloud with IBM Cloud Object Storage and public cloud with Amazon S3.

Multicloud Storage as a Service using vRealize Automation and IBM Spectrum Storage

This document is intended to facilitate the deployment of the Multicloud Solution for Business Continuity and Storage as service by using IBM Spectrum Virtualize for Public Cloud on Amazon Web Services (AWS). To complete the tasks it describes, you must understand IBM FlashSystem 9100, IBM Spectrum Virtualize for Public Cloud, IBM Spectrum Connect, VMware vRealize Orchestrator, and vRealize Automation and AWS Cloud. The information in this document is distributed on an "as is" basis without any warranty that is either expressed or implied. Support assistance for the use of this material is limited to situations where IBM Storwize or IBM FlashSystem storage devices are supported and entitled and where the issues are specific to a blueprint implementation.

Building Big Data Applications

Building Big Data Applications helps data managers and their organizations make the most of unstructured data with an existing data warehouse. It provides readers with what they need to know to make sense of how Big Data fits into the world of Data Warehousing. Readers will learn about infrastructure options and integration and come away with a solid understanding of how to leverage various architectures for integration. The book includes a wide range of use cases that will help data managers visualize reference architectures in the context of specific industries (healthcare, big oil, transportation, software, etc.).
Explores various ways to leverage Big Data by effectively integrating it into the data warehouse
Includes real-world case studies which clearly demonstrate Big Data technologies
Provides insights on how to optimize current data warehouse infrastructure and integrate newer infrastructure matching data processing workloads and requirements

Data Warehousing with Greenplum, 2nd Edition

Data professionals are confronting the most disruptive change since relational databases appeared in the 1980s. SQL is still a major tool for data analytics, but conventional relational database management systems can’t handle the increasing size and complexity of today’s datasets. This updated edition teaches you best practices for Greenplum Database, the open source massively parallel processing (MPP) database that accommodates large sets of nonrelational and relational data. Marshall Presser, field CTO at Pivotal, introduces Greenplum’s approach to data analytics and data-driven decisions, beginning with its shared-nothing architecture. IT managers, developers, data analysts, system architects, and data scientists will all gain from exploring data organization and storage, data loading, running queries, and learning to perform analytics in the database. Discover how MPP and Greenplum will help you go beyond the traditional data warehouse. This ebook covers:
Greenplum features, use case examples, and techniques for optimizing use
Four Greenplum deployment options to help you balance security, cost, and time to usability
Why each networked node in Greenplum’s architecture includes an independent operating system, memory, and storage
Additional tools for monitoring, managing, securing, and optimizing query responses in the Pivotal Greenplum commercial database

Operationalizing the Data Lake

Big data and advanced analytics have increasingly moved to the cloud as organizations pursue actionable insights and data-driven products using the growing amounts of information they collect. But few companies have truly operationalized data so it’s usable for the entire organization. With this pragmatic ebook, engineers, architects, and data managers will learn how to build and extract value from a data lake in the cloud and leverage the compute power and scalability of a cloud-native data platform to put your company’s vast data trove into action. Holden Ackerman and Jon King of Qubole take you through the basics of building a data lake operation, from people to technology, employing multiple technologies and frameworks in a cloud-native data platform. You'll dive into the tools and processes you need for the entire lifecycle of a data lake, from data preparation, storage, and management to distributed computing and analytics. You’ll also explore the unique role that each member of your data team needs to play as you migrate to your cloud-native data platform.
Leverage your data effectively through a single source of truth
Understand the importance of building a self-service culture for your data lake
Define the structure you need to build a data lake in the cloud
Implement financial governance and data security policies for your data lake through a cloud-native data platform
Identify the tools you need to manage your data infrastructure
Delineate the scope, usage rights, and best tools for each team working with a data lake: analysts, data scientists, data engineers, and security professionals, among others

Multicloud Storage as a Service using vRealize Automation and IBM Spectrum Storage

This document is intended to facilitate the deployment of the Multicloud Solution for Business Continuity and Storage as service by using IBM Spectrum Virtualize for Public Cloud on Amazon Web Services (AWS). To complete the tasks it describes, you must understand IBM FlashSystem 9100, IBM Spectrum Virtualize for Public Cloud, IBM Spectrum Connect, VMware vRealize Orchestrator, and vRealize Automation and AWS Cloud. The information in this document is distributed on an "as is" basis without any warranty that is either expressed or implied. Support assistance for the use of this material is limited to situations where IBM Storwize or IBM FlashSystem storage devices are supported and entitled and where the issues are specific to a blueprint implementation.

Data Lake Maturity Model

Data is changing everything. Many industries today are being fundamentally transformed through the accumulation and analysis of large quantities of data, stored in diversified but flexible repositories known as data lakes. Whether your company has just begun to think about big data or has already initiated a strategy for handling it, this practical ebook shows you how to plan a successful data lake migration. You’ll learn the value of data lakes, their structure, and the problems they attempt to solve. Using Zaloni’s data lake maturity model, you’ll then explore your organization’s readiness for putting a data lake into action. Do you have the tools and data architectures to support big data analysis? Are your people and processes prepared? The data lake maturity model will help you rate your organization’s readiness. This report includes:
The structure and purpose of a data lake
Descriptive, predictive, and prescriptive analytics
Data lake curation, self-service, and the use of data lake zones
How to rate your organization using the data lake maturity model
A complete checklist to help you determine your strategic path forward

The Enterprise Big Data Lake

The data lake is a daring new approach for harnessing the power of big data technology and providing convenient self-service capabilities. But is it right for your company? This book is based on discussions with practitioners and executives from more than a hundred organizations, ranging from data-driven companies such as Google, LinkedIn, and Facebook, to governments and traditional corporate enterprises. You’ll learn what a data lake is, why enterprises need one, and how to build one successfully with the best practices in this book. Alex Gorelik, CTO and founder of Waterline Data, explains why old systems and processes can no longer support data needs in the enterprise. Then, in a collection of essays about data lake implementation, you’ll examine data lake initiatives, analytic projects, experiences, and best practices from data experts working in various industries.
Get a succinct introduction to data warehousing, big data, and data science
Learn various paths enterprises take to build a data lake
Explore how to build a self-service model and best practices for providing analysts access to the data
Use different methods for architecting your data lake
Discover ways to implement a data lake from experts in different industries

Data Where You Want It

Many organizations have begun to rethink the strategy of allowing regional teams to maintain independent databases that are periodically consolidated with the head office. As businesses extend their reach globally, these hierarchical approaches no longer work. Instead, an enterprise’s entire data infrastructure, including multiple types of data persistence, needs to be shared and updated everywhere at the same time with fine-grained control over who has access. This practical report examines the requirements and challenges of constructing a geo-distributed data platform, including examples of specific technologies designed to meet them. Authors Ted Dunning and Ellen Friedman also provide real-world use cases that show how low-latency geo-distribution of very large-scale data and computation provide a competitive edge. With this report, you’ll explore:
How replication and mirroring methods for data movement provide the large scale, low latency, and low cost that systems demand
The importance of multimaster replication of data streams and databases
Advantages (and disadvantages) of cloud neutrality, cloud bursting, and hybrid cloud architecture for transferring data
Why effective data governance is a complex process that requires the right tools for controlling and monitoring geo-distributed data
How to make containers work for geo-distributed data at scale, even where stateful applications are involved
Use cases that demonstrate how telecoms and online advertisers distribute large quantities of data

Streaming Change Data Capture

There are many benefits to becoming a data-driven organization, including the ability to accelerate and improve business decision accuracy through the real-time processing of transactions, social media streams, and IoT data. But those benefits require significant changes to your infrastructure. You need flexible architectures that can copy data to analytics platforms at near-zero latency while maintaining 100% production uptime. Fortunately, a solution already exists. This ebook demonstrates how change data capture (CDC) can meet the scalability, efficiency, real-time, and zero-impact requirements of modern data architectures. Kevin Petrie, Itamar Ankorion, and Dan Potter, technology marketing leaders at Attunity, explain how CDC enables faster and more accurate decisions based on current data and reduces or eliminates full reloads that disrupt production and efficiency. The book examines:
How CDC evolved from a niche feature of database replication software to a critical data architecture building block
Architectures where data workflow and analysis take place, and their integration points with CDC
How CDC identifies and captures source data updates to assist high-speed replication to one or more targets
Case studies on cloud-based streaming and streaming to a data lake and related architectures
Guiding principles for effectively implementing CDC in cloud, data lake, and streaming environments
The Attunity Replicate platform for efficiently loading data across all major database, data warehouse, cloud, streaming, and Hadoop platforms