talk-data.com

Topic: data-engineering (3395 tagged)

Activity Trend: [quarterly activity chart, 2020-Q1 through 2026-Q1]

Activities

3395 activities · Newest first

Getting DataOps Right
book
by Mark Marinelli (Tamr), Michael Stonebraker (Paradigm4; Advisor to VoltDB; former Adjunct Professor at MIT), Andy Palmer (Tamr), Nik Bates-Haus, and Liam Cleary

Many large organizations have accumulated dozens of disconnected data sources over the years to serve different lines of business. These applications might be useful to one area of the enterprise, but they're usually inaccessible to other data consumers in the organization. In this short report, five data industry thought leaders explore DataOps, the automated, process-oriented methodology for making clean, reliable data available to teams throughout your company. Andy Palmer, Michael Stonebraker, Nik Bates-Haus, Liam Cleary, and Mark Marinelli from Tamr use real-world examples to explain how DataOps works. DataOps is as much about changing people's relationship to data as it is about technology, infrastructure, and process. This report provides an organizational approach to implementing this discipline in your company, including various behavioral, process, and technology changes. Through individual essays, you'll learn how to:
- Move toward scalable data unification (Michael Stonebraker)
- Understand DataOps as a discipline (Nik Bates-Haus)
- Explore the key principles of a DataOps ecosystem (Andy Palmer)
- Learn the key components of a DataOps ecosystem (Andy Palmer)
- Build a DataOps toolkit (Liam Cleary)
- Build a team and prepare for future trends (Mark Marinelli)
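To make the "clean, reliable data" goal concrete, here is a minimal sketch of the kind of automated quality gate a DataOps pipeline might run before publishing data. The file name and validation rules are hypothetical, not from the report; a real pipeline would typically use a validation framework and alerting instead of exit codes.

```python
# Minimal sketch of an automated data-quality gate in the spirit of DataOps.
# The dataset and rules are hypothetical; real pipelines would use a
# framework (e.g., Great Expectations) and alerting rather than exit codes.

import csv
from datetime import datetime

def validate_rows(path):
    """Return a list of human-readable rule violations found in the CSV."""
    errors = []
    with open(path, newline="") as f:
        for i, row in enumerate(csv.DictReader(f), start=1):
            if not row.get("customer_id"):
                errors.append(f"row {i}: missing customer_id")
            try:
                datetime.fromisoformat(row.get("created_at", ""))
            except ValueError:
                errors.append(f"row {i}: bad created_at {row.get('created_at')!r}")
    return errors

if __name__ == "__main__":
    problems = validate_rows("customers.csv")
    # Fail the pipeline stage if any rule is violated; succeed otherwise.
    raise SystemExit("\n".join(problems) if problems else 0)
```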

Operationalizing the Data Lake

Big data and advanced analytics have increasingly moved to the cloud as organizations pursue actionable insights and data-driven products from the growing amounts of information they collect. But few companies have truly operationalized data so it's usable by the entire organization. With this pragmatic ebook, engineers, architects, and data managers will learn how to build and extract value from a data lake in the cloud, using the compute power and scalability of a cloud-native data platform to put your company's vast data trove into action. Holden Ackerman and Jon King of Qubole take you through the basics of building a data lake operation, from people to technology, employing multiple technologies and frameworks in a cloud-native data platform. You'll dive into the tools and processes you need for the entire lifecycle of a data lake, from data preparation, storage, and management to distributed computing and analytics. You'll also explore the unique role that each member of your data team plays as you migrate to your cloud-native data platform. You'll learn how to:
- Leverage your data effectively through a single source of truth
- Understand the importance of building a self-service culture for your data lake
- Define the structure you need to build a data lake in the cloud
- Implement financial governance and data security policies for your data lake through a cloud-native data platform
- Identify the tools you need to manage your data infrastructure
- Delineate the scope, usage rights, and best tools for each team working with a data lake: analysts, data scientists, data engineers, and security professionals, among others
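As one concrete illustration of defining a data lake's structure, the sketch below writes a Hive-style partitioned Parquet dataset with pyarrow. The schema and paths are made up, and a real lake would target object storage (an s3:// or equivalent URI) rather than a local directory.

```python
# A minimal sketch of the partitioned layout most cloud data lakes use,
# written locally with pyarrow. Column names and paths are illustrative;
# a real lake would point root_path at cloud object storage.

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "event_date": ["2023-01-01", "2023-01-01", "2023-01-02"],
    "user_id": [1, 2, 3],
    "amount": [9.99, 4.50, 12.00],
})

# Hive-style partitioning (event_date=.../part-*.parquet) lets engines
# such as Spark, Presto, or Hive prune partitions at query time.
pq.write_to_dataset(table, root_path="lake/events", partition_cols=["event_date"])
```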

Rebuilding Reliable Data Pipelines Through Modern Tools

When data-driven applications fail, identifying the cause is both challenging and time-consuming, especially as data pipelines become more and more complex. Hunting for the root cause of an application failure in messy, raw, and distributed logs is difficult for performance experts and a nightmare for data operations teams. This report examines DataOps processes and tools that enable you to manage modern data pipelines efficiently. Author Ted Malaska describes a data operations framework and shows the importance of testing and monitoring as you plan, rebuild, automate, and then manage robust data pipelines, whether in the cloud, on premises, or in a hybrid configuration. You'll also learn ways to apply performance monitoring software and AI to your data pipelines to keep your applications running reliably. You'll learn:
- How performance management software can reduce the risk of running modern data applications
- Methods for applying AI to provide insights, recommendations, and automation for operationalizing big data systems and data applications
- How to plan, migrate, and operate big data workloads and data pipelines in the cloud and in hybrid deployment models
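In the spirit of the testing and monitoring the report advocates, the following sketch wraps each pipeline stage with timing and row-count checks so a failure surfaces at the stage that caused it rather than in downstream application logs. The stage names and thresholds are illustrative, not taken from the report.

```python
# Sketch of per-stage pipeline monitoring: each stage logs its timing and
# row counts, and raises immediately if its output looks wrong, so root
# cause analysis starts at the failing stage. Thresholds are illustrative.

import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def monitored(stage_name, min_rows=1):
    def wrap(fn):
        def inner(rows):
            start = time.monotonic()
            out = fn(rows)
            log.info("%s: %d -> %d rows in %.2fs",
                     stage_name, len(rows), len(out), time.monotonic() - start)
            if len(out) < min_rows:
                raise RuntimeError(f"{stage_name} produced too few rows")
            return out
        return inner
    return wrap

@monitored("clean", min_rows=1)
def clean(rows):
    # Drop records with no usable key.
    return [r for r in rows if r.get("id") is not None]

print(clean([{"id": 1}, {"id": None}]))  # logs 'clean: 2 -> 1 rows ...'
```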

Professional Azure SQL Database Administration - Second Edition

Professional Azure SQL Database Administration is a comprehensive guide to managing and optimizing cloud-based Azure SQL Database solutions. Covering the differences and unique features of Azure SQL Database compared to on-premises SQL Server, this book offers a clear roadmap to efficiently migrate, secure, scale, and maintain these databases in the cloud.
What this book will help me do:
- Understand the differences between Azure SQL Database and on-premises SQL Server and their practical implications.
- Learn techniques to migrate existing SQL Server databases to Azure SQL Database seamlessly.
- Discover advanced ways to optimize database performance and scalability by leveraging cloud capabilities.
- Master security strategies for Azure SQL databases, including backup, disaster recovery, and automated tasks.
- Develop proficiency in using tools such as PowerShell to automate and manage routine database administration tasks.
Author(s): Ahmad Osama is an experienced database professional and author specializing in SQL Server and Azure SQL Database administration. With a robust background in database migration, maintenance, and performance tuning, Ahmad expertly bridges the gap between theory and practice. His approachable writing style makes complex database topics accessible to professionals seeking to expand their expertise.
Who is it for? Professional Azure SQL Database Administration is an essential resource for database administrators, developers, and IT professionals keen to develop their knowledge of Azure SQL Database administration and cloud database solutions. Whether you're transitioning from traditional SQL Server environments or looking to optimize your database strategies in the cloud, this book caters to professionals with intermediate to advanced experience in database management and SQL programming.
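The book's automation examples center on PowerShell; as a rough Python equivalent (used for all examples in this listing), a hedged connectivity check against an Azure SQL database might look like the following. The server, database, and credentials are placeholders.

```python
# A minimal connectivity check against an Azure SQL database using pyodbc.
# Server, database, and credentials are placeholders; Azure SQL requires
# encrypted connections, which ODBC Driver 18 enforces by default.

import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:myserver.database.windows.net,1433;"
    "Database=mydb;Uid=myuser;Pwd=mypassword;Encrypt=yes;"
)
row = conn.cursor().execute("SELECT @@VERSION").fetchone()
print(row[0])  # Azure SQL Database reports 'Microsoft SQL Azure ...'
conn.close()
```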

IBM Spectrum Virtualize: Hot-Spare Node and NPIV Target Ports

The use of N_Port ID Virtualization (NPIV) to provide host-only ports (NPIV target ports) and spare nodes improves host failover characteristics by separating host communications from other communication tasks on the same ports and by providing standby hardware that can be automatically introduced into the cluster to restore redundancy. Because the host ports are not used for internode communications, they can move freely between nodes, including spare nodes that are added to the cluster automatically. This IBM® Redpaper™ publication describes the use of the IBM Spectrum™ Virtualize Hot-Spare Node function to provide a highly available storage infrastructure. The paper focuses on the functional behavior of a hot-spare node when it is subjected to various failure conditions. It does not provide the details necessary to implement the reference architectures, although some implementation detail is included.

IBM Spectrum Scale: Big Data and Analytics Solution Brief

This IBM® Redguide™ publication describes big data and analytics deployments that are built on IBM Spectrum Scale™. IBM Spectrum Scale is a proven enterprise-level distributed file system that offers a high-performance and cost-effective alternative to the Hadoop Distributed File System (HDFS) for Hadoop analytics services. IBM Spectrum Scale includes NFS, SMB, and Object services and meets the performance required by many industry workloads, such as technical computing, big data, analytics, and content management. IBM Spectrum Scale provides world-class, web-based storage management with extreme scalability, flash-accelerated performance, and automatic policy-based storage tiering from flash through disk to the cloud, which can reduce storage costs by up to 90% while improving security and management efficiency in cloud, big data, and analytics environments. This Redguide publication is intended for technical professionals (analytics consultants, technical support staff, IT architects, and IT specialists) who are responsible for providing Hadoop analytics services and are interested in the benefits of using IBM Spectrum Scale as an alternative to HDFS.
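One practical consequence of using Spectrum Scale instead of HDFS is that it presents a POSIX file system, so analytics code can read data directly from the cluster mount point without HDFS client libraries. A minimal sketch, assuming a hypothetical mount path:

```python
# Because IBM Spectrum Scale is a POSIX file system, analytics code can
# read data directly from the cluster mount, without HDFS client APIs.
# The mount point and file layout below are assumptions for illustration.

from pathlib import Path

SCALE_MOUNT = Path("/gpfs/analytics")  # site-specific mount point

total = 0
for part in sorted(SCALE_MOUNT.glob("events/*.csv")):
    with part.open() as f:
        total += sum(1 for _ in f) - 1  # subtract the header line per file
print(f"rows across all partitions: {total}")
```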

SAP HANA on IBM Power Systems: High Availability and Disaster Recovery Implementation Updates

This IBM® Redbooks® publication updates Implementing High Availability and Disaster Recovery Solutions with SAP HANA on IBM Power Systems, REDP-5443, with the latest technical content describing how to implement an SAP HANA on IBM Power Systems™ high availability (HA) and disaster recovery (DR) solution, using both theoretical knowledge and sample scenarios. This book describes how all the pieces of the reference architecture work together (IBM Power Systems servers, IBM Storage servers, IBM Spectrum™ Scale, IBM PowerHA® SystemMirror® for Linux, IBM VM Recovery Manager DR for Power Systems, and Linux distributions) and demonstrates the resilience of SAP HANA on IBM Power Systems servers. This publication is for architects, brand specialists, distributors, resellers, and anyone developing and implementing SAP HANA on IBM Power Systems integration, automation, HA, and DR solutions. It provides documentation to transfer how-to skills to technical teams, as well as supporting documentation for sales teams.

IBM Personal Communications and IBM z/OS TTLS Enablement: Technical Enablement Series

This document completes the task of introducing Transport Layer Security (TLS) to z/OS® so that IBM Personal Communications (PCOMM) uses TLS security. It walks you through enabling Tunneled Transport Layer Security (TTLS) on IBM z/OS for use with a PCOMM TN3270 connection. When you complete this task, a certificate is required to access your TN3270 PCOMM session. You work with the following products and components: TN3270, TCPIP, PAGENT, INET (possibly), and IBM RACF®. This document assumes that the reader has extensive knowledge of z/OS security administration and of these products and components. It is part of the Technical Enablement Series created at the IBM Client Experience Centers.
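After enablement, one hedged way to confirm that TLS is active on the TN3270 port is to attempt a handshake from a client machine. This probe is not part of the documented procedure; the host name is a placeholder, and 992 is only the conventional secure TN3270 port.

```python
# Probe a TN3270 port to confirm a TLS handshake succeeds after
# enablement. Host and port are placeholders; this is a diagnostic
# sketch, not part of the z/OS configuration itself.

import socket
import ssl

HOST, PORT = "zos.example.com", 992

ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ctx.check_hostname = False          # probe only; do not disable in production
ctx.verify_mode = ssl.CERT_NONE
# If the policy requires a client certificate, load one first, e.g.:
# ctx.load_cert_chain("client.pem")

with socket.create_connection((HOST, PORT), timeout=10) as sock:
    with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
        print("negotiated:", tls.version(), tls.cipher())
```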

IBM FlashSystem 900 Model AE3 Product Guide

Today's global organizations depend on the ability to unlock business insights from massive volumes of data. Now, with IBM® FlashSystem 900 Model AE3, they can make faster decisions based on real-time insights and unleash the power of demanding applications, including:
- Online transaction processing (OLTP) and analytical databases
- Virtual desktop infrastructures (VDIs)
- Technical computing applications
- Cloud environments
Easy to deploy and manage, IBM FlashSystem® 900 Model AE3 is designed to accelerate the applications that drive your business. Powered by IBM FlashCore® technology, FlashSystem 900 Model AE3 helps you:
- Accelerate business-critical workloads, real-time analytics, and cognitive applications with the consistent microsecond latency and extreme reliability of IBM FlashCore technology
- Improve performance and help lower cost with new inline data compression
- Help reduce capital and operational expenses with IBM-enhanced 3D triple-level cell (3D TLC) flash
- Protect critical data assets with patented IBM Variable Stripe RAID™
- Power faster insights with IBM FlashCore technology, including a hardware-accelerated nonvolatile memory (NVM) architecture, purpose-engineered IBM MicroLatency® modules, and advanced flash management
FlashSystem 900 Model AE3 can be configured with usable capacities from 14.4 TB to 180 TB, and up to 360 TB of effective capacity after RAID 5 protection and compression. You can couple this product with 8 Gbps or 16 Gbps Fibre Channel, 16 Gbps NVMe over Fibre Channel, or 40 Gbps InfiniBand connectivity. Thus, IBM FlashSystem 900 Model AE3 provides extreme performance for existing and next-generation infrastructure.

Implementing the IBM System Storage SAN Volume Controller with IBM Spectrum Virtualize V8.2.1

This IBM® Redbooks® publication is a detailed technical guide to the IBM System Storage® SAN Volume Controller (SVC), which is powered by IBM Spectrum™ Virtualize V8.2.1. IBM SAN Volume Controller is a virtualization appliance solution that maps virtualized volumes that are visible to hosts and applications to physical volumes on storage devices. Each server within the storage area network (SAN) has its own set of virtual storage addresses that are mapped to physical addresses. If the physical addresses change, the server continues running by using the same virtual addresses that it had before. Therefore, volumes or storage can be added or moved while the server is still running. The IBM virtualization technology improves the management of information at the block level in a network, which enables applications and servers to share storage devices on a network.
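The virtual-to-physical mapping idea can be illustrated with a toy model: the host-visible virtual address stays fixed while the controller remaps storage underneath it. This is a conceptual sketch only, not SVC's actual data structures or terminology beyond the extent concept.

```python
# Toy illustration of block virtualization: hosts address a stable
# virtual volume while the controller remaps its extents to different
# physical storage underneath. Conceptual only; not SVC's internals.

class VirtualVolume:
    def __init__(self, name, extents):
        self.name = name
        # extent index -> (physical device, physical extent)
        self.map = dict(enumerate(extents))

    def read(self, extent):
        device, phys = self.map[extent]
        return f"{self.name}[{extent}] -> {device}:{phys}"

    def migrate(self, extent, device, phys):
        # Move one extent to new physical storage; the host-visible
        # address (volume name, extent index) never changes.
        self.map[extent] = (device, phys)

vol = VirtualVolume("vdisk0", [("arrayA", 17), ("arrayA", 18)])
print(vol.read(0))           # vdisk0[0] -> arrayA:17
vol.migrate(0, "arrayB", 3)  # transparent to the running host
print(vol.read(0))           # vdisk0[0] -> arrayB:3
```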

IBM Hybrid Solution for Scalable Data Solutions using IBM Spectrum Scale

This document is intended to facilitate the deployment of the scalable hybrid cloud solution for data agility and collaboration using IBM® Spectrum Scale across multiple public clouds. To complete the tasks it describes, you must understand IBM Spectrum Scale and IBM Spectrum Scale Active File Management (AFM). The information in this document is distributed on an "as is" basis without any warranty that is either expressed or implied. Support assistance for the use of this material is limited to situations where IBM Spectrum Scale or IBM Spectrum Scale Active File Management are supported and entitled, and where the issues are specific to a blueprint implementation.

Big Data Simplified
"Big Data Simplified blends technology with strategy and delves into applications of big data in specialized areas, such as recommendation engines, data science and Internet of Things (IoT) and enables a practitioner to make the right technology choice. The steps to strategize a big data implementation are also discussed in detail. This book presents a holistic approach to the topic, covering a wide landscape of big

data technologies like Hadoop 2.0 and package implementations, such as Cloudera. In-depth discussion of associated technologies, such as MapReduce, Hive, Pig, Oozie, ApacheZookeeper, Flume, Kafka, Spark, Python and NoSQL databases like Cassandra, MongoDB, GraphDB, etc., is also included.
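To ground the MapReduce reference, here is the classic word count expressed in plain Python. Frameworks like Hadoop MapReduce and Spark distribute exactly this map-then-reduce pattern across a cluster; the input text is made up.

```python
# The canonical MapReduce word count in plain Python: a map step emits
# (word, 1) pairs and a reduce step sums the counts per key. Big data
# frameworks distribute these same two steps across many machines.

from collections import defaultdict

documents = ["big data made simple", "big data at scale"]

# Map: each document emits (word, 1) pairs.
pairs = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle/Reduce: group by key and sum the counts.
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))  # {'big': 2, 'data': 2, 'made': 1, 'simple': 1, 'at': 1, 'scale': 1}
```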

Multicloud Storage as a Service using vRealize Automation and IBM Spectrum Storage

This document is intended to facilitate the deployment of the Multicloud Solution for Business Continuity and Storage as a Service by using IBM Spectrum Virtualize for Public Cloud on Amazon Web Services (AWS). To complete the tasks it describes, you must understand IBM FlashSystem 9100, IBM Spectrum Virtualize for Public Cloud, IBM Spectrum Connect, VMware vRealize Orchestrator and vRealize Automation, and the AWS Cloud. The information in this document is distributed on an "as is" basis without any warranty that is either expressed or implied. Support assistance for the use of this material is limited to situations where IBM Storwize or IBM FlashSystem storage devices are supported and entitled, and where the issues are specific to a blueprint implementation.

Streaming Data

Managers and staff responsible for planning, hiring, and allocating resources need to understand how streaming data can fundamentally change their organizations. Companies everywhere are disrupting business, government, and society by using data and analytics to shape how they operate. Even if you don't have deep knowledge of programming or digital technology, this high-level introduction brings data streaming into focus. You won't find math or programming details here, or recommendations for particular tools in this rapidly evolving space. But you will explore the decision-making technologies and practices that organizations need to process streaming data and respond to fast-changing events. By describing the principles and activities behind this new phenomenon, author Andy Oram shows you how streaming data provides hidden gems of information that can transform the way your business works.
- Learn where streaming data comes from and how companies put it to work
- Follow a simple data processing project from ingesting and analyzing data to presenting results
- Explore how (and why) big data processing tools have evolved from MapReduce to Kubernetes
- Understand why streaming data is particularly useful for machine learning projects
- Learn how containers, microservices, and cloud computing led to continuous integration and DevOps
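Although the book itself deliberately avoids code, the core idea of stream processing fits in a few lines: events arrive one at a time and the consumer keeps incremental state instead of re-scanning a finished dataset. A minimal sketch with made-up sensor values:

```python
# A minimal picture of stream processing: events are consumed one at a
# time and the result (here, a running average) is updated incrementally
# after every event. The sensor readings are invented for illustration.

def running_average(stream):
    count, total = 0, 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count   # updated result after each event arrives

sensor_readings = iter([21.0, 22.5, 19.5, 20.0])
for avg in running_average(sensor_readings):
    print(f"average so far: {avg:.2f}")
```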

Deep Learning for Search

Deep Learning for Search teaches you how to improve the effectiveness of your search by implementing neural network-based techniques. By the time you're finished with the book, you'll be ready to build amazing search engines that deliver the results your users need and that get better as time goes on!
About the Technology: Deep learning handles the toughest search challenges, including imprecise search terms, badly indexed data, and retrieving images with minimal metadata. And with modern tools like DL4J and TensorFlow, you can apply powerful DL techniques without a deep background in data science or natural language processing (NLP). This book will show you how.
About the Book: Deep Learning for Search teaches you to improve your search results with neural networks. You'll review how DL relates to search basics like indexing and ranking. Then, you'll walk through in-depth examples to upgrade your search with DL techniques using Apache Lucene and Deeplearning4j. As the book progresses, you'll explore advanced topics like searching through images, translating user queries, and designing search engines that improve as they learn!
What's Inside:
- Accurate and relevant rankings
- Searching across languages
- Content-based image search
- Search with recommendations
About the Reader: For developers comfortable with Java or a similar language and with search basics. No experience with deep learning or NLP needed.
About the Author: Tommaso Teofili is a software engineer with a passion for open source and machine learning. As a member of the Apache Software Foundation, he contributes to a number of open source projects, ranging from information retrieval (such as Lucene and Solr) to natural language processing and machine translation (including OpenNLP, Joshua, and UIMA). He currently works at Adobe, developing search and indexing infrastructure components and researching natural language processing, information retrieval, and deep learning. He has presented search and machine learning talks at conferences including BerlinBuzzwords, the International Conference on Computational Science, ApacheCon, and EclipseCon. You can find him on Twitter at @tteofili.
Quotes:
- "A practical approach that shows you the state of the art in using neural networks, AI, and deep learning in the development of search engines." (From the Foreword by Chris Mattmann, NASA JPL)
- "A thorough and thoughtful synthesis of traditional search and the latest advancements in deep learning." (Greg Zanotti, Marquette Partners)
- "A well-laid-out deep dive into the latest technologies that will take your search engine to the next level." (Andrew Wyllie, Thynk Health)
- "Hands-on exercises teach you how to master deep learning for search-based products." (Antonio Magnaghi, System1)
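A rough sketch of the ranking idea behind neural search: represent queries and documents as vectors and order results by similarity. Real systems learn these vectors with a neural network (the book uses Lucene and Deeplearning4j); the tiny hand-made vectors below are purely illustrative.

```python
# Sketch of vector-based ranking, the core idea behind neural search:
# score each document by cosine similarity to the query vector. In
# practice the vectors come from a trained encoder; these are invented.

import numpy as np

docs = {
    "doc1: latency tuning for search engines":    np.array([0.9, 0.1, 0.0]),
    "doc2: image retrieval with sparse metadata": np.array([0.1, 0.8, 0.3]),
    "doc3: cross-language query translation":     np.array([0.0, 0.2, 0.9]),
}
query = np.array([0.2, 0.7, 0.4])  # pretend this came from an encoder

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank documents by descending similarity to the query.
for title, vec in sorted(docs.items(), key=lambda kv: -cosine(query, kv[1])):
    print(f"{cosine(query, vec):.3f}  {title}")
```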

IBM Storage Solutions for Blockchain Platform Version 1.2

This Blueprint is intended to define the infrastructure that is required for a blockchain remote peer and to facilitate the deployment of IBM Blockchain Platform on IBM Cloud Private using that infrastructure. This infrastructure includes the necessary document handler components, such as IBM Blockchain Document Store, and covers the required storage for on-chain and off-chain blockchain data. To complete these tasks, you must have a basic understanding of each of the components used, or have access to the appropriate educational material to gain that knowledge.

IBM FlashSystem A9000 Product Guide (Version 12.3.2)

This IBM® Redbooks® Product Guide is an overview of the main characteristics, features, and technology used in IBM FlashSystem® A9000 Model 425 with IBM FlashSystem A9000 Software V12.3.2. Software version 12.3.2, with Hyper-Scale Manager version 5.6 or later, introduces support for VLAN tagging and port trunking. The IBM FlashSystem A9000 storage system uses IBM FlashCore® technology to deliver higher capacity and improved response times over disk-based systems and other competing flash and solid-state drive (SSD)-based storage. The extreme performance of IBM FlashCore technology, a grid architecture, and comprehensive data reduction combine into one powerful solution. Whether you are a service provider that requires highly efficient management or an enterprise implementing cloud on a budget, FlashSystem A9000 provides consistent and predictable microsecond response times and the simplicity that you need. The A9000 features always-on data reduction and now offers intelligent capacity management for deduplication. As a cloud-optimized solution, FlashSystem A9000 suits the requirements of public and private cloud providers that require features such as inline data deduplication, multi-tenancy, and quality of service. It also uses powerful software-defined storage capabilities from IBM Spectrum™ Accelerate, such as Hyper-Scale technology, VMware integration, and storage container integration.

IBM FlashSystem A9000R Product Guide (Version 12.3.2)

This IBM® Redbooks® Product Guide is an overview of the main characteristics, features, and technology used in IBM FlashSystem® A9000R Model 415 and Model 425 with IBM FlashSystem A9000R Software V12.3.2. Software version 12.3.2, with Hyper-Scale Manager version 5.6 or later, introduces support for VLAN tagging and port trunking. IBM FlashSystem A9000R is a grid-scale, all-flash storage platform designed for industry leaders with rapidly growing cloud storage and mixed workload environments, helping drive your business into the cognitive era. FlashSystem A9000R provides consistent, extreme performance for dynamic data at scale, integrating the microsecond latency and high availability of IBM FlashCore® technology. The rack-based offering comes integrated with the world-class software features built into IBM Spectrum™ Accelerate. For example, comprehensive data reduction, including inline pattern removal, data deduplication, and compression, helps lower total cost of ownership (TCO), while the grid architecture and IBM Hyper-Scale framework simplify and automate storage administration. The A9000R features always-on data reduction and now offers intelligent capacity management for deduplication. Ready for the cloud and well suited for large deployments, FlashSystem A9000R delivers predictable high performance and ultra-low latency, even under heavy workloads with full data reduction enabled. The grid-scale architecture maintains this performance by automatically self-optimizing workloads across all storage resources without manual intervention.

Managing Your Data Science Projects: Learn Salesmanship, Presentation, and Maintenance of Completed Models

At first glance, the skills required to work in the data science field appear to be self-explanatory. Do not be fooled. Impactful data science demands interdisciplinary knowledge of business philosophy, project management, salesmanship, presentation, and more. In Managing Your Data Science Projects, author Robert de Graaf explores important concepts that are frequently overlooked in much of the instructional literature available to data scientists new to the field. If your completed models are to be used and maintained most effectively, you must be able to present and sell them within your organization in a compelling way. The value of data science within an organization cannot be overstated, so it is vital that strategies and communication between teams are dexterously managed. The three main ways that data science strategy is used in a company are to research its customers, assess risk analytics, and log operational measurements. These all require different managerial instincts, backgrounds, and experiences, and de Graaf cogently breaks down the unique reasons behind each. They must align seamlessly to eventually be adopted as dynamic models. Data science is a relatively new discipline, and as such, its internal processes are not as well developed within an operational business as those of other disciplines. With Managing Your Data Science Projects, you will learn how to create products that solve important problems for your customers and ensure that the initial success is sustained throughout the product's intended life. Your users will trust you and your models, and most importantly, you will be a more well-rounded and effective data scientist throughout your career.
Who This Book Is For: Early-career data scientists, managers of data scientists, and those interested in entering the field of data science.