O'Reilly Data Engineering Books

IBM Power Systems Enterprise AI Solutions

2019-09-25 O'Reilly Amazon

book

Scott Vetter , Andrew Laidlaw , Marcos Quezada , Glen Corneau

data data-engineering IBM ibm-power-systems AI/ML Analytics

This IBM® Redpaper publication helps the line of business (LOB), data science, and information technology (IT) teams develop an information architecture (IA) for their enterprise artificial intelligence (AI) environment. It describes the challenges that are faced by the three roles when creating and deploying enterprise AI solutions, and how they can collaborate for best results. This publication also highlights the capabilities of the IBM Cognitive Systems and AI solutions: IBM Watson® Machine Learning Community Edition, IBM Watson Machine Learning Accelerator (WMLA), IBM PowerAI Vision, IBM Watson Machine Learning, IBM Watson Studio Local, IBM Video Analytics, H2O Driverless AI, IBM Spectrum® Scale, and IBM Spectrum Discover. This publication examines the challenges through five different use case examples: artificial vision, natural language processing (NLP), planning for the future, machine learning (ML), and AI teaming and collaboration. This publication targets readers from LOBs, data science teams, and IT departments, and anyone who is interested in understanding how to build an IA to support enterprise AI development and deployment.

IBM Reference Architecture for High Performance Data and AI in Healthcare and Life Sciences

2019-09-08 O'Reilly Amazon

book

Dino Quintero , Frank N. Lee

data data-engineering IBM AI/ML Analytics Big Data

This IBM® Redpaper publication provides an update to the original description of IBM Reference Architecture for Genomics. This paper expands the reference architecture to cover all of the major vertical areas of healthcare and life sciences industries, such as genomics, imaging, and clinical and translational research. The architecture was renamed IBM Reference Architecture for High Performance Data and AI in Healthcare and Life Sciences to reflect the fact that it incorporates key building blocks for high-performance computing (HPC) and software-defined storage, and that it supports an expanding infrastructure of leading industry partners, platforms, and frameworks. The reference architecture defines a highly flexible, scalable, and cost-effective platform for accessing, managing, storing, sharing, integrating, and analyzing big data, which can be deployed on-premises, in the cloud, or as a hybrid of the two. IT organizations can use the reference architecture as a high-level guide for overcoming data management challenges and processing bottlenecks that are frequently encountered in personalized healthcare initiatives, and in compute-intensive and data-intensive biomedical workloads. This reference architecture also provides a framework and context for modern healthcare and life sciences institutions to adopt cutting-edge technologies, such as cognitive life sciences solutions, machine learning and deep learning, Spark for analytics, and cloud computing. To illustrate these points, this paper includes case studies describing how clients and IBM Business Partners alike used the reference architecture in the deployments of demanding infrastructures for precision medicine. This publication targets technical professionals (consultants, technical support staff, IT Architects, and IT Specialists) who are responsible for providing life sciences solutions and support.

Learn PySpark: Build Python-based Machine Learning and Deep Learning Models

2019-09-06 O'Reilly Amazon

book

Pramod Singh

data data-engineering apache-spark PySpark AI/ML Airflow

Leverage machine and deep learning models to build applications on real-time data using PySpark. This book is perfect for those who want to learn to use this language to perform exploratory data analysis and solve an array of business challenges. You'll start by reviewing PySpark fundamentals, such as Spark’s core architecture, and see how to use PySpark for big data processing tasks such as data ingestion, cleaning, and transformation. This is followed by building workflows for analyzing streaming data using PySpark and a comparison of various streaming platforms. You'll then see how to schedule different Spark jobs using Airflow with PySpark and examine how to tune machine and deep learning models for real-time predictions. The book concludes with a discussion of graph frames and performing network analysis using graph algorithms in PySpark. All the code presented in the book will be available in Python scripts on GitHub. What You'll Learn: Develop pipelines for streaming data processing using PySpark. Build machine learning and deep learning models using PySpark's latest offerings. Use graph analytics with PySpark. Create sequence embeddings from text data. Who This Book Is For: Data scientists and machine learning and deep learning engineers who want to learn and use PySpark for real-time analysis on streaming data.

Real-Time Data Analytics for Large Scale Sensor Data

2019-08-31 O'Reilly Amazon

book

Himansu Das , Nilanjan Dey , Valentina Emilia Balas

data data-engineering streaming-messaging real-time-analytics Analytics Data Analytics

Real-Time Data Analytics for Large-Scale Sensor Data covers the theory and applications of hardware platforms and architectures, the development of software methods, techniques and tools, applications, governance and adoption strategies for the use of massive sensor data in real-time data analytics. It presents the leading-edge research in the field and identifies future challenges in this fledgling research area. The book captures the essence of real-time IoT-based solutions that require a multidisciplinary approach for catering to on-the-fly processing, including methods for high-performance stream processing, adaptive streaming adjustment, uncertainty handling, latency handling, and more. The book examines IoT applications, the design of real-time intelligent systems, and how to manage the rapid growth of the large volume of sensor data. It discusses intelligent management systems for applications such as healthcare, robotics and environment modeling. It provides a focused approach towards the design and implementation of real-time intelligent systems for the management of sensor data in large-scale environments.

Advanced Elasticsearch 7.0

2019-08-23 O'Reilly Amazon

book

Wai Tak Wong

data data-engineering search elasticsearch AI/ML Analytics

Dive deep into the advanced capabilities of Elasticsearch 7.0 with this expert-level guide. In this book, you will explore the most effective techniques and tools for building, indexing, and querying advanced distributed search engines. Whether optimizing performance, scaling applications, or integrating with big data analytics, this guide empowers you with practical skills and insights. What this Book will help me do: Master ingestion pipelines and preprocess documents for faster and more efficient indexing. Model search data optimally for complex and varied real-world applications. Perform exploratory data analyses using Elasticsearch's robust features. Integrate Elasticsearch with modern analytics platforms like Kibana and Logstash. Leverage Elasticsearch with Apache Spark and machine learning libraries for real-time advanced analytics. Author(s): Wai Tak Wong is a seasoned Elasticsearch expert with years of real-world experience developing enterprise-grade search and analytics systems. With a passion for innovation and teaching, Wong enjoys breaking down complex technical concepts into digestible learning experiences. His work reflects a pragmatic and results-driven approach to teaching Elasticsearch. Who is it for? This book is ideal for Elasticsearch developers and data engineers with some prior experience who are looking to elevate their skills to an advanced level. It suits professionals seeking to enhance their expertise in building scalable search and analytics solutions. If you aim to master sophisticated Elasticsearch operations and real-time integrations, this book is tailored for you.

Data Warehousing with Greenplum, 2nd Edition

2019-07-25 O'Reilly Amazon

book

Marshall Presser

data data-engineering storage-repositories data-warehouse Analytics Data Analytics

Data professionals are confronting the most disruptive change since relational databases appeared in the 1980s. SQL is still a major tool for data analytics, but conventional relational database management systems can’t handle the increasing size and complexity of today’s datasets. This updated edition teaches you best practices for Greenplum Database, the open source massively parallel processing (MPP) database that accommodates large sets of nonrelational and relational data. Marshall Presser, field CTO at Pivotal, introduces Greenplum’s approach to data analytics and data-driven decisions, beginning with its shared-nothing architecture. IT managers, developers, data analysts, system architects, and data scientists will all gain from exploring data organization and storage, data loading, running queries, and learning to perform analytics in the database. Discover how MPP and Greenplum will help you go beyond the traditional data warehouse. This ebook covers: Greenplum features, use case examples, and techniques for optimizing use; four Greenplum deployment options to help you balance security, cost, and time to usability; why each networked node in Greenplum’s architecture includes an independent operating system, memory, and storage; and additional tools for monitoring, managing, securing, and optimizing query responses in the Pivotal Greenplum commercial database.

Operationalizing the Data Lake

2019-07-25 O'Reilly Amazon

book

Jon King , Holden Ackerman

data data-engineering storage-repositories data-lake Analytics Big Data

Big data and advanced analytics have increasingly moved to the cloud as organizations pursue actionable insights and data-driven products using the growing amounts of information they collect. But few companies have truly operationalized data so it’s usable for the entire organization. With this pragmatic ebook, engineers, architects, and data managers will learn how to build and extract value from a data lake in the cloud and leverage the compute power and scalability of a cloud-native data platform to put your company’s vast data trove into action. Holden Ackerman and Jon King of Qubole take you through the basics of building a data lake operation, from people to technology, employing multiple technologies and frameworks in a cloud-native data platform. You'll dive into the tools and processes you need for the entire lifecycle of a data lake, from data preparation, storage, and management to distributed computing and analytics. You’ll also explore the unique role that each member of your data team needs to play as you migrate to your cloud-native data platform. Leverage your data effectively through a single source of truth. Understand the importance of building a self-service culture for your data lake. Define the structure you need to build a data lake in the cloud. Implement financial governance and data security policies for your data lake through a cloud-native data platform. Identify the tools you need to manage your data infrastructure. Delineate the scope, usage rights, and best tools for each team working with a data lake: analysts, data scientists, data engineers, and security professionals, among others.

IBM Spectrum Scale: Big Data and Analytics Solution Brief

2019-07-17 O'Reilly Amazon

book

Wei G. Gong , Sandeep R Patil

data data-engineering IBM Analytics Big Data Cloud Computing

This IBM® Redguide™ publication describes big data and analytics deployments that are built on IBM Spectrum Scale™. IBM Spectrum Scale is a proven enterprise-level distributed file system that is a high-performance and cost-effective alternative to Hadoop Distributed File System (HDFS) for Hadoop analytics services. IBM Spectrum Scale includes NFS, SMB, and Object services and meets the performance that is required by many industry workloads, such as technical computing, big data, analytics, and content management. IBM Spectrum Scale provides world-class, web-based storage management with extreme scalability, flash accelerated performance, and automatic policy-based storage tiering from flash through disk to the cloud, which reduces storage costs by up to 90% while improving security and management efficiency in cloud, big data, and analytics environments. This Redguide publication is intended for technical professionals (analytics consultants, technical support staff, IT Architects, and IT Specialists) who are responsible for providing Hadoop analytics services and are interested in learning about the benefits of using IBM Spectrum Scale as an alternative to HDFS.

IBM FlashSystem 900 Model AE3 Product Guide

2019-07-09 O'Reilly Amazon

book

Eike Schenk , Jon Herd , Detlef Helmbrecht

data data-engineering IBM Analytics Cloud Computing

Today's global organizations depend on the ability to unlock business insights from massive volumes of data. Now, with IBM® FlashSystem 900 Model AE3, they can make faster decisions based on real-time insights and unleash the power of demanding applications, including online transaction processing (OLTP) and analytical databases, virtual desktop infrastructures (VDIs), technical computing applications, and cloud environments. Easy to deploy and manage, IBM FlashSystem® 900 Model AE3 is designed to accelerate the applications that drive your business. Powered by IBM FlashCore® Technology, IBM FlashSystem 900 Model AE3 provides the following benefits: Accelerate business-critical workloads, real-time analytics, and cognitive applications with the consistent microsecond latency and extreme reliability of IBM FlashCore technology. Improve performance and help lower cost with new inline data compression. Help reduce capital and operational expenses with IBM enhanced 3D triple-level cell (3D TLC) flash. Protect critical data assets with patented IBM Variable Stripe RAID™. Power faster insights with IBM FlashCore, including hardware-accelerated nonvolatile memory (NVM) architecture, purpose-engineered IBM MicroLatency® modules, and advanced flash management. FlashSystem 900 Model AE3 can be configured in capacity points from 14.4 TB to 180 TB usable and up to 360 TB effective capacity after RAID 5 protection and compression. You can couple this product with 16 Gbps or 8 Gbps Fibre Channel, 16 Gbps NVMe over Fibre Channel, or 40 Gbps InfiniBand connectivity. Thus, the IBM FlashSystem 900 Model AE3 provides extreme performance to existing and next generation infrastructure.

Streaming Data

2019-06-25 O'Reilly Amazon

book

Andy Oram

data data-engineering streaming-messaging streaming-architecture AI/ML Analytics

Managers and staff responsible for planning, hiring, and allocating resources need to understand how streaming data can fundamentally change their organizations. Companies everywhere are disrupting business, government, and society by using data and analytics to shape their business. Even if you don’t have deep knowledge of programming or digital technology, this high-level introduction brings data streaming into focus. You won’t find math or programming details here, or recommendations for particular tools in this rapidly evolving space. But you will explore the decision-making technologies and practices that organizations need to process streaming data and respond to fast-changing events. By describing the principles and activities behind this new phenomenon, author Andy Oram shows you how streaming data provides hidden gems of information that can transform the way your business works. Learn where streaming data comes from and how companies put it to work. Follow a simple data processing project from ingesting and analyzing data to presenting results. Explore how (and why) big data processing tools have evolved from MapReduce to Kubernetes. Understand why streaming data is particularly useful for machine learning projects. Learn how containers, microservices, and cloud computing led to continuous integration and DevOps.

Managing Your Data Science Projects: Learn Salesmanship, Presentation, and Maintenance of Completed Models

2019-06-07 O'Reilly Amazon

book

Robert de Graaf

data data-engineering data-models Analytics Data Science

At first glance, the skills required to work in the data science field appear to be self-explanatory. Do not be fooled. Impactful data science demands an interdisciplinary knowledge of business philosophy, project management, salesmanship, presentation, and more. In Managing Your Data Science Projects, author Robert de Graaf explores important concepts that are frequently overlooked in much of the instructional literature that is available to data scientists new to the field. If your completed models are to be used and maintained most effectively, you must be able to present and sell them within your organization in a compelling way. The value of data science within an organization cannot be overstated. Thus, it is vital that strategies and communication between teams are dexterously managed. Three main ways that data science strategy is used in a company are to research its customers, assess risk analytics, and log operational measurements. These all require different managerial instincts, backgrounds, and experiences, and de Graaf cogently breaks down the unique reasons behind each. They must align seamlessly to eventually be adopted as dynamic models. Data science is a relatively new discipline, and as such, internal processes for it are not as well-developed within an operational business as others. With Managing Your Data Science Projects, you will learn how to create products that solve important problems for your customers and ensure that the initial success is sustained throughout the product’s intended life. Your users will trust you and your models, and most importantly, you will be a more well-rounded and effectual data scientist throughout your career. Who This Book Is For: Early-career data scientists, managers of data scientists, and those interested in entering the field of data science.

Stream Processing with Apache Spark

2019-06-05 O'Reilly Amazon

book

Francois Garillot , Gerard Maas

data data-engineering apache-spark AI/ML Analytics Flink

Before you can build analytics tools to gain quick insights, you first need to know how to process data in real time. With this practical guide, developers familiar with Apache Spark will learn how to put this in-memory framework to use for streaming data. You’ll discover how Spark enables you to write streaming jobs in almost the same way you write batch jobs. Authors Gerard Maas and François Garillot help you explore the theoretical underpinnings of Apache Spark. This comprehensive guide features two sections that compare and contrast the streaming APIs Spark now supports: the original Spark Streaming library and the newer Structured Streaming API. Learn fundamental stream processing concepts and examine different streaming architectures. Explore Structured Streaming through practical examples; learn different aspects of stream processing in detail. Create and operate streaming jobs and applications with Spark Streaming; integrate Spark Streaming with other Spark APIs. Learn advanced Spark Streaming techniques, including approximation algorithms and machine learning algorithms. Compare Apache Spark to other stream processing projects, including Apache Storm, Apache Flink, and Apache Kafka Streams.

Learning Elastic Stack 7.0 - Second Edition

2019-05-31 O'Reilly Amazon

book

Sharath Kumar , Pranav Shukla

data data-engineering search elasticsearch elastic-stack-elk-stack elastic stack (elk stack)

"Learning Elastic Stack 7.0" introduces you to the tools and techniques of Elastic Stack, covering Elasticsearch, Logstash, Beats, and Kibana. With clear explanations and practical examples, this book helps you grasp the 7.0 version's new features and capabilities, empowering you to build and deploy robust, real-time data processing applications. What this Book will help me do Gain the necessary skills to install and configure Elastic Stack for professional use. Master the data handling capabilities of Elasticsearch for distributed search and analytics. Develop expertise in creating data pipelines with Logstash and other ingestion tools. Learn to utilize Kibana to visualize and interpret complex datasets. Acquire knowledge of deploying Elastic Stack solutions both on-premise and in cloud environments. Author(s) Pranav Shukla and Sharath Kumar M N are experienced software engineers and data professionals with a profound knowledge of databases, distributed systems, and cloud architectures. They specialize in educating developers through structured guidance and proven methodologies related to data handling and visualization. Who is it for? This book is designed for software engineers, data analysts, and technical architects interested in learning the Elastic Stack tools from the ground up. Readers familiar with database concepts but new to Elastic Stack will find this book particularly helpful. Advanced users seeking to understand the updates in Elastic Stack 7.0 are also a complementary audience. If you wish to apply Elastic Stack to real-time data processing and analytics, this book provides a strong foundation.

Data Architecture: A Primer for the Data Scientist, 2nd Edition

2019-04-30 O'Reilly Amazon

book

Mary Levins , Daniel Linstedt , W. H. Inmon

data data-engineering Analytics Big Data Data Science DWH

Over the past 5 years, the concept of big data has matured, data science has grown exponentially, and data architecture has become a standard part of organizational decision-making. Throughout all this change, the basic principles that shape the architecture of data have remained the same. There remains a need for people to take a look at the "bigger picture" and to understand where their data fit into the grand scheme of things. Data Architecture: A Primer for the Data Scientist, Second Edition addresses the larger architectural picture of how big data fits within the existing information infrastructure or data warehousing systems. This is an essential topic not only for data scientists, analysts, and managers but also for researchers and engineers who increasingly need to deal with large and complex sets of data. Until data are gathered and can be placed into an existing framework or architecture, they cannot be used to their full potential. Drawing upon years of practical experience and using numerous examples and case studies from across various industries, the authors seek to explain this larger picture into which big data fits, giving data scientists the necessary context for how pieces of the puzzle should fit together. New case studies include expanded coverage of textual management and analytics. New chapters cover visualization and big data. The book also discusses new visualizations of the end-state architecture.

Elasticsearch 7.0 Cookbook - Fourth Edition

2019-04-30 O'Reilly Amazon

book

Alberto Paro

data data-engineering search elasticsearch Analytics Big Data

"Elasticsearch 7.0 Cookbook" is a practical guide to effectively using Elasticsearch, packed with over 100 recipes that cover everything from simple setup tasks to advanced query creation. Whether you're deploying Elasticsearch nodes or integrating with various technologies, this book will empower you to make the most out of Elasticsearch's robust search capabilities. What this Book will help me do Understand how to efficiently deploy and manage Elasticsearch architectures within your enterprise. Learn to create and optimize queries for effective analytics and data retrieval. Explore advanced indexing and mapping techniques to enhance data searchability. Monitor and scale your Elasticsearch clusters to ensure optimal performance. Integrate Elasticsearch with programming languages and big data applications. Author(s) Alberto Paro, a seasoned Elasticsearch expert, brings years of experience in designing and implementing large-scale search and analytics solutions. His practical experience in guiding teams through complex Elasticsearch deployments is evident in his clear and solution-focused writing approach. Alberto's passion for technology drives his mission to make advanced technical topics accessible. Who is it for? This book is ideal for software engineers, data professionals, and Elasticsearch developers who are looking to expand their technical capabilities in search and data analytics. It is also suited for individuals in industries like e-commerce utilizing Elastic for insights. A basic understanding of Elasticsearch will allow readers to gain deeper value from this book.

Data Science and Engineering at Enterprise Scale

2019-04-25 O'Reilly Amazon

book

Jerome Nilmeier

data data-science AI/ML Analytics Data Science Python

As enterprise-scale data science sharpens its focus on data-driven decision making and machine learning, new tools have emerged to help facilitate these processes. This practical ebook shows data scientists and enterprise developers how the notebook interface, Apache Spark, and other collaboration tools are particularly well suited to bridge the communication gap between their teams. Through a series of real-world examples, author Jerome Nilmeier demonstrates how to generate a model that enables data scientists and developers to share ideas and project code. You’ll learn how data scientists can approach real-world business problems with Spark and how developers can then implement the solution in a production environment. Dive deep into data science technologies, including Spark, TensorFlow, and the Jupyter Notebook. Learn how Spark and Python notebooks enable data scientists and developers to work together. Explore how the notebook environment works with Spark SQL for structured data. Use notebooks and Spark as a launchpad to pursue supervised, unsupervised, and deep learning data models. Learn additional Spark functionality, including graph analysis and streaming. Explore the use of analytics in the production environment, particularly when creating data pipelines and deploying code.

Implementing IBM FlashSystem 900 Model AE3

2019-04-12 O'Reilly Amazon

book

Katja Kratt , Eike Schenk , Christian Karpp , Jon Herd , Detlef Helmbrecht , Jim Cioffi , David Gimpl

data data-engineering IBM Analytics Cloud Computing

Today's global organizations depend on being able to unlock business insights from massive volumes of data. Now, with IBM® FlashSystem 900 Model AE3 that is powered by IBM FlashCore® technology, they can make faster decisions that are based on real-time insights. They also can unleash the power of the most demanding applications, including online transaction processing (OLTP) and analytics databases, virtual desktop infrastructures (VDIs), technical computing applications, and cloud environments. This IBM Redbooks® publication introduces clients to the IBM FlashSystem® 900 Model AE3. It provides in-depth knowledge of the product architecture, software and hardware, implementation, and hints and tips. Also presented are use cases that show real-world solutions for tiering, flash-only, and preferred-read. Examples of the benefits that are gained by integrating the FlashSystem storage into business environments also are described. This book is intended for pre-sales and post-sales technical support professionals and storage administrators, and anyone who wants to understand how to implement this new and exciting technology.

Stream Processing with Apache Flink

2019-04-11 O'Reilly Amazon

book

Vasiliki Kalavri , Fabian Hueske

data data-engineering streaming-messaging streaming & messaging Analytics Flink

Get started with Apache Flink, the open source framework that powers some of the world’s largest stream processing applications. With this practical book, you’ll explore the fundamental concepts of parallel stream processing and discover how this technology differs from traditional batch data processing. Longtime Apache Flink committers Fabian Hueske and Vasia Kalavri show you how to implement scalable streaming applications with Flink’s DataStream API and continuously run and maintain these applications in operational environments. Stream processing is ideal for many use cases, including low-latency ETL, streaming analytics, and real-time dashboards as well as fraud detection, anomaly detection, and alerting. You can process continuous data of any kind, including user interactions, financial transactions, and IoT data, as soon as you generate them. Learn concepts and challenges of distributed stateful stream processing. Explore Flink’s system architecture, including its event-time processing mode and fault-tolerance model. Understand the fundamentals and building blocks of the DataStream API, including its time-based and stateful operators. Read data from and write data to external systems with exactly-once consistency. Deploy and configure Flink clusters. Operate continuously running streaming applications.

Hands-On Big Data Analytics with PySpark

2019-03-29 O'Reilly Amazon

book

Bartłomiej Potaczek , Rudy Lai

data data-engineering apache-spark PySpark Analytics Big Data

Dive into the exciting world of big data analytics with 'Hands-On Big Data Analytics with PySpark'. This practical guide offers you the tools and knowledge to tackle massive datasets using PySpark. By exploring real-world examples, you'll learn to unleash the power of distributed systems to analyze and manipulate data at scale. What this Book will help me do: Master using PySpark to handle large and complex datasets efficiently and effectively. Develop skills to optimize Spark programs using best practices like reducing shuffle operations. Learn to set up a PySpark environment and process data from platforms like HDFS, Hive, and S3. Enhance your data analytics capabilities by implementing powerful SQL queries and data visualizations. Understand testing and debugging techniques to build reliable, production-quality data pipelines. Author(s): Rudy Lai and Bartłomiej Potaczek are both seasoned data engineers and authors in the big data field. Rudy and Bartłomiej bring their extensive experience working with distributed systems and scalable data architectures into this book. Their approach is hands-on, focusing on real-world applications and best practices. Who is it for? This book is tailored for data scientists, engineers, and developers eager to advance their big data analytics capabilities. Whether you're new to big data or experienced with other analytics frameworks, this book will equip you with practical knowledge to utilize PySpark for scalable data solutions.

Data Lake Maturity Model

2019-03-25 O'Reilly Amazon

book

Scott Gidley , Andy Oram

data data-engineering storage-repositories data-lake Analytics Big Data

Data is changing everything. Many industries today are being fundamentally transformed through the accumulation and analysis of large quantities of data, stored in diversified but flexible repositories known as data lakes. Whether your company has just begun to think about big data or has already initiated a strategy for handling it, this practical ebook shows you how to plan a successful data lake migration. You’ll learn the value of data lakes, their structure, and the problems they attempt to solve. Using Zaloni’s data lake maturity model, you’ll then explore your organization’s readiness for putting a data lake into action. Do you have the tools and data architectures to support big data analysis? Are your people and processes prepared? The data lake maturity model will help you rate your organization’s readiness. This report includes: the structure and purpose of a data lake; descriptive, predictive, and prescriptive analytics; data lake curation, self-service, and the use of data lake zones; how to rate your organization using the data lake maturity model; and a complete checklist to help you determine your strategic path forward.

AI and Big Data on IBM Power Systems Servers

2019-03-22 O'Reilly Amazon

book

Rafael Freitas de Lima , Ivaylo B. Bozhinov , Scott Vetter , Anto A John , Ahmed Mashhour , James Van Oosten , Fernando Vermelho , Allison White

data data-engineering IBM ibm-power-systems AI/ML Analytics

As big data becomes more ubiquitous, businesses are wondering how they can best leverage it to gain insight into their most important business questions. Using machine learning (ML) and deep learning (DL) in big data environments can identify historical patterns and build artificial intelligence (AI) models that can help businesses to improve customer experience, add services and offerings, identify new revenue streams or lines of business (LOBs), and optimize business or manufacturing operations. The power of AI for predictive analytics is being harnessed across all industries, so it is important that businesses familiarize themselves with all of the tools and techniques that are available for integration with their data lake environments. In this IBM® Redbooks® publication, we cover the best practices for deploying and integrating some of the best AI solutions on the market, including: IBM Watson Machine Learning Accelerator (see note for product naming); IBM Watson Studio Local; IBM Power Systems™; IBM Spectrum™ Scale; IBM Data Science Experience (IBM DSX); IBM Elastic Storage™ Server; Hortonworks Data Platform (HDP); Hortonworks DataFlow (HDF); and H2O Driverless AI. We map out all the integrations that are possible with our different AI solutions and how they can integrate with your existing or new data lake. We also walk you through some of our client use cases and show you how some of the industry leaders are using Hortonworks, IBM PowerAI, and IBM Watson Studio Local to drive decision making. We also advise you on your deployment options, when to use a GPU, and why you should use the IBM Elastic Storage Server (IBM ESS) to improve storage management. Lastly, we describe how to integrate IBM Watson Machine Learning Accelerator and Hortonworks with or without IBM Watson Studio Local, how to access real-time data, and security. Note: IBM Watson Machine Learning Accelerator is the new product name for IBM PowerAI Enterprise.
Note: Hortonworks merged with Cloudera in January 2019. The new company is called Cloudera. References to Hortonworks as a business entity in this publication are now referring to the merged company. Product names beginning with Hortonworks continue to be marketed and sold under their original names.

IBM DS8880 Architecture and Implementation (Release 8.51)

2019-02-26 O'Reilly Amazon

book

Sherry Brunson , Bert Dufrasne , Peter Kimmel , Stephen Manthorpe , Andreas Reinhardt , Connie Riggins , Tamas Toser , Axel Westphal

data data-engineering IBM Analytics Cloud Computing

Updated for R8.51. This IBM® Redbooks® publication describes the concepts, architecture, and implementation of the IBM DS8880 family. The book provides reference information to assist readers who need to plan for, install, and configure the DS8880 systems. The IBM DS8000® family is a high-performance, high-capacity, highly secure, and resilient series of disk storage systems. The DS8880 family is the latest and most advanced of the DS8000 offerings to date. The high availability, multiplatform support, including IBM Z, and simplified management tools help provide a cost-effective path to on-demand and cloud-based infrastructures. The IBM DS8880 family now offers business-critical, all-flash, and hybrid data systems that span a wide range of price points: DS8882F: Rack Mounted storage system; DS8884: Business Class; DS8886: Enterprise Class; and DS8888: Analytics Class. The DS8884 and DS8886 are available as hybrid models or can be configured as all-flash. Each model represents the most recent in this series of high-performance, high-capacity, flexible, and resilient storage systems. These systems are intended to address the needs of the most demanding clients. Two powerful IBM POWER8® processor-based servers manage the cache to streamline disk I/O, maximizing performance and throughput. These capabilities are further enhanced with the availability of the second generation of high-performance flash enclosures (HPFEs Gen-2) and newer flash drives. Like its predecessors, the DS8880 supports advanced disaster recovery (DR) solutions, business continuity solutions, and thin provisioning. All disk drives in the DS8880 storage system include the Full Disk Encryption (FDE) feature. The DS8880 can automatically optimize the use of each storage tier, particularly flash drives, by using the IBM Easy Tier® feature. Release 8.5 introduces the Safeguarded Copy feature. The DS8882F Rack Mounted is described in a separate publication, Introducing the IBM DS8882F Rack Mounted Storage System, REDP-5505.

Dynamic SQL: Applications, Performance, and Security in Microsoft SQL Server

2018-12-27 O'Reilly Amazon

book

Edward Pollack

data data-engineering relational-databases microsoft-sql-server Analytics BI

Take a deep dive into the many uses of dynamic SQL in Microsoft SQL Server. This edition has been updated to use the newest features in SQL Server 2016 and SQL Server 2017 as well as incorporating the changing landscape of analytics and database administration. Code examples have been updated with new system objects and functions to improve efficiency and maintainability. Executing dynamic SQL is key to large-scale searching based on user-entered criteria. Dynamic SQL can generate lists of values and even code with minimal impact on performance. Dynamic SQL enables dynamic pivoting of data for business intelligence solutions as well as customizing of database objects. Yet dynamic SQL is feared by many due to concerns over SQL injection or code maintainability. Dynamic SQL: Applications, Performance, and Security in Microsoft SQL Server helps you bring the productivity and user satisfaction of flexible and responsive applications to your organization safely and securely. Your organization’s increased ability to respond to rapidly changing business scenarios will build competitive advantage in an increasingly crowded and competitive global marketplace. With a focus on new applications and modern database architecture, this edition illustrates that dynamic SQL continues to evolve and be a valuable tool for administration, performance optimization, and analytics. What You'll Learn: Build flexible applications that respond to changing business needs. Take advantage of creative, innovative, and productive uses of dynamic SQL. Know about SQL injection and be confident in your defenses against it. Address performance concerns in stored procedures and dynamic SQL. Troubleshoot and debug dynamic SQL to ensure correct results. Automate your administration of features within SQL Server. Who This Book is For: Developers and database administrators looking to hone and build their T-SQL coding skills. The book is ideal for developers wanting to plumb the depths of application flexibility and troubleshoot performance issues involving dynamic SQL. The book is also ideal for programmers wanting to learn what dynamic SQL is about and how it can help them deliver competitive advantage to their organizations.

Machine Learning with Apache Spark Quick Start Guide

2018-12-26 O'Reilly Amazon

book

Jillur Quddus

data data-engineering apache-spark AI/ML Analytics Big Data

"Machine Learning with Apache Spark Quick Start Guide" introduces you to the fundamental concepts and tools needed to harness the power of Apache Spark for data processing and machine learning. This book combines practical examples and real-world scenarios to show you how to manage big data efficiently while uncovering actionable insights through advanced analytics. What this Book will help me do Understand the role of Apache Spark in the big data ecosystem. Set up and configure an Apache Spark development environment. Learn and implement supervised and unsupervised learning models using Spark MLlib. Apply advanced analytical algorithms to real-world big data problems. Develop and deploy real-time machine learning pipelines with Apache Spark. Author(s) None Quddus is an experienced practitioner in the fields of big data, distributed technologies, and machine learning. With a career dedicated to using advanced analytics to solve real-world problems, Quddus brings practical expertise to each topic addressed. Their approachable writing style ensures readers can apply concepts effectively, even in complex scenarios. Who is it for? This book is ideal for business analysts, data analysts, and data scientists who are eager to gain hands-on experience with big data technologies. Whether you are new to Apache Spark or looking to expand your knowledge of its machine learning capabilities, this guide provides the tools and insights necessary to achieve those goals. Technical professionals wanting to develop their skills in processing and analyzing big data will find this resource invaluable.

Apache Spark 2: Data Processing and Real-Time Analytics

2018-12-21 O'Reilly Amazon

book

Romeo Kienzler , Sridhar Alla , Md. Rezaul Karim , Siamak Amirghodsi

data data-engineering apache-spark AI/ML Analytics Big Data

Build efficient data flow and machine learning programs with this flexible, multi-functional open-source cluster-computing framework. Key Features: Master the art of real-time big data processing and machine learning. Explore a wide range of use-cases to analyze large data. Discover ways to optimize your work by using many features of Spark 2.x and Scala. Book Description: Apache Spark is an in-memory, cluster-based data processing system that provides a wide range of functionalities such as big data processing, analytics, machine learning, and more. With this Learning Path, you can take your knowledge of Apache Spark to the next level by learning how to expand Spark's functionality and building your own data flow and machine learning programs on this platform. You will work with the different modules in Apache Spark, such as interactive querying with Spark SQL, using DataFrames and datasets, implementing streaming analytics with Spark Streaming, and applying machine learning and deep learning techniques on Spark using MLlib and various external tools. By the end of this elaborately designed Learning Path, you will have all the knowledge you need to master Apache Spark, and build your own big data processing and analytics pipeline quickly and without any hassle. This Learning Path includes content from the following Packt products: Mastering Apache Spark 2.x by Romeo Kienzler; Scala and Spark for Big Data Analytics by Md. Rezaul Karim and Sridhar Alla; Apache Spark 2.x Machine Learning Cookbook by Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, and Shuen Mei. What you will learn: Get to grips with all the features of Apache Spark 2.x. Perform highly optimized real-time big data processing. Use ML and DL techniques with Spark MLlib and third-party tools. Analyze structured and unstructured data using SparkSQL and GraphX. Understand tuning, debugging, and monitoring of big data applications. Build scalable and fault-tolerant streaming applications. Develop scalable recommendation engines. Who this book is for: If you are an intermediate-level Spark developer looking to master the advanced capabilities and use-cases of Apache Spark 2.x, this Learning Path is ideal for you. Big data professionals who want to learn how to integrate and use the features of Apache Spark and build a strong big data pipeline will also find this Learning Path useful. To grasp the concepts explained in this Learning Path, you must know the fundamentals of Apache Spark and Scala.
