O'Reilly Data Engineering Books

Highly Efficient Data Access with RoCE on IBM Elastic Storage Systems and IBM Spectrum Scale

2022-02-18 O'Reilly Amazon

book

Piyush Chaudhary , Gero Schmidt , Olaf Weiser

data data-engineering IBM ELK

With Remote Direct Memory Access (RDMA), you can make a subset of a host's memory directly available to a remote host. RDMA is available on standard Ethernet-based networks by using the RDMA over Converged Ethernet (RoCE) interface. The RoCE network protocol is an industry-standard initiative by the InfiniBand Trade Association. This IBM® Redpaper publication describes how to set up RoCE to use within an IBM Spectrum® Scale cluster and IBM Elastic Storage® Systems (ESSs). This book is targeted at technical professionals (consultants, technical support staff, IT Architects, and IT Specialists) who are responsible for delivering cost-effective storage solutions with IBM Spectrum Scale and IBM ESSs.

Kafka in Action

2022-02-13 O'Reilly Amazon

book

Dave Klein , Dylan Scott , VIKTOR GAMOV

data data-engineering streaming-messaging Kafka Analytics ETL/ELT

Master the wicked-fast Apache Kafka streaming platform through hands-on examples and real-world projects. In Kafka in Action you will learn: Understanding Apache Kafka concepts Setting up and executing basic ETL tasks using Kafka Connect Using Kafka as part of a large data project team Performing administrative tasks Producing and consuming event streams Working with Kafka from Java applications Implementing Kafka as a message queue Kafka in Action is a fast-paced introduction to every aspect of working with Apache Kafka. Starting with an overview of Kafka's core concepts, you'll immediately learn how to set up and execute basic data movement tasks and how to produce and consume streams of events. Advancing quickly, you’ll soon be ready to use Kafka in your day-to-day workflow, and start digging into even more advanced Kafka topics. About the Technology Think of Apache Kafka as a high performance software bus that facilitates event streaming, logging, analytics, and other data pipeline tasks. With Kafka, you can easily build features like operational data monitoring and large-scale event processing into both large and small-scale applications. About the Book Kafka in Action introduces the core features of Kafka, along with relevant examples of how to use it in real applications. In it, you’ll explore the most common use cases such as logging and managing streaming data. When you’re done, you’ll be ready to handle both basic developer- and admin-based tasks in a Kafka-focused team. What's Inside Kafka as an event streaming platform Kafka producers and consumers from Java applications Kafka as part of a large data project About the Reader For intermediate Java developers or data engineers. No prior knowledge of Kafka required. About the Authors Dylan Scott is a software developer in the insurance industry. Viktor Gamov is a Kafka-focused developer advocate. At Confluent, Dave Klein helps developers, teams, and enterprises harness the power of event streaming with Apache Kafka. Quotes The authors have had many years of real-world experience using Kafka, and this book’s on-the-ground feel really sets it apart. - From the foreword by Jun Rao, Confluent Cofounder A surprisingly accessible introduction to a very complex technology. Developers will want to keep a copy close by. - Conor Redmond, InComm Payments A comprehensive and practical guide to Kafka and the ecosystem. - Sumant Tambe, Linkedin It quickly gave me insight into how Kafka works, and how to design and protect distributed message applications. - Gregor Rayman, Cloudfarms

PHP & MySQL: Novice to Ninja, 7th Edition

2022-02-10 O'Reilly Amazon

book

Tom Butler

data data-engineering relational-databases MySQL SQL

PHP & MySQL: Novice to Ninja, 7th Edition is a hands-on guide to learning all the tools, principles, and techniques needed to build a professional web application using PHP & MySQL. Comprehensively updated to cover PHP 8 and modern best practice, this highly practical and fun book covers everything from installation through to creating a complete online content management system. Gain a thorough understanding of PHP syntax Master database design principles and SQL Write robust, maintainable, best practice code Build a working content management system (CMS) And much more!

Data Privacy

2022-02-05 O'Reilly Amazon

book

Nishant Bhajaria

data data-engineering data-security-privacy data security & privacy Data Governance Fabric

Engineer privacy into your systems with these hands-on techniques for data governance, legal compliance, and surviving security audits. In Data Privacy you will learn how to: Classify data based on privacy risk Build technical tools to catalog and discover data in your systems Share data with technical privacy controls to measure reidentification risk Implement technical privacy architectures to delete data Set up technical capabilities for data export to meet legal requirements like Data Subject Asset Requests (DSAR) Establish a technical privacy review process to help accelerate the legal Privacy Impact Assessment (PIA) Design a Consent Management Platform (CMP) to capture user consent Implement security tooling to help optimize privacy Build a holistic program that will get support and funding from the C-Level and board Data Privacy teaches you to design, develop, and measure the effectiveness of privacy programs. You’ll learn from author Nishant Bhajaria, an industry-renowned expert who has overseen privacy at Google, Netflix, and Uber. The terminology and legal requirements of privacy are all explained in clear, jargon-free language. The book’s constant awareness of business requirements will help you balance trade-offs, and ensure your user’s privacy can be improved without spiraling time and resource costs. About the Technology Data privacy is essential for any business. Data breaches, vague policies, and poor communication all erode a user’s trust in your applications. You may also face substantial legal consequences for failing to protect user data. Fortunately, there are clear practices and guidelines to keep your data secure and your users happy. About the Book Data Privacy: A runbook for engineers teaches you how to navigate the trade-offs between strict data security and real world business needs. In this practical book, you’ll learn how to design and implement privacy programs that are easy to scale and automate. There’s no bureaucratic process—just workable solutions and smart repurposing of existing security tools to help set and achieve your privacy goals. What's Inside Classify data based on privacy risk Set up capabilities for data export that meet legal requirements Establish a review process to accelerate privacy impact assessment Design a consent management platform to capture user consent About the Reader For engineers and business leaders looking to deliver better privacy. About the Author Nishant Bhajaria leads the Technical Privacy and Strategy teams for Uber. His previous roles include head of privacy engineering at Netflix, and data security and privacy at Google. Quotes I wish I had had this text in 2015 or 2016 at Netflix, and it would have been very helpful in 2008–2012 in a time of significant architectural evolution of our technology. - From the Foreword by Neil Hunt, Former CPO, Netflix Your guide to building privacy into the fabric of your organization. - John Tyler, JPMorgan Chase The most comprehensive resource you can find about privacy. - Diego Casella, InvestSuite Offers some valuable insights and direction for enterprises looking to improve the privacy of their data. - Peter White, Charles Sturt University

IBM FlashSystem Best Practices and Performance Guidelines for IBM Spectrum Virtualize Version 8.4.2

2022-02-02 O'Reilly Amazon

book

Carlton Beatty , Nils Olsson , Konrad Trojok , David Green , Vasfi Gucer , Hartmut Lonzer , Mandy Stevens , Uwe Schreiber , Renato Santos , Rene Oehme , Kendall Williams , Sergey Kubin , Jonathan Wilkie , Thales Noivo Ferreira , Nezih Boyacıoglu , Antonio Rainero

data data-engineering IBM

This IBM® Redbooks® publication captures several of the preferred practices and describes the performance gains that can be achieved by implementing the IBM FlashSystem® products that are powered by IBM Spectrum® Virtualize Version 8.4.2. These practices are based on field experience. This book highlights configuration guidelines and preferred practices for the storage area network (SAN) topology, clustered system, back-end storage, storage pools and managed disks, volumes, Remote Copy services, and hosts. It explains how you can optimize disk performance with the IBM System Storage Easy Tier® function. It also provides preferred practices for monitoring, maintaining, and troubleshooting. This book is intended for experienced storage, SAN, IBM FlashSystem, SAN Volume Controller, and IBM Storwize® administrators and technicians. Understanding this book requires advanced knowledge of these environments.

Why External Data Needs to Be Part of Your Data and Analytics Strategy

2022-01-25 O'Reilly Amazon

book

Joseph D. Stec

data data-engineering Analytics Data Management

Innovative organizations today are reaping the benefits of combining data from a variety of internal and external sources. By collecting, storing, analyzing, and leveraging external data, these companies are able to outperform competitors by unlocking improvements in growth, productivity, and risk management. This report explains how you can harness the power of external data to boost analytics, find competitive advantages, and drive value. Author Joseph D. Stec explains how clever companies are now using advanced analytics tools that can simultaneously collect, mix, and match diverse data from disparate data sources. This enables them to improve products and brand loyalty, generate better conversions, identify trends earlier, and pinpoint additional ways to improve customer satisfaction. With this report, you will: Learn how external data elevates and enhances the way you analyze and interpret data outside of your apps or databases Dive into the nuts and bolts of external data platforms to solve key challenges Understand how new technology makes external data easier to use with analytics Learn how an external data platform fits into your data architecture Gain access to relevant external data signals with Explorium, an automated external data management platform Unlock improvements in growth, productivity, and risk management

Cassandra: The Definitive Guide, (Revised) Third Edition, 3rd Edition

2022-01-23 O'Reilly Amazon

book

Eben Hewitt , Jeff Carpenter

data data-engineering nosql-databases Cassandra Cloud Computing Data Modelling

Imagine what you could do if scalability wasn't a problem. With this hands-on guide, you'll learn how the Cassandra database management system handles hundreds of terabytes of data while remaining highly available across multiple data centers. This revised third edition--updated for Cassandra 4.0 and new developments in the Cassandra ecosystem, including deployments in Kubernetes with K8ssandra--provides technical details and practical examples to help you put this database to work in a production environment. Authors Jeff Carpenter and Eben Hewitt demonstrate the advantages of Cassandra's nonrelational design, with special attention to data modeling. Developers, DBAs, and application architects looking to solve a database scaling issue or future-proof an application will learn how to harness Cassandra's speed and flexibility. Understand Cassandra's distributed and decentralized structure Use the Cassandra Query Language (CQL) and cqlsh (the CQL shell) Create a working data model and compare it with an equivalent relational model Design and develop applications using client drivers Explore cluster topology and learn how nodes exchange data Maintain a high level of performance in your cluster Deploy Cassandra onsite, in the cloud, or with Docker and Kubernetes Integrate Cassandra with Spark, Kafka, Elasticsearch, Solr, and Lucene

Building Big Data Pipelines with Apache Beam

2022-01-21 O'Reilly Amazon

book

Jan Lukavský

data data-engineering streaming-messaging apache-beam Beam Big Data

Building Big Data Pipelines with Apache Beam is the essential guide for mastering data processing using Apache Beam. This book covers both the basics and advanced concepts, from implementing pipelines to extending functionalities with custom I/O connectors. By the end, you'll be equipped to build scalable and reusable big data solutions. What this Book will help me do Understand the core principles of Apache Beam and its architecture. Learn how to create efficient data processing pipelines for diverse scenarios. Master the use of stateful processing for real-time data handling. Gain skills in using Beam's portability features for various languages. Explore advanced functionalities like creating custom I/O connectors. Author(s) None Lukavský is a seasoned data engineer with extensive experience in big data technologies and Apache Beam. Having worked on innovative data solutions across industries, None brings hands-on insights and practical expertise to this book. Their approach to teaching ensures readers can directly apply concepts to real-world scenarios. Who is it for? This book is designed for professionals involved in big data, such as data engineers, analysts, and scientists. It is particularly suited for those with an intermediate level of understanding of Java, aiming to expand their skill set to include advanced data pipeline construction. Whether you're stepping into Apache Beam for the first time or looking to deepen your expertise, this book offers valuable, actionable insights.

IBM Storage Networking c-type FICON Implementation Guide

2022-01-11 O'Reilly Amazon

book

William White , Aubrey Applewhaite , Gavin O'Reilly , Fausto Vaninetti , Gary Fisher , Mike Blair , Lyle Ramsey

data data-engineering IBM

The next-generation IBM® c-type Directors and switches for IBM Storage Networking provides high-speed Fibre Channel (FC) and IBM Fibre Connection (IBM FICON®) connectivity from the IBM Z® platform to the storage area network (SAN) core. It enables enterprises to rapidly deploy high-density virtualized servers with the dual benefit of higher bandwidth and consolidation. This IBM Redpaper Redbooks publication helps administrators understand how to implement or migrate to an IBM c-type SAN environment. It provides an overview of the key hardware and software products, and it explains how to install, configure, monitor, tune, and troubleshoot your SAN environment.

SAN and Fabric Resiliency Best Practices for IBM b-type Products

2022-01-05 O'Reilly Amazon

book

Ian MacQuarrie , Gavin O'Reilly , David Green , Jim Blue , David Lutz

data data-engineering IBM Fabric

This IBM® Redpaper® publication describes best practices for deploying and using advanced Broadcom Fabric Operating System (FOS) features to identify, monitor, and protect Fibre Channel (FC) SANs from problematic devices and media behavior. Note that this paper primarily focuses on the FOS command options and features that are available since version 8.2 with some coverage of new features that were introduced in 9.0. This paper covers the following recent changes: SANnav Fabric Performance Impact Notification

Getting Started with IBM Hyper Protect Data Controller

2022-01-04 O'Reilly Amazon

book

Guillaume Hoareau , Bill White , Eva Yan , Philippe Richard , Jason Katonica , Roy Panting , Maxwell Weiss , Andy Coulson

data data-engineering IBM Cyber Security

IBM® Hyper Protect Data Controller is designed to provide privacy protection of your sensitive data and give ease of control and auditability. It can manage how data is shared securely through a central control. Hyper Protect Data Controller can protect data wherever it goes—security policies are kept and honored whenever the data is accessed and future data access can be revoked even after data leaves the system of record. This IBM Redbooks® publication can assist you with determining how to get started with IBM Hyper Protect Data Controller through a use case approach. It will help you plan for, install, tailor and configure the Hyper Protect Data Controller. It includes information about the following topics: Concepts and reference architecture Common use cases with implementation guidance and advice Implementation and policy examples Typical operational tasks for creating policies and preparing for audits Monitoring user activity and events This IBM Redbooks publication is written for IT Managers, IT Architects, Security Administrators, data owners, and data consumers.

Installing and Configuring IBM Db2 AI for IBM z/OS v1.4.0

2022-01-04 O'Reilly Amazon

book

Tim Hogan , Janet Figone , Lydia Parziale , Guanjun Cai

data data-engineering relational-databases ibm-db2 AI/ML Cloud Computing

Artificial intelligence (AI) enables computers and machines to mimic the perception, learning, problem-solving, and decision-making capabilities of the human mind. AI development is made possible by the availability of large amounts of data and the corresponding development and wide availability of computer systems that can process all that data faster and more accurately than humans can. What happens if you infuse AI with a world-class database management system, such as IBM Db2®? IBM® has done just that with Db2 AI for z/OS (Db2ZAI). Db2ZAI is built to infuse AI and data science to assist businesses in the use of AI to develop applications more easily. With Db2ZAI, the following benefits are realized: Data science functionality Better built applications Improved database performance (and DBA's time and efforts are saved) through simplification and automation of error reporting and routine tasks Machine learning (ML) optimizer to improve query access paths and reduce the need for manual tuning and query optimization Integrated data access that makes data available from various vendors including private cloud providers. This IBM Redpaper® publication helps to simplify your installation by tailoring and configuration of Db2 AI for z/OS®. It was written for system programmers, system administrators, and database administrators.

SAP Enterprise Portfolio and Project Management: A Guide to Implement, Integrate, and Deploy EPPM Solutions

2022-01-03 O'Reilly Amazon

book

Joseph Alexander Soosaimuthu

data data-engineering SAP BI

Learn the fundamentals of SAP Enterprise Project and Portfolio management Project Systems (PS), Portfolio and Project Management (PPM) and Commercial Project Management (CPM) and their integration with other SAP modules. This book covers various business scenarios from different industries including the public sector, engineering and construction, professional services, telecom, mining, chemical, and pharmaceutical. Author Joseph Alexander Soosaimuthu will help you understand common business challenges and pain areas faced in portfolio, program and project management, and will provide suitable recommendations to overcome these challenges. This book not only suggests solutions within SAP, but also provides workarounds or integrations with third-party tools based on various Industry-specific business requirements. SAP Portfolio and Project Management addresses commonly asked questions regarding SAP EPPM implementation and deployment, and conveys a framework to facilitate engagement and discussion with key stakeholders. This provides coverage of SAP on-premise solutions with ECC 6.08 and SAP PPM 6.1 deployed on the same client, as well as S/4 HANA On-Premise 2020 with integration to BPC and BI/W systems. Interface with other third-party schedule management, estimation, costing and forecasting applications are also covered in this book. After completing SAP Portfolio and Project Management, you will be able to implement SAP Enterprise Portfolio and Project Management based on industry best practices. For your reference, you’ll also gain a list of development objects and a functionality list by Industry, and a Fiori apps list for Enterprise Portfolio and Project Management (EPPM). What You Will Learn Understand the fundamentals of project, program and portfolio management within SAP EPPM Master the art of project forecasting and scheduling integrations with other SAP modules Gainknowledge of the different interface options for scheduling, estimation, costing and forecasting third party applications Learn EPPM industry best practices, and how to address industry-specific business challenges Leverage operational and strategic reporting within EPPM Who This Book For Functional consultants and business analysts who are involved in SAP EPPM (PS, PPM and CPM) deployment and clients who are interested and are in the process of having SAP EPPM deployed for their Enterprise.

Data Engineering with AWS

2021-12-29 O'Reilly Amazon

book

Gareth Eagar

data data-engineering AI/ML Athena AWS Big Data

Discover how to effectively build and manage data engineering pipelines using AWS with "Data Engineering with AWS". In this hands-on book, you'll explore the foundational principles of data engineering, learn to architect data pipelines, and work with essential AWS services to process, transform, and analyze data. What this Book will help me do Understand and implement modern data engineering pipelines with AWS services. Gain proficiency in automating data ingestion and transformation using Amazon tools. Perform efficient data queries and analysis leveraging Amazon Athena and Redshift. Create insightful data visualizations using Amazon QuickSight. Apply machine learning techniques to enhance data engineering processes. Author(s) None Eagar, a Senior Data Architect with over twenty-five years of experience, specializes in modern data architectures and cloud solutions. With a rich background in applying data engineering to real-world problems, None Eagar shares expertise in a clear and approachable way for readers. Who is it for? This book is perfect for data engineers and data architects aiming to grow their expertise in AWS-based solutions. It's also geared towards beginners in data engineering wanting to adopt the best practices. Those with a basic understanding of big data and cloud platforms will find it particularly valuable, but prior AWS experience is not required.

Data Mesh in Practice

2021-12-25 O'Reilly Amazon

book

Arif Wider , Max Schultze

data data-engineering database-architecture data-mesh Analytics Data Quality

The data mesh is poised to replace data lakes and data warehouses as the dominant architectural pattern in data and analytics. By promoting the concept of domain-focused data products that go beyond file sharing, data mesh helps you deal with data quality at scale by establishing true data ownership. This approach is so new, however, that many misconceptions and a general lack of practical experience for implementing data mesh are widespread. With this report, you'll learn how to successfully overcome challenges in the adoption process. By drawing on their experience building large-scale data infrastructure, designing data architectures, and contributing to data strategies of large and successful corporations, authors Max Schultze and Arif Wider have identified the most common pain points along the data mesh journey. You'll examine the foundations of the data mesh paradigm and gain both technical and organizational insights. This report is ideal for companies just starting to work with data, for organizations already in the process of transforming their data infrastructure landscape, as well as for advanced companies working on federated governance setups for a sustainable data-driven future. This report covers: Data mesh principles and practical examples for getting started Typical challenges and solutions you'll encounter when implementing a data mesh Data mesh pillars including domain ownership, data as a product, and infrastructure as a platform How to move toward a decentralized data product and build a data infrastructure platform

Optimizing Databricks Workloads

2021-12-24 O'Reilly Amazon

book

Sarthak Sarbahi , Anirudh Kala , Anshul Bhatnagar

data data-engineering apache-spark Analytics Big Data Cloud Computing

Unlock the full potential of Apache Spark on the Databricks platform with "Optimizing Databricks Workloads". This book equips you with must-know techniques to effectively configure, manage, and optimize big data processing pipelines. Dive into real-world scenarios and learn practical approaches to reduce costs and improve performance in your data engineering processes. What this Book will help me do Understand and apply optimization techniques for Databricks workloads. Choose the right cluster configurations to maximize efficiency and minimize costs. Leverage Delta Lake for performance-boosted data processing and optimization. Develop skills for managing Spark DataFrames and core functionalities in Databricks. Gain insights into real-world scenarios to effectively improve workload performance. Author(s) Anirudh Kala and the co-authors are experienced practitioners in the fields of data engineering and analytics. With years of professional expertise in leveraging Apache Spark and Databricks, they bring real-world insight into performance optimization. Their approach blends practical instruction with actionable strategies, making this book an essential guide for data engineers aiming to excel in this domain. Who is it for? This book is tailored for data engineers, data scientists, and cloud architects looking to elevate their skills in managing Databricks workloads. Ideal for readers with basic knowledge of Spark and Databricks, it helps them get hands-on with optimization techniques. If you are aiming to enhance your Spark-based data processing systems, this book offers the guidance you need.

Securing IBM Spectrum Scale with QRadar and IBM Cloud Pak for Security

2021-12-20 O'Reilly Amazon

book

IBM

data data-engineering IBM Cloud Computing Cyber Security

Cyberattacks are likely to remain a significant risk for the foreseeable future. Attacks on organizations can be external and internal. Investing in technology and processes to prevent these cyberattacks is the highest priority for these organizations. Organizations need well-designed procedures and processes to recover from attacks. The focus of this document is to demonstrate how the IBM® Unified Data Foundation (UDF) infrastructure plays an important role in delivering the persistence storage (PV) to containerized applications, such as IBM Cloud® Pak for Security (CP4S), with IBM Spectrum® Scale Container Native Storage Access (CNSA) that is deployed with IBM Spectrum scale CSI driver and IBM FlashSystem® storage with IBM Block storage driver with CSI driver. Also demonstrated is how this UDF infrastructure can be used as a preferred storage class to create back-end persistent storage for CP4S deployments. We also highlight how the file I/O events are captured in IBM QRadar® and offenses are generated based on predefined rules. After the offenses are generated, we show how the cases are automatically generated in IBM Cloud Pak® for Security by using the IBM QRadar SOAR Plugin, with a manually automated method to log a case in IBM Cloud Pak for Security. This document also describes the processes that are required for the configuration and integration of the components in this solution, such as: Integration of IBM Spectrum Scale with QRadar QRadar integration with IBM Cloud Pak for Security Integration of the IBM QRadar SOAR Plugin to generate automated cases in CP4S. Finally, this document shows the use of IBM Spectrum Scale CNSA and IBM FlashSystem storage that uses IBM block CSI driver to provision persistent volumes for CP4S deployment. All models of IBM FlashSystem family are supported by this document, including: FlashSystem 9100 and 9200 FlashSystem 7200 and FlashSystem 5000 models FlashSystem 5200 IBM SAN Volume Controller All storage that is running IBM Spectrum Virtualize software

IBM DS8900F Performance Best Practices and Monitoring

2021-12-17 O'Reilly Amazon

book

Lisa Martinez , Peter Kimmel , Rick Pekosh , Ali Rizvi , Ewerson Palacio , Luiz Fernando Moreira , Sherri Brunson , Paul Smith

data data-engineering IBM

This IBM® Redbooks® publication is intended for individuals who want to maximize the performance of their DS8900 storage systems and investigate the planning and monitoring tools that are available.

Access For Dummies

2021-12-14 O'Reilly Amazon

book

Laurie A. Ulrich , Ken Cook

data data-engineering database-management-tools microsoft-access Data Management Data Science

Become a database boss —and have fun doing it—with this accessible and easy-to-follow guide to Microsoft Access Databases hold the key to organizing and accessing all your data in one convenient place. And you don’t have to be a data science wizard to build, populate, and organize your own. With Microsoft Access For Dummies, you’ll learn to use the latest version of Microsoft’s Access software to power your database needs. Need to understand the essentials before diving in? Check out our Basic Training in Part 1 where we teach you how to navigate the Access workspace and explore the foundations of databases. Ready for more advanced tutorials? Skip right to the sections on Data Management, Queries, or Reporting where we walk you through Access’s more sophisticated capabilities. Not sure if you have Access via Office 2021 or Office 365? No worries – this book covers Access now matter how you access it. The book also shows you how to: Handle the most common problems that Access users encounter Import, export, and automatically edit data to populate your next database Write powerful and accurate queries to find exactly what you’re looking for, exactly when you need it Microsoft Access For Dummies is the perfect resource for anyone expected to understand, use, or administer Access databases at the workplace, classroom, or any other data-driven destination.

Snowflake Essentials: Getting Started with Big Data in the Cloud

2021-12-14 O'Reilly Amazon

book

Bjorn Lindstrom , Frank Bell , Ruchi Soni , Sameer Videkar , Bhaskar B. Joshi , Raj Chirumamilla

data data-engineering Snowflake Analytics Big Data Cloud Computing

Understand the essentials of the Snowflake Database and the overall Snowflake Data Cloud. This book covers how Snowflake’s architecture is different from prior on-premises and cloud databases. The authors also discuss, from an insider perspective, how Snowflake grew so fast to become the largest software IPO of all time. Snowflake was the first database made specifically to be optimized with a cloud architecture. This book helps you get started using Snowflake by first understanding its architecture and what separates it from other database platforms you may have used. You will learn about setting up users and accounts, and then creating database objects. You will know how to load data into Snowflake and query and analyze that data, including unstructured data such as data in XML and JSON formats. You will also learn about Snowflake’s compute platform and the different data sharing options that are available. What YouWill Learn Run analytics in the Snowflake Data Cloud Create users and roles in Snowflake Set up security in Snowflake Set up resource monitors in Snowflake Set up and optimize Snowflake Compute Load, unload, and query structured and unstructured data (JSON, XML) within Snowflake Use Snowflake Data Sharing to share data Set up a Snowflake Data Exchange Use the Snowflake Data Marketplace Who This Book Is For Database professionals or information technology professionals who want to move beyond traditional database technologies by learning Snowflake, a new and massively scalable cloud-based database solution

Apache Pulsar in Action

2021-12-13 O'Reilly Amazon

book

David Kjerrumgaard

data data-engineering apache-pulsar Analytics Cloud Computing IoT

Deliver lightning fast and reliable messaging for your distributed applications with the flexible and resilient Apache Pulsar platform. In Apache Pulsar in Action you will learn how to: Publish from Apache Pulsar into third-party data repositories and platforms Design and develop Apache Pulsar functions Perform interactive SQL queries against data stored in Apache Pulsar Apache Pulsar in Action is a comprehensive and practical guide to building high-traffic applications with Pulsar. You’ll learn to use this mature and battle-tested platform to deliver extreme levels of speed and durability to your messaging. Apache Pulsar committer David Kjerrumgaard teaches you to apply Pulsar’s seamless scalability through hands-on case studies, including IOT analytics applications and a microservices app based on Pulsar functions. About the Technology Reliable server-to-server messaging is the heart of a distributed application. Apache Pulsar is a flexible real-time messaging platform built to run on Kubernetes and deliver the scalability and resilience required for cloud-based systems. Pulsar supports both streaming and message queuing, and unlike other solutions, it can communicate over multiple protocols including MQTT, AMQP, and Kafka’s binary protocol. About the Book Apache Pulsar in Action teaches you to build scalable streaming messaging systems using Pulsar. You’ll start with a rapid introduction to enterprise messaging and discover the unique benefits of Pulsar. Following crystal-clear explanations and engaging examples, you’ll use the Pulsar Functions framework to develop a microservices-based application. Real-world case studies illustrate how to implement the most important messaging design patterns. What's Inside Publish from Pulsar into third-party data repositories and platforms Design and develop Apache Pulsar functions Create an event-driven food delivery application About the Reader Written for experienced Java developers. No prior knowledge of Pulsar required. About the Author David Kjerrumgaard is a committer on the Apache Pulsar project. He currently serves as a Developer Advocate for StreamNative, where he develops Pulsar best practices and solutions. Quotes Apache Pulsar in Action is able to seamlessly mix the theory and abstract concepts with the clarity of practical step-by-step examples. I’d recommend to anyone! - Matteo Merli, co-creator of Apache Pulsar Gives readers insights into how the ‘magic’ works… Definitely recommended. - Henry Saputra, Splunk A complete, practical, fun-filled book. - Satej Kumar Sahu, Honeywell A definitive guide that will help you scale your applications. - Alessandro Campeis, Vimar The best book to start working with Pulsar. - Emanuele Piccinelli, Empirix

Machine Learning with PySpark: With Natural Language Processing and Recommender Systems

2021-12-08 O'Reilly Amazon

book

Pramod Singh

data data-engineering apache-spark PySpark AI/ML Airflow

Master the new features in PySpark 3.1 to develop data-driven, intelligent applications. This updated edition covers topics ranging from building scalable machine learning models, to natural language processing, to recommender systems. Machine Learning with PySpark, Second Edition begins with the fundamentals of Apache Spark, including the latest updates to the framework. Next, you will learn the full spectrum of traditional machine learning algorithm implementations, along with natural language processing and recommender systems. You’ll gain familiarity with the critical process of selecting machine learning algorithms, data ingestion, and data processing to solve business problems. You’ll see a demonstration of how to build supervised machine learning models such as linear regression, logistic regression, decision trees, and random forests. You’ll also learn how to automate the steps using Spark pipelines, followed by unsupervised models such as K-means and hierarchical clustering. A section on Natural Language Processing (NLP) covers text processing, text mining, and embeddings for classification. This new edition also introduces Koalas in Spark and how to automate data workflow using Airflow and PySpark’s latest ML library. After completing this book, you will understand how to use PySpark’s machine learning library to build and train various machine learning models, along with related components such as data ingestion, processing and visualization to develop data-driven intelligent applications What you will learn: Build a spectrum of supervised and unsupervised machine learning algorithms Use PySpark's machine learning library to implement machine learning and recommender systems Leverage the new features in PySpark’s machine learning library Understand data processing using Koalas in Spark Handle issues around feature engineering, class balance, bias andvariance, and cross validation to build optimally fit models Who This Book Is For Data science and machine learning professionals.

Mastering Apache Pulsar

2021-12-06 O'Reilly Amazon

book

Jowanza Joseph

data data-engineering apache-pulsar Flink API Big Data

Every enterprise application creates data, including log messages, metrics, user activity, and outgoing messages. Learning how to move these items is almost as important as the data itself. If you're an application architect, developer, or production engineer new to Apache Pulsar, this practical guide shows you how to use this open source event streaming platform to handle real-time data feeds. Jowanza Joseph, staff software engineer at Finicity, explains how to deploy production Pulsar clusters, write reliable event streaming applications, and build scalable real-time data pipelines with this platform. Through detailed examples, you'll learn Pulsar's design principles, reliability guarantees, key APIs, and architecture details, including the replication protocol, the load manager, and the storage layer. This book helps you: Understand how event streaming fits in the big data ecosystem Explore Pulsar producers, consumers, and readers for writing and reading events Build scalable data pipelines by connecting Pulsar with external systems Simplify event-streaming application building with Pulsar Functions Manage Pulsar to perform monitoring, tuning, and maintenance tasks Use Pulsar's operational measurements to secure a production cluster Process event streams using Flink and query event streams using Presto

Cloud-Native Microservices with Apache Pulsar: Build Distributed Messaging Microservices

2021-12-04 O'Reilly Amazon

book

Rahul Sharma , Mohammad Atyab

data data-engineering apache-pulsar AWS AWS Lambda Cloud Computing

Apply different enterprise integration and processing strategies available with Pulsar, Apache's multi-tenant, high-performance, cloud-native messaging and streaming platform. This book is a comprehensive guide that examines using Pulsar Java libraries to build distributed applications with message-driven architecture. You'll begin with an introduction to Apache Pulsar architecture. The first few chapters build a foundation of message-driven architecture. Next, you'll perform a setup of all the required Pulsar components. The book also covers work with Apache Pulsar client library to build producers and consumers for the discussed patterns. You'll then explore the transformation, filter, resiliency, and tracing capabilities available with Pulsar. Moving forward, the book will discuss best practices when building message schemas and demonstrate integration patterns using microservices. Security is an important aspect of any application;the book will cover authentication and authorization in Apache Pulsar such as Transport Layer Security (TLS), OAuth 2.0, and JSON Web Token (JWT). The final chapters will cover Apache Pulsar deployment in Kubernetes. You'll build microservices and serverless components such as AWS Lambda integrated with Apache Pulsar on Kubernetes. After completing the book, you'll be able to comfortably work with the large set of out-of-the-box integration options offered by Apache Pulsar. What You'll Learn Examine the important Apache Pulsar components Build applications using Apache Pulsar client libraries Use Apache Pulsar effectively with microservices Deploy Apache Pulsar to the cloud Who This Book Is For Cloud architects and software developers who build systems in the cloud-native technologies.

Efficient MySQL Performance

2021-11-30 O'Reilly Amazon

book

Daniel Nichter

data data-engineering relational-databases MySQL Cloud Computing SQL

You'll find several books on basic or advanced MySQL performance, but nothing in between. That's because explaining MySQL performance without addressing its complexity is difficult. This practical book bridges the gap by teaching software engineers mid-level MySQL knowledge beyond the fundamentals, but well shy of deep-level internals required by database administrators (DBAs). Daniel Nichter shows you how to apply the best practices and techniques that directly affect MySQL performance. You'll learn how to improve performance by analyzing query execution, indexing for common SQL clauses and table joins, optimizing data access, and understanding the most important MySQL metrics. You'll also discover how replication, transactions, row locking, and the cloud influenceMySQL performance. Understand why query response time is the North Star of MySQL performance Learn query metrics in detail, including aggregation, reporting, and analysis See how to index effectively for common SQL clauses and table joins Explore the most important server metrics and what they reveal about performance Dive into transactions and row locking to gain deep, actionable insight Achieve remarkable MySQL performance at any scale

talk-data.com

O'Reilly Data Engineering Books

Top Topics

Top Speakers

Highly Efficient Data Access with RoCE on IBM Elastic Storage Systems and IBM Spectrum Scale

Kafka in Action

PHP & MySQL: Novice to Ninja, 7th Edition

Data Privacy

IBM FlashSystem Best Practices and Performance Guidelines for IBM Spectrum Virtualize Version 8.4.2

Why External Data Needs to Be Part of Your Data and Analytics Strategy

Cassandra: The Definitive Guide, (Revised) Third Edition, 3rd Edition

Building Big Data Pipelines with Apache Beam

IBM Storage Networking c-type FICON Implementation Guide

SAN and Fabric Resiliency Best Practices for IBM b-type Products

Getting Started with IBM Hyper Protect Data Controller

Installing and Configuring IBM Db2 AI for IBM z/OS v1.4.0

SAP Enterprise Portfolio and Project Management: A Guide to Implement, Integrate, and Deploy EPPM Solutions

Data Engineering with AWS

Data Mesh in Practice

Optimizing Databricks Workloads

Securing IBM Spectrum Scale with QRadar and IBM Cloud Pak for Security

IBM DS8900F Performance Best Practices and Monitoring

Access For Dummies

Snowflake Essentials: Getting Started with Big Data in the Cloud

Apache Pulsar in Action

Machine Learning with PySpark: With Natural Language Processing and Recommender Systems

Mastering Apache Pulsar

Cloud-Native Microservices with Apache Pulsar: Build Distributed Messaging Microservices

Efficient MySQL Performance