
Topic: Analytics
Tags: data_analysis, insights, metrics
395 tagged activities
Activity Trend: peak of 398 activities per quarter (2020-Q1 to 2026-Q1)

Activities

Showing results filtered by: O'Reilly Data Engineering Books
Simplify Big Data Analytics with Amazon EMR

Simplify Big Data Analytics with Amazon EMR is a thorough guide to harnessing Amazon's EMR service for big data processing and analytics. From distributed computation pipelines to real-time streaming analytics, this book provides hands-on knowledge and actionable steps for implementing data solutions efficiently.

What this Book will help me do
- Understand the architecture and key components of Amazon EMR and how to deploy it effectively.
- Learn to configure and manage distributed data processing pipelines using Amazon EMR.
- Implement security and data governance best practices within the Amazon EMR ecosystem.
- Master batch ETL and real-time analytics techniques using technologies like Apache Spark.
- Apply optimization and cost-saving strategies to scalable data solutions.

Author(s)
Sakti Mishra is a seasoned data professional with extensive expertise in deploying scalable analytics solutions on cloud platforms like AWS. With a background in big data technologies and a passion for teaching, Sakti ensures practical insights accompany every concept. Readers will find his approach thorough, hands-on, and highly informative.

Who is it for?
This book is perfect for data engineers, data scientists, and other professionals looking to leverage Amazon EMR for scalable analytics. If you are familiar with Python, Scala, or Java and have some exposure to Hadoop or AWS ecosystems, this book will empower you to design and implement robust data pipelines efficiently.

Data Analytics, Computational Statistics, and Operations Research for Engineers

This book investigates the role of data mining in computational statistics for machine learning. It offers applications that can be used in various domains and examines the role of transformation functions in optimizing problem statements.

Data Lakehouse in Action

"Data Lakehouse in Action" provides a comprehensive exploration of the Data Lakehouse architecture, a modern solution for scalable and effective large-scale analytics. This book guides you through understanding the principles and components of the architecture, and its implementation using cloud platforms like Azure. Learn the practical techniques for designing robust systems tailored to organizational needs and maturity. What this Book will help me do Understand the evolution and need for modern data architecture patterns like Data Lakehouse. Learn how to design systems for data ingestion, storage, processing, and serving in a Data Lakehouse. Develop best practices for data governance and security in the Data Lakehouse architecture. Discover various analytics workflows enabled by the Data Lakehouse, including real-time and batch approaches. Implement practical Data Lakehouse patterns on a cloud platform, and integrate them with macro-patterns such as Data Mesh. Author(s) Pradeep Menon is a seasoned data architect and engineer with extensive experience implementing data analytics solutions for leading companies. With a penchant for simplifying complex architectures, Pradeep has authored several technical publications and frequently shares his expertise at industry conferences. His hands-on approach and passion for teaching shine through in his practical guides. Who is it for? This book is ideal for data professionals including architects, engineers, and data strategists eager to enhance their knowledge in modern analytics platforms. If you have a basic understanding of data architecture and are curious about implementing systems governed by the Data Lakehouse paradigm, this book is for you. It bridges foundational concepts with advanced practices, making it suitable for learners aiming to contribute effectively to their organization's analytics efforts.

Data Analysis with Python and PySpark

Think big about your data! PySpark brings the powerful Spark big data processing engine to the Python ecosystem, letting you seamlessly scale up your data tasks and create lightning-fast pipelines.

In Data Analysis with Python and PySpark you will learn how to:
- Manage your data as it scales across multiple machines
- Scale up your data programs with full confidence
- Read and write data to and from a variety of sources and formats
- Deal with messy data with PySpark's data manipulation functionality
- Discover new data sets and perform exploratory data analysis
- Build automated data pipelines that transform, summarize, and get insights from data
- Troubleshoot common PySpark errors
- Create reliable long-running jobs

Data Analysis with Python and PySpark is your guide to delivering successful Python-driven data projects. Packed with relevant examples and essential techniques, this practical book teaches you to build pipelines for reporting, machine learning, and other data-centric tasks. Quick exercises in every chapter help you practice what you've learned, and rapidly start implementing PySpark into your data systems. No previous knowledge of Spark is required.

About the Technology
The Spark data processing engine is an amazing analytics factory: raw data comes in, insight comes out. PySpark wraps Spark's core engine with a Python-based API. It helps simplify Spark's steep learning curve and makes this powerful tool available to anyone working in the Python data ecosystem.

About the Book
Data Analysis with Python and PySpark helps you solve the daily challenges of data science with PySpark. You'll learn how to scale your processing capabilities across multiple machines while ingesting data from any source—whether that's Hadoop clusters, cloud data storage, or local data files. Once you've covered the fundamentals, you'll explore the full versatility of PySpark by building machine learning pipelines, and blending Python, pandas, and PySpark code.

What's Inside
- Organizing your PySpark code
- Managing your data, no matter the size
- Scaling up your data programs with full confidence
- Troubleshooting common data pipeline problems
- Creating reliable long-running jobs

About the Reader
Written for data scientists and data engineers comfortable with Python.

About the Author
As an ML director for a data-driven software company, Jonathan Rioux uses PySpark daily. He teaches the software to data scientists, engineers, and data-savvy business analysts.

Quotes
"A clear and in-depth introduction for truly tackling big data with Python." - Gustavo Patino, Oakland University William Beaumont School of Medicine
"The perfect way to learn how to analyze and master huge datasets." - Gary Bake, Brambles
"Covers both basic and more advanced topics of PySpark, with a good balance between theory and hands-on." - Philippe Van Bergen, P² Consulting
"For beginner to pro, a well-written book to help understand PySpark." - Raushan Kumar Jha, Microsoft
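
As a taste of the style of analysis the book teaches, here is a minimal PySpark sketch; the file name sales.csv and the region/amount columns are hypothetical placeholders, not examples from the book:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Start (or reuse) a local Spark session
spark = SparkSession.builder.appName("sales-summary").getOrCreate()

# Read a CSV file into a distributed DataFrame
# ("sales.csv" and its columns are hypothetical placeholders)
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Summarize revenue per region -- the kind of pipeline step the book builds up
summary = (
    df.groupBy("region")
      .agg(F.sum("amount").alias("total_amount"))
      .orderBy(F.desc("total_amount"))
)

summary.show()
spark.stop()
```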

Data Mesh

We're at an inflection point in data, where our data management solutions no longer match the complexity of organizations, the proliferation of data sources, and the scope of our aspirations to get value from data with AI and analytics. In this practical book, author Zhamak Dehghani introduces data mesh, a decentralized sociotechnical paradigm drawn from modern distributed architecture that provides a new approach to sourcing, sharing, accessing, and managing analytical data at scale.

Dehghani guides practitioners, architects, technical leaders, and decision makers on their journey from traditional big data architecture to a distributed and multidimensional approach to analytical data management. Data mesh treats data as a product, considers domains as a primary concern, applies platform thinking to create self-serve data infrastructure, and introduces a federated computational model of data governance.

- Get a complete introduction to data mesh principles and its constituents
- Design a data mesh architecture
- Guide a data mesh strategy and execution
- Navigate organizational design to a decentralized data ownership model
- Move beyond traditional data warehouses and lakes to a distributed data mesh

Mastering Snowflake Solutions: Supporting Analytics and Data Sharing

Design for large-scale, high-performance queries using Snowflake's query processing engine to empower data consumers with timely, comprehensive, and secure access to data. This book also helps you protect your most valuable data assets using built-in security features such as end-to-end encryption for data at rest and in transit. It demonstrates key features in Snowflake and shows how to exploit those features to deliver a personalized experience to your customers. It also shows how to ingest the high volumes of both structured and unstructured data that are needed for game-changing business intelligence analysis.

Mastering Snowflake Solutions starts with a refresher on Snowflake's unique architecture before getting into the advanced concepts that make Snowflake the market-leading product it is today. Progressing through each chapter, you will learn how to leverage storage, query processing, cloning, data sharing, and continuous data protection features. This approach allows for greater operational agility in responding to the needs of modern enterprises, for example in supporting agile development techniques via database cloning. The practical examples and in-depth background on theory in this book help you unleash the power of Snowflake in building a high-performance system with little to no administrative overhead. Reading it will leave you with a deep understanding of Snowflake that enables you to take full advantage of its architecture and deliver valuable analytics insights to your business.

What You Will Learn
- Optimize performance and costs associated with your use of the Snowflake data platform
- Enable data security to help in complying with consumer privacy regulations such as CCPA and GDPR
- Share data securely both inside your organization and with external partners
- Gain visibility into each interaction with your customers using continuous data feeds from Snowpipe
- Break down data silos to gain complete visibility into your business-critical processes
- Transform customer experience and product quality through real-time analytics

Who This Book Is For
Data engineers, scientists, and architects who have had some exposure to the Snowflake data platform or bring some experience from working with another relational database. This book is for those beginning to struggle with new challenges as their Snowflake environment begins to mature, becoming more complex with ever increasing amounts of data, users, and requirements. New problems require a new approach, and this book aims to arm you with the practical knowledge required to take advantage of Snowflake's unique architecture to get the results you need.
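
To illustrate the database cloning feature the blurb highlights, here is a minimal, hedged sketch using the snowflake-connector-python package; the credential placeholders and the ANALYTICS_DB/ANALYTICS_DEV names are hypothetical, not from the book:

```python
import snowflake.connector  # pip install snowflake-connector-python

# Connect with hypothetical placeholder credentials
conn = snowflake.connector.connect(
    account="<account_identifier>",
    user="<user>",
    password="<password>",
    role="SYSADMIN",
    warehouse="<warehouse>",
)

try:
    cur = conn.cursor()
    # Zero-copy clone: creates a writable copy of an existing database
    # without duplicating the underlying storage (ANALYTICS_DB is hypothetical)
    cur.execute("CREATE DATABASE ANALYTICS_DEV CLONE ANALYTICS_DB")
    print(cur.fetchone())
finally:
    conn.close()
```

Because the clone shares storage with its source until either side changes, this is the kind of feature that makes spinning up development environments cheap, which is the agility point the blurb makes.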

Analytics Optimization with Columnstore Indexes in Microsoft SQL Server: Optimizing OLAP Workloads

Meet the challenge of storing and accessing analytic data in SQL Server in a fast and performant manner. This book illustrates how columnstore indexes can provide an ideal solution for storing analytic data that leads to faster performing analytic queries and the ability to ask and answer business intelligence questions with alacrity. The book provides a complete walkthrough of columnstore indexing that encompasses an introduction, best practices, hands-on demonstrations, explanations of common mistakes, and a detailed architecture that is suitable for professionals of all skill levels. With little or no knowledge of columnstore indexing you can become proficient with columnstore indexes as used in SQL Server, and apply that knowledge in development, test, and production environments.

This book serves as a comprehensive guide to the use of columnstore indexes and provides definitive guidelines. You will learn when columnstore indexes should be used, and the performance gains that you can expect. You will also become familiar with best practices around architecture, implementation, and maintenance. Finally, you will know the limitations and common pitfalls to be aware of and avoid. As analytic data can become quite large, the expense to manage it or migrate it can be high. This book shows that columnstore indexing represents an effective storage solution that saves time and money and improves performance for any application that uses it. You will see that columnstore indexes are an effective performance solution included in all versions of SQL Server, with no additional costs or licensing required.

What You Will Learn
- Implement columnstore indexes in SQL Server
- Know best practices for the use and maintenance of analytic data in SQL Server
- Use metadata to fully understand the size and shape of data stored in columnstore indexes
- Employ optimal ways to load, maintain, and delete data from large analytic tables
- Know how columnstore compression saves storage, memory, and time
- Understand when a columnstore index should be used instead of a rowstore index
- Be familiar with advanced features and analytics

Who This Book Is For
Database developers, administrators, and architects who are responsible for analytic data, especially those working with very large data sets who are looking for new ways to achieve high performance in their queries, and those with immediate or future challenges to analytic data and query performance who want a methodical and effective solution
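
The book itself works in T-SQL; as a hedged illustration of the core idea, the sketch below issues the columnstore DDL from Python via pyodbc. The connection string and the dbo.FactSales table are hypothetical placeholders:

```python
import pyodbc  # pip install pyodbc

# Hypothetical connection string; adjust server, database, and auth for your setup
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=localhost;DATABASE=SalesDW;Trusted_Connection=yes;"
)
conn.autocommit = True
cur = conn.cursor()

# Create a clustered columnstore index on a (hypothetical) fact table;
# this converts the table to columnar storage, which typically speeds up
# large analytic scans and aggregations
cur.execute("CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales ON dbo.FactSales")

# An aggregate query of the kind that benefits from columnstore batch-mode execution
cur.execute("SELECT ProductKey, SUM(SalesAmount) FROM dbo.FactSales GROUP BY ProductKey")
for product_key, total in cur.fetchall():
    print(product_key, total)
```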

Kafka in Action

Master the wicked-fast Apache Kafka streaming platform through hands-on examples and real-world projects.

In Kafka in Action you will learn:
- Understanding Apache Kafka concepts
- Setting up and executing basic ETL tasks using Kafka Connect
- Using Kafka as part of a large data project team
- Performing administrative tasks
- Producing and consuming event streams
- Working with Kafka from Java applications
- Implementing Kafka as a message queue

Kafka in Action is a fast-paced introduction to every aspect of working with Apache Kafka. Starting with an overview of Kafka's core concepts, you'll immediately learn how to set up and execute basic data movement tasks and how to produce and consume streams of events. Advancing quickly, you'll soon be ready to use Kafka in your day-to-day workflow, and start digging into even more advanced Kafka topics.

About the Technology
Think of Apache Kafka as a high-performance software bus that facilitates event streaming, logging, analytics, and other data pipeline tasks. With Kafka, you can easily build features like operational data monitoring and large-scale event processing into both large and small-scale applications.

About the Book
Kafka in Action introduces the core features of Kafka, along with relevant examples of how to use it in real applications. In it, you'll explore the most common use cases such as logging and managing streaming data. When you're done, you'll be ready to handle both basic developer- and admin-based tasks in a Kafka-focused team.

What's Inside
- Kafka as an event streaming platform
- Kafka producers and consumers from Java applications
- Kafka as part of a large data project

About the Reader
For intermediate Java developers or data engineers. No prior knowledge of Kafka required.

About the Authors
Dylan Scott is a software developer in the insurance industry. Viktor Gamov is a Kafka-focused developer advocate. At Confluent, Dave Klein helps developers, teams, and enterprises harness the power of event streaming with Apache Kafka.

Quotes
"The authors have had many years of real-world experience using Kafka, and this book's on-the-ground feel really sets it apart." - From the foreword by Jun Rao, Confluent Cofounder
"A surprisingly accessible introduction to a very complex technology. Developers will want to keep a copy close by." - Conor Redmond, InComm Payments
"A comprehensive and practical guide to Kafka and the ecosystem." - Sumant Tambe, LinkedIn
"It quickly gave me insight into how Kafka works, and how to design and protect distributed message applications." - Gregor Rayman, Cloudfarms
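
The book's examples are in Java; for a quick feel of the produce/consume cycle it describes, here is a hedged sketch using the separate kafka-python client instead. The broker address and topic name are hypothetical:

```python
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

BROKER = "localhost:9092"   # hypothetical broker address
TOPIC = "page-views"        # hypothetical topic

# Produce a few events
producer = KafkaProducer(bootstrap_servers=BROKER)
for i in range(3):
    producer.send(TOPIC, value=f"event-{i}".encode("utf-8"))
producer.flush()
producer.close()

# Consume them back from the beginning of the topic
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating when no message arrives for 5s
)
for message in consumer:
    print(message.offset, message.value.decode("utf-8"))
consumer.close()
```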

Why External Data Needs to Be Part of Your Data and Analytics Strategy

Innovative organizations today are reaping the benefits of combining data from a variety of internal and external sources. By collecting, storing, analyzing, and leveraging external data, these companies are able to outperform competitors by unlocking improvements in growth, productivity, and risk management. This report explains how you can harness the power of external data to boost analytics, find competitive advantages, and drive value.

Author Joseph D. Stec explains how clever companies are now using advanced analytics tools that can simultaneously collect, mix, and match diverse data from disparate data sources. This enables them to improve products and brand loyalty, generate better conversions, identify trends earlier, and pinpoint additional ways to improve customer satisfaction.

With this report, you will:
- Learn how external data elevates and enhances the way you analyze and interpret data outside of your apps or databases
- Dive into the nuts and bolts of external data platforms to solve key challenges
- Understand how new technology makes external data easier to use with analytics
- Learn how an external data platform fits into your data architecture
- Gain access to relevant external data signals with Explorium, an automated external data management platform
- Unlock improvements in growth, productivity, and risk management

Data Mesh in Practice

The data mesh is poised to replace data lakes and data warehouses as the dominant architectural pattern in data and analytics. By promoting the concept of domain-focused data products that go beyond file sharing, data mesh helps you deal with data quality at scale by establishing true data ownership. This approach is so new, however, that misconceptions and a general lack of practical experience with implementing data mesh are widespread. With this report, you'll learn how to successfully overcome challenges in the adoption process.

By drawing on their experience building large-scale data infrastructure, designing data architectures, and contributing to the data strategies of large and successful corporations, authors Max Schultze and Arif Wider have identified the most common pain points along the data mesh journey. You'll examine the foundations of the data mesh paradigm and gain both technical and organizational insights. This report is ideal for companies just starting to work with data, for organizations already in the process of transforming their data infrastructure landscape, as well as for advanced companies working on federated governance setups for a sustainable data-driven future.

This report covers:
- Data mesh principles and practical examples for getting started
- Typical challenges and solutions you'll encounter when implementing a data mesh
- Data mesh pillars including domain ownership, data as a product, and infrastructure as a platform
- How to move toward a decentralized data product and build a data infrastructure platform

Optimizing Databricks Workloads

Unlock the full potential of Apache Spark on the Databricks platform with "Optimizing Databricks Workloads". This book equips you with must-know techniques to effectively configure, manage, and optimize big data processing pipelines. Dive into real-world scenarios and learn practical approaches to reduce costs and improve performance in your data engineering processes.

What this Book will help me do
- Understand and apply optimization techniques for Databricks workloads.
- Choose the right cluster configurations to maximize efficiency and minimize costs.
- Leverage Delta Lake for performance-boosted data processing and optimization.
- Develop skills for managing Spark DataFrames and core functionalities in Databricks.
- Gain insights into real-world scenarios to effectively improve workload performance.

Author(s)
Anirudh Kala and the co-authors are experienced practitioners in the fields of data engineering and analytics. With years of professional expertise in leveraging Apache Spark and Databricks, they bring real-world insight into performance optimization. Their approach blends practical instruction with actionable strategies, making this book an essential guide for data engineers aiming to excel in this domain.

Who is it for?
This book is tailored for data engineers, data scientists, and cloud architects looking to elevate their skills in managing Databricks workloads. Ideal for readers with basic knowledge of Spark and Databricks, it helps them get hands-on with optimization techniques. If you are aiming to enhance your Spark-based data processing systems, this book offers the guidance you need.

Snowflake Essentials: Getting Started with Big Data in the Cloud

Understand the essentials of the Snowflake Database and the overall Snowflake Data Cloud. This book covers how Snowflake's architecture is different from prior on-premises and cloud databases. The authors also discuss, from an insider perspective, how Snowflake grew so fast to become the largest software IPO of all time. Snowflake was the first database made specifically to be optimized with a cloud architecture.

This book helps you get started using Snowflake by first understanding its architecture and what separates it from other database platforms you may have used. You will learn about setting up users and accounts, and then creating database objects. You will know how to load data into Snowflake and query and analyze that data, including unstructured data such as data in XML and JSON formats. You will also learn about Snowflake's compute platform and the different data sharing options that are available.

What You Will Learn
- Run analytics in the Snowflake Data Cloud
- Create users and roles in Snowflake
- Set up security in Snowflake
- Set up resource monitors in Snowflake
- Set up and optimize Snowflake Compute
- Load, unload, and query structured and unstructured data (JSON, XML) within Snowflake
- Use Snowflake Data Sharing to share data
- Set up a Snowflake Data Exchange
- Use the Snowflake Data Marketplace

Who This Book Is For
Database professionals or information technology professionals who want to move beyond traditional database technologies by learning Snowflake, a new and massively scalable cloud-based database solution
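
To show the kind of JSON querying the book covers, here is a hedged sketch using the snowflake-connector-python package; the RAW_EVENTS table, its PAYLOAD column, and the credentials are hypothetical placeholders:

```python
import snowflake.connector  # pip install snowflake-connector-python

conn = snowflake.connector.connect(
    account="<account_identifier>",  # hypothetical placeholders
    user="<user>",
    password="<password>",
    warehouse="<warehouse>",
    database="<database>",
    schema="PUBLIC",
)

cur = conn.cursor()
# RAW_EVENTS and its VARIANT column PAYLOAD are hypothetical; the path syntax
# (payload:customer.name) is how Snowflake addresses fields inside stored JSON
cur.execute("""
    SELECT payload:customer.name::STRING AS customer_name,
           payload:order.total::NUMBER   AS order_total
    FROM RAW_EVENTS
    LIMIT 10
""")
for customer_name, order_total in cur:
    print(customer_name, order_total)
conn.close()
```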

Apache Pulsar in Action

Deliver lightning-fast and reliable messaging for your distributed applications with the flexible and resilient Apache Pulsar platform.

In Apache Pulsar in Action you will learn how to:
- Publish from Apache Pulsar into third-party data repositories and platforms
- Design and develop Apache Pulsar functions
- Perform interactive SQL queries against data stored in Apache Pulsar

Apache Pulsar in Action is a comprehensive and practical guide to building high-traffic applications with Pulsar. You'll learn to use this mature and battle-tested platform to deliver extreme levels of speed and durability to your messaging. Apache Pulsar committer David Kjerrumgaard teaches you to apply Pulsar's seamless scalability through hands-on case studies, including IoT analytics applications and a microservices app based on Pulsar functions.

About the Technology
Reliable server-to-server messaging is the heart of a distributed application. Apache Pulsar is a flexible real-time messaging platform built to run on Kubernetes and deliver the scalability and resilience required for cloud-based systems. Pulsar supports both streaming and message queuing, and unlike other solutions, it can communicate over multiple protocols including MQTT, AMQP, and Kafka's binary protocol.

About the Book
Apache Pulsar in Action teaches you to build scalable streaming messaging systems using Pulsar. You'll start with a rapid introduction to enterprise messaging and discover the unique benefits of Pulsar. Following crystal-clear explanations and engaging examples, you'll use the Pulsar Functions framework to develop a microservices-based application. Real-world case studies illustrate how to implement the most important messaging design patterns.

What's Inside
- Publish from Pulsar into third-party data repositories and platforms
- Design and develop Apache Pulsar functions
- Create an event-driven food delivery application

About the Reader
Written for experienced Java developers. No prior knowledge of Pulsar required.

About the Author
David Kjerrumgaard is a committer on the Apache Pulsar project. He currently serves as a Developer Advocate for StreamNative, where he develops Pulsar best practices and solutions.

Quotes
"Apache Pulsar in Action is able to seamlessly mix the theory and abstract concepts with the clarity of practical step-by-step examples. I'd recommend to anyone!" - Matteo Merli, co-creator of Apache Pulsar
"Gives readers insights into how the 'magic' works… Definitely recommended." - Henry Saputra, Splunk
"A complete, practical, fun-filled book." - Satej Kumar Sahu, Honeywell
"A definitive guide that will help you scale your applications." - Alessandro Campeis, Vimar
"The best book to start working with Pulsar." - Emanuele Piccinelli, Empirix
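
The book targets Java developers; as a hedged taste of the same publish/subscribe flow, here is a minimal sketch using the pulsar-client Python library instead. The service URL, topic, and subscription name are hypothetical:

```python
import pulsar  # pip install pulsar-client

SERVICE_URL = "pulsar://localhost:6650"  # hypothetical local broker
TOPIC = "orders"                         # hypothetical topic

client = pulsar.Client(SERVICE_URL)

# Publish a message
producer = client.create_producer(TOPIC)
producer.send("order-123 created".encode("utf-8"))

# Subscribe and read it back
consumer = client.subscribe(TOPIC, subscription_name="order-audit")
msg = consumer.receive(timeout_millis=5000)
print(msg.data().decode("utf-8"))
consumer.acknowledge(msg)  # mark the message as processed

client.close()
```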

Essential PySpark for Scalable Data Analytics

Dive into the world of scalable data processing with 'Essential PySpark for Scalable Data Analytics'. This book is a comprehensive guide that helps beginners understand and utilize PySpark to process, analyze, and draw insights from large datasets effectively. With hands-on tutorials and clear explanations, you will gain the confidence to tackle big data analytics challenges.

What this Book will help me do
- Understand and apply the distributed computing paradigm for big data.
- Learn to perform scalable data ingestion, cleansing, and preparation using PySpark.
- Create and utilize data lakes and the Lakehouse paradigm for efficient data storage and access.
- Develop and deploy machine learning models with scalability in mind.
- Master real-time analytics pipelines and create impactful data visualizations.

Author(s)
Nudurupati is an experienced data engineer and educator, specializing in distributed systems and big data technologies. With years of practical experience in the field, the author brings a clear and approachable teaching style to technical topics. Passionate about empowering readers, the author has designed this book to be both practical and inspirational for aspiring data practitioners.

Who is it for?
This book is ideal for data professionals including data scientists, engineers, and analysts looking to scale their data analytics processes. It assumes familiarity with basic data science concepts and Python, as well as some experience with SQL-like data analysis. It is particularly suitable for individuals aiming to expand their knowledge in distributed computing and PySpark to handle big data challenges. Achieving scalable and efficient data solutions is at the core of this guide.

Data Engineering with Apache Spark, Delta Lake, and Lakehouse

Data Engineering with Apache Spark, Delta Lake, and Lakehouse is a comprehensive guide packed with practical knowledge for building robust and scalable data pipelines. Throughout this book, you will explore the core concepts and applications of Apache Spark and Delta Lake, and learn how to design and implement efficient data engineering workflows using real-world examples.

What this Book will help me do
- Master the core concepts and components of Apache Spark and Delta Lake.
- Create scalable and secure data pipelines for efficient data processing.
- Learn best practices and patterns for building enterprise-grade data lakes.
- Discover how to operationalize data models into production-ready pipelines.
- Gain insights into deploying and monitoring data pipelines effectively.

Author(s)
Kukreja is a seasoned data engineer with over a decade of experience working with big data platforms. He specializes in implementing efficient and scalable data solutions to meet the demands of modern analytics and data science. Writing with clarity and a practical approach, he aims to provide actionable insights that professionals can apply to their projects.

Who is it for?
This book is tailored for aspiring data engineers and data analysts who wish to delve deeper into building scalable data platforms. It is suitable for those with basic knowledge of Python, Spark, and SQL who are seeking to learn Delta Lake and advanced data engineering concepts. Readers should be eager to develop practical skills for tackling real-world data engineering challenges.
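
To make the Delta Lake ideas concrete, here is a minimal sketch, assuming a Spark session with the open-source delta-spark package configured; the path and column names are hypothetical placeholders:

```python
from pyspark.sql import SparkSession

# Configure Spark with the Delta Lake extensions (delta-spark package)
spark = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

events = spark.createDataFrame(
    [(1, "click"), (2, "view")], ["user_id", "event_type"]
)

# Write as a Delta table: an ACID, versioned storage layer on top of Parquet
events.write.format("delta").mode("overwrite").save("/tmp/delta/events")

# Read the table back; versionAsOf enables "time travel" to older snapshots
latest = spark.read.format("delta").load("/tmp/delta/events")
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")
latest.show()
```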

Storage Systems

Storage Systems: Organization, Performance, Coding, Reliability and Their Data Processing was motivated by the 1988 Redundant Array of Inexpensive/Independent Disks proposal to replace large form factor mainframe disks with an array of commodity disks. Disk loads are balanced by striping data into strips—with one strip per disk—and storage reliability is enhanced via replication or erasure coding, which at best dedicates k strips per stripe to tolerate k disk failures. Flash memories have resulted in a paradigm shift, with Solid State Drives (SSDs) replacing Hard Disk Drives (HDDs) for high performance applications. RAID and Flash have resulted in the emergence of new storage companies, namely EMC, NetApp, SanDisk, and Pure Storage, and a multibillion-dollar storage market. Key new conferences and publications are reviewed in this book.

The goal of the book is to expose students, researchers, and IT professionals to the more important developments in storage systems, while covering the evolution of storage technologies, traditional and novel databases, and novel sources of data. We describe several prototypes: FAWN at CMU, RAMCloud at Stanford, and Lightstore at MIT; Oracle's Exadata, AWS' Aurora, Alibaba's PolarDB, and the Fungible Data Center; and the author's paper designs for cloud storage, namely heterogeneous disk arrays and hierarchical RAID.

- Surveys storage technologies and lists sources of data: measurements, text, audio, images, and video
- Familiarizes the reader with paradigms to improve performance: caching, prefetching, log-structured file systems, and log-structured merge trees (LSMs)
- Describes RAID organizations and analyzes their performance and reliability
- Conserves storage via data compression, deduplication, and compaction, and secures data via encryption
- Specifies implications of storage technologies on performance and power consumption
- Exemplifies database parallelism for big data, analytics, and deep learning via multicore CPUs, GPUs, FPGAs, and ASICs, e.g., Google's Tensor Processing Units

Azure Databricks Cookbook

Azure Databricks is a robust analytics platform that leverages Apache Spark and seamlessly integrates with Azure services. In the Azure Databricks Cookbook, you'll find hands-on recipes to ingest data, build modern data pipelines, and perform real-time analytics while learning to optimize and secure your solutions.

What this Book will help me do
- Design advanced data workflows integrating Azure Synapse, Cosmos DB, and streaming sources with Databricks.
- Gain proficiency in using Delta Tables and Spark for efficient data storage and analysis.
- Learn to create, deploy, and manage real-time dashboards with Databricks SQL.
- Master CI/CD pipelines for automating deployments of Databricks solutions.
- Understand security best practices for restricting access and monitoring Azure Databricks.

Author(s)
Raj and Jaiswal are experienced professionals in the field of big data and analytics. They are well-versed in implementing Azure Databricks solutions for real-world problems. Their collaborative writing approach ensures clarity and practical focus.

Who is it for?
This book is tailored for data engineers, scientists, and big data professionals who want to apply Azure Databricks and Apache Spark to their analytics workflows. A basic familiarity with Spark and Azure is recommended to make the best use of the recipes provided. If you're looking to scale and optimize your analytics pipelines, this book is for you.

Foundations of Data Intensive Applications

PEEK “UNDER THE HOOD” OF BIG DATA ANALYTICS

The world of big data analytics grows ever more complex. And while many people can work superficially with specific frameworks, far fewer understand the fundamental principles of large-scale, distributed data processing systems and how they operate. In Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood, renowned big-data experts and computer scientists Drs. Supun Kamburugamuve and Saliya Ekanayake deliver a practical guide to applying the principles of big data to software development for optimal performance.

The authors discuss foundational components of large-scale data systems and walk readers through the major software design decisions that define performance, application type, and usability. You'll learn how to recognize problems in your applications resulting in performance and distributed operation issues, diagnose them, and effectively eliminate them by relying on the bedrock big data principles explained within. Moving beyond individual frameworks and APIs for data processing, this book unlocks the theoretical ideas that operate under the hood of every big data processing system.

Ideal for data scientists, data architects, dev-ops engineers, and developers, Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood shows readers how to:
- Identify the foundations of large-scale, distributed data processing systems
- Make major software design decisions that optimize performance
- Diagnose performance problems and distributed operation issues
- Understand state-of-the-art research in big data
- Explain and use the major big data frameworks and understand what underpins them
- Use big data analytics in the real world to solve practical problems

Cloudera Data Platform Private Cloud Base with IBM Spectrum Scale

This IBM® Redpaper publication provides guidance on building an enterprise-grade data lake by using IBM Spectrum® Scale and Cloudera Data Platform (CDP) Private Cloud Base for performing in-place Cloudera Hadoop or Cloudera Spark-based analytics. It also covers the benefits of the integrated solution and gives guidance about the types of deployment models and considerations during the implementation of these models. The August 2021 update added CES protocol support in Hadoop environments.

Developing Modern Applications with a Converged Database

Single-purpose databases were designed to address specific problems and use cases. Given this narrow focus, there are inherent tradeoffs required when trying to accommodate multiple datatypes or workloads in your enterprise environment. The result is data fragmentation that spills over into application development, IT operations, data security, system scalability, and availability.

In this report, author Alice LaPlante explains why developing modern, data-driven applications may be easier and more synergistic when using a converged database. Senior developers, architects, and technical decision-makers will learn cloud-native application development techniques for working with both structured and unstructured data. You'll discover ways to run transactional and analytical workloads on a single, unified data platform.

This report covers:
- Benefits and challenges of using a converged database to develop data-driven applications
- How to use one platform to work with both structured and unstructured data that includes JSON, XML, text and files, spatial and graph, Blockchain, IoT, time series, and relational data
- Modern development practices on a converged database, including API-driven development, containers, microservices, and event streaming
- Use case examples including online food delivery, real-time fraud detection, and marketing based on real-time analytics and geospatial targeting