Data Engineering

The Data Engineer's Guide to Microsoft Fabric

2027-05-25 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Christian Henrik Reich (twoday Data & AI)

Data Lakehouse Databricks ETL/ELT Microsoft Fabric Python Spark SQL Data Streaming analytics-platforms data data-science +1 more

Modern data engineering is evolving; and with Microsoft Fabric, the entire data platform experience is being redefined. This essential book offers a fresh, hands-on approach to navigating this shift. Rather than being an introduction to features, this guide explains how Fabric's key components—Lakehouse, Warehouse, and Real-Time Intelligence—work under the hood and how to put them to use in realistic workflows. Written by Christian Henrik Reich, a data engineering expert with experience that extends from Databricks to Fabric, this book is a blend of foundational theory and practical implementation of lakehouse solutions in Fabric. You'll explore how engines like Apache Spark and Fabric Warehouse collaborate with Fabric's Real-Time Intelligence solution in an integrated platform, and how to build ETL/ELT pipelines that deliver on speed, accuracy, and scale. Ideal for both new and practicing data engineers, this is your entry point into the fabric of the modern data platform. Acquire a working knowledge of lakehouses, warehouses, and streaming in Fabric Build resilient data pipelines across real-time and batch workloads Apply Python, Spark SQL, T-SQL, and KQL within a unified platform Gain insight into architectural decisions that scale with data needs Learn actionable best practices for engineering clean, efficient, governed solutions

Data Engineering for Multimodal AI

2026-05-25 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Vasundra Srinivasan

AI/ML Cloud Computing Data Governance ETL/ELT MLOps Cyber Security data data-engineering

A shift is underway in how organizations approach data infrastructure for AI-driven transformation. As multimodal AI systems and applications become increasingly sophisticated and data hungry, data systems must evolve to meet these complex demands. Data Engineering for Multimodal AI is one of the first practical guides for data engineers, machine learning engineers, and MLOps specialists looking to rapidly master the skills needed to build robust, scalable data infrastructures for multimodal AI systems and applications. You'll follow the entire lifecycle of AI-driven data engineering, from conceptualizing data architectures to implementing data pipelines optimized for multimodal learning in both cloud native and on-premises environments. And each chapter includes step-by-step guides and best practices for implementing key concepts. Design and implement cloud native data architectures optimized for multimodal AI workloads Build efficient and scalable ETL processes for preparing diverse AI training data Implement real-time data processing pipelines for multimodal AI inference Develop and manage feature stores that support multiple data modalities Apply data governance and security practices specific to multimodal AI projects Optimize data storage and retrieval for various types of multimodal ML models Integrate data versioning and lineage tracking in multimodal AI workflows Implement data-quality frameworks to ensure reliable outcomes across data types Design data pipelines that support responsible AI practices in a multimodal context

Data Engineering with Azure Databricks

2026-04-10 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Xenia Ireton , Tonya Chernyshova , Dmitry Foshin , Dmitry Anoshin

AI/ML Airflow Analytics Azure ADF Azure DevOps CI/CD Cloud Computing Data Governance Data Lakehouse Databricks Delta +11 more

Master end-to-end data engineering on Azure Databricks. From data ingestion and Delta Lake to CI/CD and real-time streaming, build secure, scalable, and performant data solutions with Spark, Unity Catalog, and ML tools. Key Features Build scalable data pipelines using Apache Spark and Delta Lake Automate workflows and manage data governance with Unity Catalog Learn real-time processing and structured streaming with practical use cases Implement CI/CD, DevOps, and security for production-ready data solutions Explore Databricks-native ML, AutoML, and Generative AI integration Book Description "Data Engineering with Azure Databricks" is your essential guide to building scalable, secure, and high-performing data pipelines using the powerful Databricks platform on Azure. Designed for data engineers, architects, and developers, this book demystifies the complexities of Spark-based workloads, Delta Lake, Unity Catalog, and real-time data processing. Beginning with the foundational role of Azure Databricks in modern data engineering, you’ll explore how to set up robust environments, manage data ingestion with Auto Loader, optimize Spark performance, and orchestrate complex workflows using tools like Azure Data Factory and Airflow. The book offers deep dives into structured streaming, Delta Live Tables, and Delta Lake’s ACID features for data reliability and schema evolution. You’ll also learn how to manage security, compliance, and access controls using Unity Catalog, and gain insights into managing CI/CD pipelines with Azure DevOps and Terraform. With a special focus on machine learning and generative AI, the final chapters guide you in automating model workflows, leveraging MLflow, and fine-tuning large language models on Databricks. Whether you're building a modern data lakehouse or operationalizing analytics at scale, this book provides the tools and insights you need. What you will learn Set up a full-featured Azure Databricks environment Implement batch and streaming ingestion using Auto Loader Optimize Spark jobs with partitioning and caching Build real-time pipelines with structured streaming and DLT Manage data governance using Unity Catalog Orchestrate production workflows with jobs and ADF Apply CI/CD best practices with Azure DevOps and Git Secure data with RBAC, encryption, and compliance standards Use MLflow and Feature Store for ML pipelines Build generative AI applications in Databricks Who this book is for This book is for data engineers, solution architects, cloud professionals, and software engineers seeking to build robust and scalable data pipelines using Azure Databricks. Whether you're migrating legacy systems, implementing a modern lakehouse architecture, or optimizing data workflows for performance, this guide will help you leverage the full power of Databricks on Azure. A basic understanding of Python, Spark, and cloud infrastructure is recommended.

Google Cloud Certified Professional Data Engineer Certification Guide

2026-02-20 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Ankur Roy , Sireesha Pulipati

Cloud Computing DWH GCP Cyber Security SQL cloud-computing cloud-platforms gcp-certifications gcp-certifications-professional-tier google-cloud google-cloud-certified-professional-data-engineer google cloud certified - professional data engineer +1 more

A guide to pass the GCP Professional Data Engineer exam on your first attempt and upgrade your data engineering skills on GCP. Key Features Fully understand the certification exam content and exam objectives Consolidate your knowledge of all essential exam topics and key concepts Get realistic experience of answering exam-style questions Develop practical skills for everyday use Purchase of this book unlocks access to web-based exam prep resources including mock exams, flashcards, exam tips Book Description The GCP Professional Data Engineer certification validates the fundamental knowledge required to perform data engineering tasks and use GCP services to enhance data engineering processes and further your career in the data engineering/architecting field. This book is a best-in-class study guide that fully covers the GCP Professional Data Engineer exam objectives and helps you pass the exam first time. Complete with clear explanations, chapter review questions, realistic mock exams, and pragmatic solutions, this guide will help you master the core exam concepts and build the understanding you need to go into the exam with the skills and confidence to get the best result you can. With the help of relevant examples, you'll learn fundamental data engineering concepts such as data warehousing and data security. As you progress, you'll delve into the important domains of the exam, including data pipelining, data migration, and data processing. Unlike other study guides, this book contains logical reasoning behind the choice of correct answers based in scenarios and provide you with excellent tips regarding the optimal use of each service, and gives you everything you need to pass the exam and enhance your prospects in the data engineering field. What you will learn Create data solutions and pipelines in GCP Analyze and transform data into useful information Apply data engineering concepts to real scenarios Create secure, cost-effective, valuable GCP workloads Work in the GCP environment with industry best practices Who this book is for This book is for data engineers who want a reliable source for the key concepts and terms present in the most prestigious and highly-sought-after cloud-based data engineering certification. This book will help you improve your data engineering in GCP skills to give you a better chance at earning the GCP Professional Data Engineer Certification. You will already be familiar with the Google Cloud Platform, having either explored it (professionally or personally) for at least a year. You should also have some familiarity with basic data concepts (such as types of data and basic SQL knowledge).

Practical Data Engineering with Apache Projects: Solving Everyday Data Challenges with Spark, Iceberg, Kafka, Flink, and More

2026-01-01 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Dunith Danushka

Airflow Flink Iceberg Kafka Spark Trino data data-engineering streaming-messaging

This book is a comprehensive guide designed to equip you with the practical skills and knowledge necessary to tackle real-world data challenges using Open Source solutions. Focusing on 10 real-world data engineering projects, it caters specifically to data engineers at the early stages of their careers, providing a strong foundation in essential open source tools and techniques such as Apache Spark, Flink, Airflow, Kafka, and many more. Each chapter is dedicated to a single project, starting with a clear presentation of the problem it addresses. You will then be guided through a step-by-step process to solve the problem, leveraging widely-used open-source data tools. This hands-on approach ensures that you not only understand the theoretical aspects of data engineering but also gain valuable experience in applying these concepts to real-world scenarios. At the end of each chapter, the book delves into common challenges that may arise during the implementation of the solution, offering practical advice on troubleshooting these issues effectively. Additionally, the book highlights best practices that data engineers should follow to ensure the robustness and efficiency of their solutions. A major focus of the book is using open-source projects and tools to solve problems encountered in data engineering. In summary, this book is an indispensable resource for data engineers looking to build a strong foundation in the field. By offering practical, real-world projects and emphasizing problem-solving and best practices, it will prepare you to tackle the complex data challenges encountered throughout your career. Whether you are an aspiring data engineer or looking to enhance your existing skills, this book provides the knowledge and tools you need to succeed in the ever-evolving world of data engineering. You Will Learn: The foundational concepts of data engineering and practical experience in solving real-world data engineering problems How to proficiently use open-source data tools like Apache Kafka, Flink, Spark, Airflow, and Trino 10 hands-on data engineering projects Troubleshoot common challenges in data engineering projects Who is this book for: Early-career data engineers and aspiring data engineers who are looking to build a strong foundation in the field; mid-career professionals looking to transition into data engineering roles; and technology enthusiasts interested in gaining insights into data engineering practices and tools.

Data Engineering for Beginners

2025-11-11 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Chisom Nwokwu

Big Data Cloud Computing Data Governance Data Quality NoSQL Cyber Security data data-engineering

A hands-on technical and industry roadmap for aspiring data engineers In Data Engineering for Beginners, big data expert Chisom Nwokwu delivers a beginner-friendly handbook for everyone interested in the fundamentals of data engineering. Whether you're interested in starting a rewarding, new career as a data analyst, data engineer, or data scientist, or seeking to expand your skillset in an existing engineering role, Nwokwu offers the technical and industry knowledge you need to succeed. The book explains: Database fundamentals, including relational and noSQL databases Data warehouses and data lakes Data pipelines, including info about batch and stream processing Data quality dimensions Data security principles, including data encryption Data governance principles and data framework Big data and distributed systems concepts Data engineering on the cloud Essential skills and tools for data engineering interviews and jobs Data Engineering for Beginners offers an easy-to-read roadmap on a seemingly complicated and intimidating subject. It addresses the topics most likely to cause a beginning data engineer to stumble, clearly explaining key concepts in an accessible way. You'll also find: A comprehensive glossary of data engineering terms Common and practical career paths in the data engineering industry An introduction to key cloud technologies and services you may encounter early in your data engineering career Perfect for practicing and aspiring data analysts, data scientists, and data engineers, Data Engineering for Beginners is an effective and reliable starting point for learning an in-demand skill. It's a powerful resource for everyone hoping to expand their data engineering Skillset and upskill in the big data era.

AWS Certified Data Engineer Associate Study Guide

2025-08-29 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Anusha Challa (AWS) , Dylan Qu , Sakti Mishra (AWS)

AI/ML AWS Data Governance Data Quality amazon-web-services-aws aws-certifications aws-certifications-associate-tier aws-certified-data-engineer-associate cloud-computing cloud-platforms it-operations

There's no better time to become a data engineer. And acing the AWS Certified Data Engineer Associate (DEA-C01) exam will help you tackle the demands of modern data engineering and secure your place in the technology-driven future. Authors Sakti Mishra, Dylan Qu, and Anusha Challa equip you with the knowledge and sought-after skills necessary to effectively manage data and excel in your career. Whether you're a data engineer, data analyst, or machine learning engineer, you'll discover in-depth guidance, practical exercises, sample questions, and expert advice you need to leverage AWS services effectively and achieve certification. By reading, you'll learn how to: Ingest, transform, and orchestrate data pipelines effectively Select the ideal data store, design efficient data models, and manage data lifecycles Analyze data rigorously and maintain high data quality standards Implement robust authentication, authorization, and data governance protocols Prepare thoroughly for the DEA-C01 exam with targeted strategies and practices

Data Engineering for Cybersecurity

2025-08-26 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by James Bonifield

Ansible Cloud Computing ELK Git Kafka Linux Logstash PowerShell Redis Cyber Security Data Streaming data +1 more

Security teams rely on telemetry—the continuous stream of logs, events, metrics, and signals that reveal what’s happening across systems, endpoints, and cloud services. But that data doesn’t organize itself. It has to be collected, normalized, enriched, and secured before it becomes useful. That’s where data engineering comes in. In this hands-on guide, cybersecurity engineer James Bonifield teaches you how to design and build scalable, secure data pipelines using free, open source tools such as Filebeat, Logstash, Redis, Kafka, and Elasticsearch and more. You’ll learn how to collect telemetry from Windows including Sysmon and PowerShell events, Linux files and syslog, and streaming data from network and security appliances. You’ll then transform it into structured formats, secure it in transit, and automate your deployments using Ansible. You’ll also learn how to: Encrypt and secure data in transit using TLS and SSH Centrally manage code and configuration files using Git Transform messy logs into structured events Enrich data with threat intelligence using Redis and Memcached Stream and centralize data at scale with Kafka Automate with Ansible for repeatable deployments Whether you’re building a pipeline on a tight budget or deploying an enterprise-scale system, this book shows you how to centralize your security data, support real-time detection, and lay the groundwork for incident response and long-term forensics.

Data Engineering Design Patterns

2025-04-15 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Bartosz Konieczny

Cloud Computing Data Quality data data-engineering

Data projects are an intrinsic part of an organization's technical ecosystem, but data engineers in many companies continue to work on problems that others have already solved. This hands-on guide shows you how to provide valuable data by focusing on various aspects of data engineering, including data ingestion, data quality, idempotency, and more. Author Bartosz Konieczny guides you through the process of building reliable end-to-end data engineering projects, from data ingestion to data observability, focusing on data engineering design patterns that solve common business problems in a secure and storage-optimized manner. Each pattern includes a user-facing description of the problem, solutions, and consequences that place the pattern into the context of real-life scenarios. Throughout this journey, you'll use open source data tools and public cloud services to apply each pattern. You'll learn: Challenges data engineers face and their impact on data systems How these challenges relate to data system components Useful applications of data engineering patterns How to identify and fix issues with your current data components Technology-agnostic solutions to new and existing data projects, with open source implementation examples Bartosz Konieczny is a freelance data engineer who's been coding since 2010. He's held various senior hands-on positions that allowed him to work on many data engineering problems in batch and stream processing.

Accelerating Data Pipeline Development

2025-03-12 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Josh Hall (Coalesce)

DWH data data-engineering

Today's data engineering teams are overwhelmed—juggling fire drills and endless requests while relying on manual, repetitive processes for building data pipelines. This much-needed tech guide from author Josh Hall introduces a practical approach to streamlining pipeline development, empowering teams to work smarter, not harder. Using Coalesce, a modern development platform, you'll learn to standardize workflows, apply reusable design patterns, and build faster, more efficient pipelines—all without piling on tech debt. Ideal for data engineers, architects, and analysts of all experience levels, the book offers clear explanations of Coalesce's core functionality including configuring environments, defining nodes, and connecting to data warehouses. Packed with workflows and useful takeaways, it's your guide to delivering high-quality, actionable data while reducing pipeline development time. Set up Coalesce and integrate with a data warehouse Use reusable nodes and design patterns for faster development Accelerate pipeline delivery with reduced manual effort Leverage Coalesce Marketplace for advanced functionality

Databricks Certified Data Engineer Associate Study Guide

2025-02-21 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Derar Alhussein (Acadford)

Data Governance Data Lakehouse Databricks Delta ETL/ELT Spark SQL Data Streaming data data-engineering databricks-data-engineer-associate

Data engineers proficient in Databricks are currently in high demand. As organizations gather more data than ever before, skilled data engineers on platforms like Databricks become critical to business success. The Databricks Data Engineer Associate certification is proof that you have a complete understanding of the Databricks platform and its capabilities, as well as the essential skills to effectively execute various data engineering tasks on the platform. In this comprehensive study guide, you will build a strong foundation in all topics covered on the certification exam, including the Databricks Lakehouse and its tools and benefits. You'll also learn to develop ETL pipelines in both batch and streaming modes. Moreover, you'll discover how to orchestrate data workflows and design dashboards while maintaining data governance. Finally, you'll dive into the finer points of exactly what's on the exam and learn to prepare for it with mock tests. Author Derar Alhussein teaches you not only the fundamental concepts but also provides hands-on exercises to reinforce your understanding. From setting up your Databricks workspace to deploying production pipelines, each chapter is carefully crafted to equip you with the skills needed to master the Databricks Platform. By the end of this book, you'll know everything you need to ace the Databricks Data Engineer Associate certification exam with flying colors, and start your career as a certified data engineer from Databricks! You'll learn how to: Use the Databricks Platform and Delta Lake effectively Perform advanced ETL tasks using Apache Spark SQL Design multi-hop architecture to process data incrementally Build production pipelines using Delta Live Tables and Databricks Jobs Implement data governance using Databricks SQL and Unity Catalog Derar Alhussein is a senior data engineer with a master's degree in data mining. He has over a decade of hands-on experience in software and data projects, including large-scale projects on Databricks. He currently holds eight certifications from Databricks, showcasing his proficiency in the field. Derar is also an experienced instructor, with a proven track record of success in training thousands of data engineers, helping them to develop their skills and obtain professional certifications.

Snowflake Data Engineering

2024-12-06 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Maja Ferle

AI/ML Analytics API CI/CD Cloud Computing Data Analytics Data Governance ELK Funnel GenAI Microsoft Python +5 more

A practical introduction to data engineering on the powerful Snowflake cloud data platform. Data engineers create the pipelines that ingest raw data, transform it, and funnel it to the analysts and professionals who need it. The Snowflake cloud data platform provides a suite of productivity-focused tools and features that simplify building and maintaining data pipelines. In Snowflake Data Engineering, Snowflake Data Superhero Maja Ferle shows you how to get started. In Snowflake Data Engineering you will learn how to: Ingest data into Snowflake from both cloud and local file systems Transform data using functions, stored procedures, and SQL Orchestrate data pipelines with streams and tasks, and monitor their execution Use Snowpark to run Python code in your pipelines Deploy Snowflake objects and code using continuous integration principles Optimize performance and costs when ingesting data into Snowflake Snowflake Data Engineering reveals how Snowflake makes it easy to work with unstructured data, set up continuous ingestion with Snowpipe, and keep your data safe and secure with best-in-class data governance features. Along the way, you’ll practice the most important data engineering tasks as you work through relevant hands-on examples. Throughout, author Maja Ferle shares design tips drawn from her years of experience to ensure your pipeline follows the best practices of software engineering, security, and data governance. About the Technology Pipelines that ingest and transform raw data are the lifeblood of business analytics, and data engineers rely on Snowflake to help them deliver those pipelines efficiently. Snowflake is a full-service cloud-based platform that handles everything from near-infinite storage, fast elastic compute services, inbuilt AI/ML capabilities like vector search, text-to-SQL, code generation, and more. This book gives you what you need to create effective data pipelines on the Snowflake platform. About the Book Snowflake Data Engineering guides you skill-by-skill through accomplishing on-the-job data engineering tasks using Snowflake. You’ll start by building your first simple pipeline and then expand it by adding increasingly powerful features, including data governance and security, adding CI/CD into your pipelines, and even augmenting data with generative AI. You’ll be amazed how far you can go in just a few short chapters! What's Inside Ingest data from the cloud, APIs, or Snowflake Marketplace Orchestrate data pipelines with streams and tasks Optimize performance and cost About the Reader For software developers and data analysts. Readers should know the basics of SQL and the Cloud. About the Author Maja Ferle is a Snowflake Subject Matter Expert and a Snowflake Data Superhero who holds the SnowPro Advanced Data Engineer and the SnowPro Advanced Data Analyst certifications. Quotes An incredible guide for going from zero to production with Snowflake. - Doyle Turner, Microsoft A must-have if you’re looking to excel in the field of data engineering. - Isabella Renzetti, Data Analytics Consultant & Trainer Masterful! Unlocks the true potential of Snowflake for modern data engineers. - Shankar Narayanan, Microsoft Valuable insights will enhance your data engineering skills and lead to cost-effective solutions. A must read! - Frédéric L’Anglais, Maxa Comprehensive, up-to-date and packed with real-life code examples. - Albert Nogués, Danone

Data Engineering with AWS Cookbook

2024-11-29 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Viquar Khan , Trâm Ngọc Phạm , Gonzalo Herreros González , Huda Nofal

Analytics Athena AWS Amazon EMR AWS Glue Big Data Cloud Computing Data Lake ETL/ELT QuickSight Redshift data +1 more

Data Engineering with AWS Cookbook serves as a comprehensive practical guide for building scalable and efficient data engineering solutions using AWS. With this book, you will master implementing data lakes, orchestrating data pipelines, and creating serving layers using AWS's robust services, such as Glue, EMR, Redshift, and Athena. With hands-on exercises and practical recipes, you will enhance your AWS-based data engineering projects. What this Book will help me do Gain the skills to design centralized data lake solutions and manage them securely at scale. Develop expertise in crafting data pipelines with AWS's ETL technologies like Glue and EMR. Learn to implement and automate governance, orchestration, and monitoring for data platforms. Build high-performance data serving layers using AWS analytics tools like Redshift and QuickSight. Effectively plan and execute data migrations to AWS from on-premises infrastructure. Author(s) Trâm Ngọc Phạm, Gonzalo Herreros González, Viquar Khan, and Huda Nofal bring together years of collective experience in data engineering and AWS cloud solutions. Each author's deep knowledge and passion for cloud technology have shaped this book into a valuable resource, geared towards practical learning and real-world application. Their approach ensures readers are not just learning but building tangible, impactful solutions. Who is it for? This book is geared towards data engineers and big data professionals engaged in or transitioning to cloud-based environments, specifically on AWS. Ideal readers are those looking to optimize workflows and master AWS tools to create scalable, efficient solutions. The content assumes a basic familiarity with AWS concepts like IAM roles and a command-line interface, ensuring all examples are accessible yet meaningful for those seeking advancement in AWS data engineering.

Managing Data as a Product

2024-11-29 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Andrea Gioia (Quantyca)

AI/ML Data Management Data Modelling data data-engineering

Discover how to transform your data architecture with the insights and techniques presented in Managing Data as a Product by Andrea Gioia. In this comprehensive guide, you'll explore how to design, implement, and maintain data-product-centered systems to meet modern demands, achieving scalable and sustainable data management tailored to your organization's needs. What this Book will help me do Understand the principles of data-product-centered architectures and their advantages. Learn to design, develop, and operate data products in production settings. Explore strategies to manage the lifecycle of data products efficiently. Gain insights into team topologies and data ownership for distributed systems. Discover data modeling techniques for AI-ready architectures. Author(s) Andrea Gioia is a renowned data architect and the creator of the Open Data Mesh Initiative. With over 20 years of experience, Andrea has successfully led complex data projects and is passionate about sharing his expertise. His writing is practical and driven by real-world challenges, aiming to equip engineers with actionable knowledge. Who is it for? This book is ideal for data engineers, software architects, and engineering leaders involved in shaping innovative data architectures. If you have foundational knowledge of data engineering and are eager to advance your expertise by adopting data-product principles, this book will suit your needs. It is for professionals aiming to modernize and optimize their approach to organizational data management.

Apache Airflow Best Practices

2024-10-31 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Dylan Intorf , Dylan Storey , Kendrick van Doorn

Airflow AWS Cloud Computing GCP Python apache-airflow data data-engineering

"Apache Airflow Best Practices" is your go-to guide for mastering data workflow orchestration using Apache Airflow. This book introduces you to core concepts and features of Airflow and helps you efficiently design, deploy, and manage workflows. With detailed examples and hands-on tutorials, you'll learn how to tackle real-world challenges in data engineering. What this Book will help me do Understand and utilize the features and updates introduced in Apache Airflow 2.x. Design and implement robust, scalable, and efficient data pipelines and workflows. Learn best practices for deploying Apache Airflow in cloud environments such as AWS and GCP. Extend Airflow's functionality with custom plugins and advanced configuration. Monitor, maintain, and scale your Airflow deployment effectively for high availability. Author(s) Dylan Intorf, Dylan Storey, and Kendrick van Doorn are seasoned professionals in data engineering, data strategy, and software development. Between them, they bring decades of experience working in diverse industries like finance, tech, and life sciences. They bring their expertise into this practical guide to help practitioners understand and master Apache Airflow. Who is it for? This book is tailored for data professionals such as data engineers, scientists, and system administrators, offering valuable insights for new learners and experienced users. If you're starting with workflow orchestration, seeking to optimize your current Airflow implementation, or scaling efforts, this book aligns with your goals. Readers should have a basic knowledge of Python programming and data engineering principles.

Delta Lake: The Definitive Guide

2024-10-31 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Denny Lee (Databricks) , Prashanth Babu (Databricks) , Tristen Wentling (Databricks) , Scott Haines (Databricks)

Flink Delta Kafka Data Streaming Trino data data-engineering delta-lake storage-repositories

Ready to simplify the process of building data lakehouses and data pipelines at scale? In this practical guide, learn how Delta Lake is helping data engineers, data scientists, and data analysts overcome key data reliability challenges with modern data engineering and management techniques. Authors Denny Lee, Tristen Wentling, Scott Haines, and Prashanth Babu (with contributions from Delta Lake maintainer R. Tyler Croy) share expert insights on all things Delta Lake--including how to run batch and streaming jobs concurrently and accelerate the usability of your data. You'll also uncover how ACID transactions bring reliability to data lakehouses at scale. This book helps you: Understand key data reliability challenges and how Delta Lake solves them Explain the critical role of Delta transaction logs as a single source of truth Learn the Delta Lake ecosystem with technologies like Apache Flink, Kafka, and Trino Architect data lakehouses with the medallion architecture Optimize Delta Lake performance with features like deletion vectors and liquid clustering

Data Engineering Best Practices

2024-10-11 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Richard J. Schiller , David Larochelle

Agile/Scrum AI/ML Analytics Big Data Cloud Computing ETL/ELT data data-engineering

Unlock the secrets to building scalable and efficient data architectures with 'Data Engineering Best Practices.' This book provides in-depth guidance on designing, implementing, and optimizing cloud-based data pipelines. You will gain valuable insights into best practices, agile workflows, and future-proof designs. What this Book will help me do Effectively plan and architect scalable data solutions leveraging cloud-first strategies. Master agile processes tailored to data engineering for improved project outcomes. Implement secure, efficient, and reliable data pipelines optimized for analytics and AI. Apply real-world design patterns and avoid common pitfalls in data flow and processing. Create future-ready data engineering solutions following industry-proven frameworks. Author(s) Richard J. Schiller and David Larochelle are seasoned data engineering experts with decades of experience crafting efficient and secure cloud-based infrastructures. Their collaborative writing distills years of real-world expertise into practical advice aimed at helping engineers succeed in a rapidly evolving field. Who is it for? This book is ideal for data engineers, ETL specialists, and big data professionals seeking to enhance their knowledge in cloud-based solutions. Some familiarity with data engineering, ETL pipelines, and big data technologies is helpful. It suits those keen on mastering advanced practices, improving agility, and developing efficient data pipelines. Perfect for anyone looking to future-proof their skills in data engineering.

Financial Data Engineering

2024-10-09 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Tamer Khraisha

API Data Governance data data-engineering

Today, investment in financial technology and digital transformation is reshaping the financial landscape and generating many opportunities. Too often, however, engineers and professionals in financial institutions lack a practical and comprehensive understanding of the concepts, problems, techniques, and technologies necessary to build a modern, reliable, and scalable financial data infrastructure. This is where financial data engineering is needed. A data engineer developing a data infrastructure for a financial product possesses not only technical data engineering skills but also a solid understanding of financial domain-specific challenges, methodologies, data ecosystems, providers, formats, technological constraints, identifiers, entities, standards, regulatory requirements, and governance. This book offers a comprehensive, practical, domain-driven approach to financial data engineering, featuring real-world use cases, industry practices, and hands-on projects. You'll learn: The data engineering landscape in the financial sector Specific problems encountered in financial data engineering The structure, players, and particularities of the financial data domain Approaches to designing financial data identification and entity systems Financial data governance frameworks, concepts, and best practices The financial data engineering lifecycle from ingestion to production The varieties and main characteristics of financial data workflows How to build financial data pipelines using open source tools and APIs Tamer Khraisha, PhD, is a senior data engineer and scientific author with more than a decade of experience in the financial sector.

Data Engineering for Machine Learning Pipelines: From Python Libraries to ML Pipelines and Cloud Platforms

2024-09-27 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Pavan Kumar Narayanan

AI/ML Airflow Analytics API AWS Azure Cloud Computing Data Analytics Data Quality GCP Kafka Microsoft +9 more

This book covers modern data engineering functions and important Python libraries, to help you develop state-of-the-art ML pipelines and integration code. The book begins by explaining data analytics and transformation, delving into the Pandas library, its capabilities, and nuances. It then explores emerging libraries such as Polars and CuDF, providing insights into GPU-based computing and cutting-edge data manipulation techniques. The text discusses the importance of data validation in engineering processes, introducing tools such as Great Expectations and Pandera to ensure data quality and reliability. The book delves into API design and development, with a specific focus on leveraging the power of FastAPI. It covers authentication, authorization, and real-world applications, enabling you to construct efficient and secure APIs using FastAPI. Also explored is concurrency in data engineering, examining Dask's capabilities from basic setup to crafting advanced machine learning pipelines. The book includes development and delivery of data engineering pipelines using leading cloud platforms such as AWS, Google Cloud, and Microsoft Azure. The concluding chapters concentrate on real-time and streaming data engineering pipelines, emphasizing Apache Kafka and workflow orchestration in data engineering. Workflow tools such as Airflow and Prefect are introduced to seamlessly manage and automate complex data workflows. What sets this book apart is its blend of theoretical knowledge and practical application, a structured path from basic to advanced concepts, and insights into using state-of-the-art tools. With this book, you gain access to cutting-edge techniques and insights that are reshaping the industry. This book is not just an educational tool. It is a career catalyst, and an investment in your future as a data engineering expert, poised to meet the challenges of today's data-driven world. What You Will Learn Elevate your data wrangling jobs by utilizing the power of both CPU and GPU computing, and learn to process data using Pandas 2.0, Polars, and CuDF at unprecedented speeds Design data validation pipelines, construct efficient data service APIs, develop real-time streaming pipelines and master the art of workflow orchestration to streamline your engineering projects Leverage concurrent programming to develop machine learning pipelines and get hands-on experience in development and deployment of machine learning pipelines across AWS, GCP, and Azure Who This Book Is For Data analysts, data engineers, data scientists, machine learning engineers, and MLOps specialists

Elastic Stack 8.x Cookbook

2024-06-28 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Yazid Akadiri , Huage Chen

AI/ML Analytics ELK Kibana Logstash NLP Cyber Security data data-engineering elastic-stack-elk-stack elastic stack (elk stack) elasticsearch +1 more

Unlock the potential of the Elastic Stack with the "Elastic Stack 8.x Cookbook." This book provides over 80 hands-on recipes, guiding you through ingesting, processing, and visualizing data using Elasticsearch, Logstash, Kibana, and more. You'll also explore advanced features like machine learning and observability to create data-driven applications with ease. What this Book will help me do Implement a robust workflow for ingesting, transforming, and visualizing diverse datasets. Utilize Kibana to create insightful dashboards and visual analytics. Leverage Elastic Stack's AI capabilities, such as natural language processing and machine learning. Develop search solutions and integrate advanced features like vector search. Monitor and optimize your Elastic Stack deployments for performance and security. Author(s) Huage Chen and Yazid Akadiri are experienced professionals in the field of Elastic Stack. They bring years of practical experience in data engineering, observability, and software development. Huage and Yazid aim to provide a clear, practical pathway for both beginners and experienced users to get the most out of the Elastic Stack's capabilities. Who is it for? This book is perfect for developers, data engineers, and observability practitioners looking to harness the power of Elastic Stack. It caters to both beginners and experts, providing clear instructions to help readers understand and implement powerful data solutions. If you're working with search applications, data analysis, or system observability, this book is an ideal resource.

talk-data.com

Activity Trend

Top Events

Top Speakers

The Data Engineer's Guide to Microsoft Fabric

Data Engineering for Multimodal AI

Data Engineering with Azure Databricks

Google Cloud Certified Professional Data Engineer Certification Guide

Practical Data Engineering with Apache Projects: Solving Everyday Data Challenges with Spark, Iceberg, Kafka, Flink, and More

Data Engineering for Beginners

AWS Certified Data Engineer Associate Study Guide

Data Engineering for Cybersecurity

Data Engineering Design Patterns

Accelerating Data Pipeline Development

Databricks Certified Data Engineer Associate Study Guide

Snowflake Data Engineering

Data Engineering with AWS Cookbook

Managing Data as a Product

Apache Airflow Best Practices

Delta Lake: The Definitive Guide

Data Engineering Best Practices

Financial Data Engineering

Data Engineering for Machine Learning Pipelines: From Python Libraries to ML Pipelines and Cloud Platforms

Elastic Stack 8.x Cookbook