talk-data.com

Topic: Python

Tags: programming_language, data_science, web_development

151 tagged activities

Activity Trend: 185 peak/qtr (2020-Q1 to 2026-Q1)

Activities

Showing filtered results

Filtering by: O'Reilly Data Engineering Books
The Data Engineer's Guide to Microsoft Fabric

Modern data engineering is evolving, and with Microsoft Fabric the entire data platform experience is being redefined. This essential book offers a fresh, hands-on approach to navigating this shift. Rather than being an introduction to features, this guide explains how Fabric's key components (Lakehouse, Warehouse, and Real-Time Intelligence) work under the hood and how to put them to use in realistic workflows. Written by Christian Henrik Reich, a data engineering expert with experience that extends from Databricks to Fabric, this book is a blend of foundational theory and practical implementation of lakehouse solutions in Fabric. You'll explore how engines like Apache Spark and Fabric Warehouse collaborate with Fabric's Real-Time Intelligence solution in an integrated platform, and how to build ETL/ELT pipelines that deliver on speed, accuracy, and scale. Ideal for both new and practicing data engineers, this is your entry point into the fabric of the modern data platform.

- Acquire a working knowledge of lakehouses, warehouses, and streaming in Fabric
- Build resilient data pipelines across real-time and batch workloads
- Apply Python, Spark SQL, T-SQL, and KQL within a unified platform
- Gain insight into architectural decisions that scale with data needs
- Learn actionable best practices for engineering clean, efficient, governed solutions
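To make the Python and Spark SQL bullet concrete, here is a minimal sketch of Fabric-style notebook work, assuming an environment that provides a `spark` session; the file path and table name are hypothetical placeholders.

```python
# A minimal sketch of Python plus Spark SQL in a Fabric-style notebook,
# assuming the environment provides a `spark` session; names are hypothetical.
from pyspark.sql import functions as F

raw = spark.read.format("csv").option("header", True).load("Files/raw/sales.csv")

cleaned = (raw.withColumn("amount", F.col("amount").cast("double"))
              .filter(F.col("amount") > 0))

# Save as a Lakehouse Delta table, then query it with Spark SQL.
cleaned.write.format("delta").mode("overwrite").saveAsTable("sales_clean")
spark.sql("SELECT COUNT(*) AS rows FROM sales_clean").show()
```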

Practical Statistics for Data Scientists, 3rd Edition

Statistical methods are a key part of data science, yet few data scientists have formal statistical training. Courses and books on basic statistics rarely cover the topic from a data science perspective. And many data science resources incorporate statistical methods but lack a deeper statistical perspective. If you're familiar with the R or Python programming languages and have some exposure to statistics, this quick reference bridges the gap in an accessible, readable format.

Data Engineering with Azure Databricks

Master end-to-end data engineering on Azure Databricks. From data ingestion and Delta Lake to CI/CD and real-time streaming, build secure, scalable, and performant data solutions with Spark, Unity Catalog, and ML tools.

Key Features
- Build scalable data pipelines using Apache Spark and Delta Lake
- Automate workflows and manage data governance with Unity Catalog
- Learn real-time processing and structured streaming with practical use cases
- Implement CI/CD, DevOps, and security for production-ready data solutions
- Explore Databricks-native ML, AutoML, and Generative AI integration

Book Description
"Data Engineering with Azure Databricks" is your essential guide to building scalable, secure, and high-performing data pipelines using the powerful Databricks platform on Azure. Designed for data engineers, architects, and developers, this book demystifies the complexities of Spark-based workloads, Delta Lake, Unity Catalog, and real-time data processing. Beginning with the foundational role of Azure Databricks in modern data engineering, you'll explore how to set up robust environments, manage data ingestion with Auto Loader, optimize Spark performance, and orchestrate complex workflows using tools like Azure Data Factory and Airflow. The book offers deep dives into structured streaming, Delta Live Tables, and Delta Lake's ACID features for data reliability and schema evolution. You'll also learn how to manage security, compliance, and access controls using Unity Catalog, and gain insights into managing CI/CD pipelines with Azure DevOps and Terraform. With a special focus on machine learning and generative AI, the final chapters guide you in automating model workflows, leveraging MLflow, and fine-tuning large language models on Databricks. Whether you're building a modern data lakehouse or operationalizing analytics at scale, this book provides the tools and insights you need.

What you will learn
- Set up a full-featured Azure Databricks environment
- Implement batch and streaming ingestion using Auto Loader
- Optimize Spark jobs with partitioning and caching
- Build real-time pipelines with structured streaming and DLT
- Manage data governance using Unity Catalog
- Orchestrate production workflows with jobs and ADF
- Apply CI/CD best practices with Azure DevOps and Git
- Secure data with RBAC, encryption, and compliance standards
- Use MLflow and Feature Store for ML pipelines
- Build generative AI applications in Databricks

Who this book is for
This book is for data engineers, solution architects, cloud professionals, and software engineers seeking to build robust and scalable data pipelines using Azure Databricks. Whether you're migrating legacy systems, implementing a modern lakehouse architecture, or optimizing data workflows for performance, this guide will help you leverage the full power of Databricks on Azure. A basic understanding of Python, Spark, and cloud infrastructure is recommended.
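The Auto Loader ingestion listed above might look like this minimal sketch, assuming a Databricks notebook where `spark` is provided; the paths and table name are hypothetical.

```python
# A minimal sketch of streaming ingestion with Databricks Auto Loader,
# assuming a Databricks notebook where `spark` is provided; the paths and
# table name are hypothetical placeholders.
(spark.readStream
      .format("cloudFiles")                                 # Auto Loader source
      .option("cloudFiles.format", "json")                  # raw files are JSON
      .option("cloudFiles.schemaLocation", "/mnt/schemas/orders")
      .load("/mnt/raw/orders")
      .writeStream
      .option("checkpointLocation", "/mnt/checkpoints/orders")
      .trigger(availableNow=True)                           # incremental batch-style run
      .toTable("bronze.orders"))
```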

Data Contracts in Practice

In 'Data Contracts in Practice', Ryan Collingwood provides a detailed guide to managing and formalizing data responsibilities within organizations. Through practical examples and real-world use cases, you'll learn how to systematically address data quality, governance, and integration challenges using data contracts.

What this Book will help me do
- Learn to identify and formalize expectations in data interactions, improving clarity among teams.
- Master implementation techniques to ensure data consistency and quality across critical business processes.
- Understand how to effectively document and deploy data contracts to bolster data governance.
- Explore solutions for proactively addressing and managing data changes and requirements.
- Gain real-world skills through practical examples using technologies like Python, SQL, JSON, and YAML.

Author(s)
Ryan Collingwood is a seasoned expert with over 20 years of experience in product management, data analysis, and software development. His holistic techno-social approach, designed to address both technical and organizational challenges, brings a unique perspective to improving data processes. Ryan's writing is informed by his extensive hands-on experience and commitment to enabling robust data ecosystems.

Who is it for?
This book is ideal for data engineers, software developers, and business analysts working to enhance organizational data integration. Professionals familiar with system design, JSON, and YAML will find it particularly beneficial. Enterprise architects and leaders looking to understand data contract implementation and its business impact will also benefit. A basic understanding of Python and SQL is recommended to maximize learning.
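As an illustration of the idea (not the book's own contract format), here is a minimal sketch that expresses a contract as a JSON Schema and validates a record with the `jsonschema` library; the field names are hypothetical.

```python
# A minimal, illustrative sketch of enforcing a data contract as a JSON
# Schema; the contract and field names are hypothetical placeholders.
from jsonschema import validate, ValidationError

order_contract = {
    "type": "object",
    "required": ["order_id", "amount", "currency"],
    "properties": {
        "order_id": {"type": "string"},
        "amount": {"type": "number", "minimum": 0},
        "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
    },
}

record = {"order_id": "A-1001", "amount": 25.0, "currency": "USD"}

try:
    validate(instance=record, schema=order_contract)  # raises on violation
    print("record satisfies the contract")
except ValidationError as err:
    print(f"contract violation: {err.message}")
```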

Engineering Lakehouses with Open Table Formats

Engineering Lakehouses with Open Table Formats introduces the architecture and capabilities of open table formats such as Apache Iceberg, Apache Hudi, and Delta Lake. The book guides you through the design, implementation, and optimization of lakehouses that can handle modern data processing requirements, grounded in real-world practical insights.

What this Book will help me do
- Understand the fundamentals of open table formats and their benefits in lakehouse architecture.
- Learn how to implement performant data processing using tools like Apache Spark and Flink.
- Master advanced topics like indexing, partitioning, and interoperability between data formats.
- Explore data lifecycle management and integration with frameworks like Apache Airflow and dbt.
- Build secure lakehouses with regulatory compliance using best practices detailed in the book.

Author(s)
Dipankar Mazumdar and Vinoth Govindarajan are seasoned professionals with extensive experience in big data processing and software architecture. They bring their expertise from working with data lakehouses and are known for their ability to explain complex technical concepts clearly. Their collaborative approach brings valuable insights into the latest trends in data management.

Who is it for?
This book is ideal for data engineers, architects, and software professionals aiming to master modern lakehouse architectures. If you are familiar with data lakes or warehouses and wish to transition to an open data architectural design, this book is suited for you. Readers should have basic knowledge of databases, Python, and Apache Spark for the best experience.
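For a flavor of the Spark-plus-Iceberg workflow the book covers, here is a minimal sketch using PySpark's DataFrameWriterV2, assuming a Spark session already configured with an Iceberg catalog named `demo`; the catalog and table names are hypothetical.

```python
# A minimal sketch of writing an Apache Iceberg table from PySpark, assuming
# the session is configured with an Iceberg catalog named `demo` (catalog
# setup and the table name are hypothetical).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("iceberg-sketch").getOrCreate()

events = spark.createDataFrame(
    [(1, "click"), (2, "view")], ["event_id", "event_type"]
)

# DataFrameWriterV2 API: create (or replace) a partitioned Iceberg table.
(events.writeTo("demo.analytics.events")
       .using("iceberg")
       .partitionedBy(F.col("event_type"))
       .createOrReplace())
```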

Hands-On Software Engineering with Python - Second Edition

Grow your software engineering discipline by incorporating and mastering design, development, testing, and deployment best practices in a realistic Python project structure.

Key Features
- Understand what makes software engineering a discipline, distinct from basic programming
- Gain practical insight into updating, refactoring, and scaling an existing Python system
- Implement robust testing, CI/CD pipelines, and cloud-ready architecture decisions

Book Description
Software engineering is more than coding; it's the strategic design and continuous improvement of systems that serve real-world needs. This newly updated second edition of Hands-On Software Engineering with Python expands on its foundational approach to help you grow into a senior or staff-level engineering role. Fully revised for today's Python ecosystem, this edition includes updated tooling, practices, and architectural patterns. You'll explore key changes across five minor Python versions, examine new features like dataclasses and type hinting, and evaluate modern tools such as Poetry, pytest, and GitHub Actions. A new chapter introduces high-performance computing in Python, and the entire development process is enhanced with cloud-readiness in mind. You'll follow a complete redesign and refactor of a multi-tier system from the first edition, gaining insight into how software evolves and what it takes to do that responsibly. From system modeling and SDLC phases to data persistence, testing, and CI/CD automation, each chapter builds your engineering mindset while updating your hands-on skills. By the end of this book, you'll have mastered modern Python software engineering practices and be equipped to revise and future-proof complex systems with confidence.

What you will learn
- Distinguish software engineering from general programming
- Break down and apply each phase of the SDLC to Python systems
- Create system models to plan architecture before writing code
- Apply Agile, Scrum, and other modern development methodologies
- Use dataclasses, pydantic, and schemas for robust data modeling
- Set up CI/CD pipelines with GitHub Actions and cloud build tools
- Write and structure unit, integration, and end-to-end tests
- Evaluate and integrate tools like Poetry, pytest, and Docker

Who this book is for
This book is for Python developers with a basic grasp of software development who want to grow into senior or staff-level engineering roles. It's ideal for professionals looking to deepen their understanding of software architecture, system modeling, testing strategies, and cloud-aware development. Familiarity with core Python programming is required, as the book focuses on applying engineering principles to maintain, extend, and modernize real-world systems.
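A small sketch of the dataclass-and-type-hints style the description mentions; the domain model here is a hypothetical illustration, not the book's project.

```python
# A minimal sketch of dataclasses with type hints (Python 3.10+); the
# Customer model is hypothetical, not taken from the book's codebase.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Customer:
    customer_id: int
    name: str
    emails: list[str] = field(default_factory=list)

    def primary_email(self) -> str | None:
        # Return the first registered email, if any.
        return self.emails[0] if self.emails else None


c = Customer(1, "Ada", ["ada@example.com"])
print(c.primary_email())  # -> ada@example.com
```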

Pro Oracle GoldenGate 23ai for the DBA: Powering the Foundation of Data Integration and AI

Transform your data replication strategy into a competitive advantage with Oracle GoldenGate 23ai. This comprehensive guide delivers the practical knowledge DBAs and architects need to implement, optimize, and scale Oracle GoldenGate 23ai in production environments. Written by Oracle ACE Director Bobby Curtis, it blends deep technical expertise with real-world business insights from hundreds of implementations across manufacturing, financial services, and technology sectors. Beyond traditional replication, this book explores the groundbreaking capabilities that make GoldenGate 23ai essential for modern AI initiatives. Learn how to implement real-time vector replication for RAG systems, integrate with cloud platforms like GCP and Snowflake, and automate deployments using REST APIs and Python. Each chapter offers proven strategies to deliver measurable ROI while reducing operational risk. Whether you're upgrading from Classic GoldenGate, deploying your first cloud data pipeline, or building AI-ready data architectures, this book provides the strategic guidance and technical depth to succeed. With Bobby's signature direct approach, you'll avoid common pitfalls and implement best practices that scale with your business.

What You Will Learn
- Master the microservices architecture and new capabilities of Oracle GoldenGate 23ai
- Implement secure, high-performance data replication across Oracle, PostgreSQL, and cloud databases
- Configure vector replication for AI and machine learning workloads, including RAG systems
- Design and build multi-master replication models with automatic conflict resolution
- Automate deployments and management using RESTful APIs and Python
- Optimize performance for sub-second replication lag in production environments
- Secure your replication environment with enterprise-grade features and compliance
- Upgrade from Classic to Microservices architecture with zero downtime
- Integrate with cloud platforms including OCI, GCP, AWS, and Azure
- Implement real-time data pipelines to BigQuery, Snowflake, and other cloud targets
- Navigate Oracle licensing models and optimize costs

Who This Book Is For
Database administrators, architects, and IT leaders working with Oracle GoldenGate, whether deploying for the first time, migrating from Classic architecture, or enabling AI-driven replication, will find actionable guidance on implementation, performance tuning, automation, and cloud integration. Covers unidirectional and multi-master replication and is packed with real-world use cases.
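As a rough illustration of scripting GoldenGate administration over REST with Python, here is a hedged sketch using `requests`; the host, port, endpoint path, and credentials are hypothetical placeholders rather than the documented GoldenGate 23ai API.

```python
# A hedged sketch of a GoldenGate-style REST administration call from
# Python; the base URL, endpoint path, and credentials are hypothetical
# placeholders, not the documented GoldenGate 23ai API surface.
import requests

BASE = "https://goldengate.example.com:9011"  # hypothetical Admin Server
AUTH = ("oggadmin", "********")               # placeholder credentials

resp = requests.get(f"{BASE}/services/v2/extracts", auth=AUTH, timeout=30)
resp.raise_for_status()
print(resp.json())  # inventory of configured Extract processes
```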

Unlocking dbt: Design and Deploy Transformations in Your Cloud Data Warehouse

Master the art of data transformation with the second edition of this trusted guide to dbt. Building on the foundation of the first edition, this updated volume offers a deeper, more comprehensive exploration of dbt's capabilities, whether you're new to the tool or looking to sharpen your skills. It dives into the latest features and techniques, equipping you with the tools to create scalable, maintainable, and production-ready data transformation pipelines. Unlocking dbt, Second Edition introduces key advancements, including the semantic layer, which allows you to define and manage metrics at scale, and dbt Mesh, empowering organizations to orchestrate decentralized data workflows with confidence. You'll also explore more advanced testing capabilities, expanded CI/CD and deployment strategies, and enhancements in documentation, such as the newly introduced dbt Catalog. As in the first edition, you'll learn how to harness dbt's power to transform raw data into actionable insights, while incorporating software engineering best practices like code reusability, version control, and automated testing. From configuring projects with the dbt Platform or open source dbt to mastering advanced transformations using SQL and Jinja, this book provides everything you need to tackle real-world challenges effectively.

What You Will Learn
- Understand dbt and its role in the modern data stack
- Set up projects using both the cloud-hosted dbt Platform and open source dbt
- Connect dbt projects to cloud data warehouses
- Build scalable models in SQL and Python
- Configure development, testing, and production environments
- Capture reusable logic with Jinja macros
- Incorporate version control with your data transformation code
- Seamlessly connect your projects using dbt Mesh
- Build and manage a semantic layer using dbt
- Deploy dbt using CI/CD best practices

Who This Book Is For
Current and aspiring data professionals, including architects, developers, analysts, engineers, data scientists, and consultants who are beginning the journey of using dbt as part of their data pipeline's transformation layer. Readers should have a foundational knowledge of writing basic SQL statements, development best practices, and working with data in an analytical context such as a data warehouse.
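The "models in SQL and Python" bullet can be illustrated with a minimal dbt Python model sketch, assuming the Snowflake adapter (where `dbt.ref` returns a Snowpark DataFrame); the model and column names are hypothetical.

```python
# A minimal sketch of a dbt Python model (e.g. models/orders_completed.py),
# assuming a warehouse adapter that supports dbt Python models, such as
# Snowflake; the upstream model and column names are hypothetical.
def model(dbt, session):
    # Materialize as a table; dbt injects `dbt` and the warehouse `session`.
    dbt.config(materialized="table")

    orders = dbt.ref("stg_orders")  # upstream dbt model (hypothetical name)
    # On Snowflake this is a Snowpark DataFrame; filter and return it.
    return orders.filter(orders["STATUS"] == "completed")
```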

Building Neo4j-Powered Applications with LLMs

Dive into building applications that combine the power of Large Language Models (LLMs) with Neo4j knowledge graphs, Haystack, and Spring AI to deliver intelligent, data-driven recommendations and search outcomes. This book provides actionable insights and techniques to create scalable, robust solutions by leveraging best-in-class frameworks and a real-world, project-oriented approach.

What this Book will help me do
- Understand how to use Neo4j to build knowledge graphs integrated with LLMs for enhanced data insights.
- Develop skills in creating intelligent search functionalities by combining Haystack and vector-based graph techniques.
- Learn to design and implement recommendation systems using LangChain4j and Spring AI frameworks.
- Acquire the ability to optimize graph data architectures for LLM-driven applications.
- Gain proficiency in deploying and managing applications on platforms like Google Cloud for scalability.

Author(s)
Ravindranatha Anthapu, a Principal Consultant at Neo4j, and Siddhant Agarwal, a Google Developer Expert in Generative AI, bring together their vast experience to offer practical implementations and cutting-edge techniques in this book. Their combined expertise in Neo4j, graph technology, and real-world AI applications makes them authoritative voices in the field.

Who is it for?
Designed for database developers and data scientists, this book caters to professionals aiming to leverage the transformational capabilities of knowledge graphs alongside LLMs. Readers should have a working knowledge of Python and Java as well as familiarity with Neo4j and the Cypher query language. If you're looking to enhance search or recommendation functionalities through state-of-the-art AI integrations, this book is for you.
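A minimal sketch of the Python side of such an application: querying a Neo4j graph for recommendations with the official `neo4j` driver. The URI, credentials, and Cypher pattern are hypothetical.

```python
# A minimal sketch of querying a Neo4j knowledge graph from Python with the
# official `neo4j` driver; URI, credentials, and the graph schema are
# hypothetical placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "secret"))

cypher = """
MATCH (p:Product)-[:SIMILAR_TO]->(rec:Product)
WHERE p.name = $name
RETURN rec.name AS recommendation
LIMIT 5
"""

with driver.session() as session:
    for record in session.run(cypher, name="widget"):
        print(record["recommendation"])

driver.close()
```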

Snowflake Recipes: A Problem-Solution Approach to Implementing Modern Data Pipelines

Explore Snowflake's core concepts and the unique features that differentiate it from industry competitors such as Azure Synapse and Google BigQuery. This book provides recipes for architecting and developing modern data pipelines on the Snowflake data platform by employing progressive techniques, agile practices, and repeatable strategies. You'll walk through step-by-step instructions on ready-to-use recipes covering a wide range of the latest development topics. Then build scalable development pipelines and solve specific scenarios common to all modern data platforms, such as data masking, object tagging, data monetization, and security best practices. Throughout the book you'll work with code samples for Amazon Web Services, Microsoft Azure, and Google Cloud Platform. There's also a chapter devoted to solving machine learning problems with Snowflake. Authors Dillon Dayton and John Eipe are both Snowflake SnowPro Core certified, specialize in data and digital services, and understand the challenges of finding the right solution to complex problems. The recipes in this book are based on real-world use cases and examples designed to help you provide quality, performant, and secure data to solve business initiatives.

What You'll Learn
- Handle structured and unstructured data in Snowflake
- Apply best practices and different options for data transformation
- Understand data application development
- Implement data sharing, data governance, and security

Who This Book Is For
Data engineers, scientists, and analysts moving into Snowflake and looking to build data apps. This book expects basic knowledge of cloud (AWS, Azure, or GCP), SQL, and Python.
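In the spirit of the book's recipes, here is a hedged sketch that runs a data-masking recipe from Python with the Snowflake connector; the connection parameters and the policy itself are simplified placeholders, not the book's code.

```python
# A hedged sketch of executing one recipe-style statement from Python with
# the Snowflake connector; connection parameters are placeholders and the
# masking policy is a simplified illustration.
import snowflake.connector

conn = snowflake.connector.connect(
    account="myaccount", user="me", password="********",
    warehouse="COMPUTE_WH", database="DEMO", schema="PUBLIC",
)
cur = conn.cursor()

# Example: a simple dynamic data masking policy for email values.
cur.execute("""
CREATE OR REPLACE MASKING POLICY email_mask AS (val STRING)
RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('ANALYST') THEN '***MASKED***' ELSE val END
""")

cur.close()
conn.close()
```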

Snowflake Data Engineering

A practical introduction to data engineering on the powerful Snowflake cloud data platform. Data engineers create the pipelines that ingest raw data, transform it, and funnel it to the analysts and professionals who need it. The Snowflake cloud data platform provides a suite of productivity-focused tools and features that simplify building and maintaining data pipelines. In Snowflake Data Engineering, Snowflake Data Superhero Maja Ferle shows you how to get started.

In Snowflake Data Engineering you will learn how to:
- Ingest data into Snowflake from both cloud and local file systems
- Transform data using functions, stored procedures, and SQL
- Orchestrate data pipelines with streams and tasks, and monitor their execution
- Use Snowpark to run Python code in your pipelines
- Deploy Snowflake objects and code using continuous integration principles
- Optimize performance and costs when ingesting data into Snowflake

Snowflake Data Engineering reveals how Snowflake makes it easy to work with unstructured data, set up continuous ingestion with Snowpipe, and keep your data safe and secure with best-in-class data governance features. Along the way, you'll practice the most important data engineering tasks as you work through relevant hands-on examples. Throughout, author Maja Ferle shares design tips drawn from her years of experience to ensure your pipeline follows the best practices of software engineering, security, and data governance.

About the Technology
Pipelines that ingest and transform raw data are the lifeblood of business analytics, and data engineers rely on Snowflake to help them deliver those pipelines efficiently. Snowflake is a full-service cloud-based platform that handles everything from near-infinite storage and fast elastic compute to built-in AI/ML capabilities like vector search, text-to-SQL, and code generation. This book gives you what you need to create effective data pipelines on the Snowflake platform.

About the Book
Snowflake Data Engineering guides you skill-by-skill through accomplishing on-the-job data engineering tasks using Snowflake. You'll start by building your first simple pipeline and then expand it by adding increasingly powerful features, including data governance and security, adding CI/CD into your pipelines, and even augmenting data with generative AI. You'll be amazed how far you can go in just a few short chapters!

What's Inside
- Ingest data from the cloud, APIs, or Snowflake Marketplace
- Orchestrate data pipelines with streams and tasks
- Optimize performance and cost

About the Reader
For software developers and data analysts. Readers should know the basics of SQL and the cloud.

About the Author
Maja Ferle is a Snowflake Subject Matter Expert and a Snowflake Data Superhero who holds the SnowPro Advanced Data Engineer and SnowPro Advanced Data Analyst certifications.

Quotes
"An incredible guide for going from zero to production with Snowflake." - Doyle Turner, Microsoft
"A must-have if you're looking to excel in the field of data engineering." - Isabella Renzetti, Data Analytics Consultant & Trainer
"Masterful! Unlocks the true potential of Snowflake for modern data engineers." - Shankar Narayanan, Microsoft
"Valuable insights will enhance your data engineering skills and lead to cost-effective solutions. A must read!" - Frédéric L'Anglais, Maxa
"Comprehensive, up-to-date, and packed with real-life code examples." - Albert Nogués, Danone
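The streams-and-tasks orchestration mentioned above might be sketched like this from Snowpark Python, with placeholder connection parameters and object names (a simplified illustration, not the book's code).

```python
# A hedged sketch of incremental processing with a Snowflake stream and
# task, issued from Snowpark Python; connection parameters and object
# names are hypothetical placeholders.
from snowflake.snowpark import Session

session = Session.builder.configs({
    "account": "myaccount", "user": "me", "password": "********",
    "warehouse": "COMPUTE_WH", "database": "DEMO", "schema": "PUBLIC",
}).create()

# Capture changes on a source table...
session.sql(
    "CREATE OR REPLACE STREAM raw_orders_stream ON TABLE raw_orders"
).collect()

# ...and schedule a task that merges the new rows every five minutes.
session.sql("""
CREATE OR REPLACE TASK load_orders
  WAREHOUSE = COMPUTE_WH
  SCHEDULE = '5 MINUTE'
AS
  INSERT INTO orders SELECT * FROM raw_orders_stream
""").collect()
session.sql("ALTER TASK load_orders RESUME").collect()
```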

Apache Airflow Best Practices

"Apache Airflow Best Practices" is your go-to guide for mastering data workflow orchestration using Apache Airflow. This book introduces you to core concepts and features of Airflow and helps you efficiently design, deploy, and manage workflows. With detailed examples and hands-on tutorials, you'll learn how to tackle real-world challenges in data engineering. What this Book will help me do Understand and utilize the features and updates introduced in Apache Airflow 2.x. Design and implement robust, scalable, and efficient data pipelines and workflows. Learn best practices for deploying Apache Airflow in cloud environments such as AWS and GCP. Extend Airflow's functionality with custom plugins and advanced configuration. Monitor, maintain, and scale your Airflow deployment effectively for high availability. Author(s) Dylan Intorf, Dylan Storey, and Kendrick van Doorn are seasoned professionals in data engineering, data strategy, and software development. Between them, they bring decades of experience working in diverse industries like finance, tech, and life sciences. They bring their expertise into this practical guide to help practitioners understand and master Apache Airflow. Who is it for? This book is tailored for data professionals such as data engineers, scientists, and system administrators, offering valuable insights for new learners and experienced users. If you're starting with workflow orchestration, seeking to optimize your current Airflow implementation, or scaling efforts, this book aligns with your goals. Readers should have a basic knowledge of Python programming and data engineering principles.

Building Modern Data Applications Using Databricks Lakehouse

This book, "Building Modern Data Applications Using Databricks Lakehouse," provides a comprehensive guide for data professionals to master the Databricks platform. You'll learn to effectively build, deploy, and monitor robust data pipelines with Databricks' Delta Live Tables, empowering you to manage and optimize cloud-based data operations effortlessly. What this Book will help me do Understand the foundations and concepts of Delta Live Tables and its role in data pipeline development. Learn workflows to process and transform real-time and batch data efficiently using the Databricks lakehouse architecture. Master the implementation of Unity Catalog for governance and secure data access in modern data applications. Deploy and automate data pipeline changes using CI/CD, leveraging tools like Terraform and Databricks Asset Bundles. Gain advanced insights in monitoring data quality and performance, optimizing cloud costs, and managing DataOps tasks effectively. Author(s) Will Girten, the author, is a seasoned Solutions Architect at Databricks with over a decade of experience in data and AI systems. With a deep expertise in modern data architectures, Will is adept at simplifying complex topics and translating them into actionable knowledge. His books emphasize real-time application and offer clear, hands-on examples, making learning engaging and impactful. Who is it for? This book is geared towards data engineers, analysts, and DataOps professionals seeking efficient strategies to implement and maintain robust data pipelines. If you have a basic understanding of Python and Apache Spark and wish to delve deeper into the Databricks platform for streamlining workflows, this book is tailored for you.

LLM Engineer's Handbook

The "LLM Engineer's Handbook" is your comprehensive guide to mastering Large Language Models from concept to deployment. Written by leading experts, it combines theoretical foundations with practical examples to help you build, refine, and deploy LLM-powered solutions that solve real-world problems effectively and efficiently. What this Book will help me do Understand the principles and approaches for training and fine-tuning Large Language Models (LLMs). Apply MLOps practices to design, deploy, and monitor your LLM applications effectively. Implement advanced techniques such as retrieval-augmented generation (RAG) and preference alignment. Optimize inference for high performance, addressing low-latency and high availability for production systems. Develop robust data pipelines and scalable architectures for building modular LLM systems. Author(s) Paul Iusztin and Maxime Labonne are experienced AI professionals specializing in natural language processing and machine learning. With years of industry and academic experience, they are dedicated to making complex AI concepts accessible and actionable. Their collaborative authorship ensures a blend of theoretical rigor and practical insights tailored for modern AI practitioners. Who is it for? This book is tailored for AI engineers, NLP professionals, and LLM practitioners who wish to deepen their understanding of Large Language Models. Ideal readers possess some familiarity with Python, AWS, and general AI concepts. If you aim to apply LLMs to real-world scenarios or enhance your expertise in AI-driven systems, this handbook is designed for you.

Data Engineering for Machine Learning Pipelines: From Python Libraries to ML Pipelines and Cloud Platforms

This book covers modern data engineering functions and important Python libraries to help you develop state-of-the-art ML pipelines and integration code. The book begins by explaining data analytics and transformation, delving into the Pandas library, its capabilities, and its nuances. It then explores emerging libraries such as Polars and cuDF, providing insights into GPU-based computing and cutting-edge data manipulation techniques. The text discusses the importance of data validation in engineering processes, introducing tools such as Great Expectations and Pandera to ensure data quality and reliability. The book delves into API design and development, with a specific focus on leveraging the power of FastAPI. It covers authentication, authorization, and real-world applications, enabling you to construct efficient and secure APIs using FastAPI. Also explored is concurrency in data engineering, examining Dask's capabilities from basic setup to crafting advanced machine learning pipelines. The book includes development and delivery of data engineering pipelines using leading cloud platforms such as AWS, Google Cloud, and Microsoft Azure. The concluding chapters concentrate on real-time and streaming data engineering pipelines, emphasizing Apache Kafka and workflow orchestration in data engineering. Workflow tools such as Airflow and Prefect are introduced to seamlessly manage and automate complex data workflows.

What sets this book apart is its blend of theoretical knowledge and practical application, a structured path from basic to advanced concepts, and insights into using state-of-the-art tools. With this book, you gain access to cutting-edge techniques and insights that are reshaping the industry. This book is not just an educational tool; it is a career catalyst and an investment in your future as a data engineering expert, poised to meet the challenges of today's data-driven world.

What You Will Learn
- Elevate your data wrangling jobs by utilizing the power of both CPU and GPU computing, and learn to process data using Pandas 2.0, Polars, and cuDF at unprecedented speeds
- Design data validation pipelines, construct efficient data service APIs, develop real-time streaming pipelines, and master the art of workflow orchestration to streamline your engineering projects
- Leverage concurrent programming to develop machine learning pipelines and get hands-on experience in development and deployment of machine learning pipelines across AWS, GCP, and Azure

Who This Book Is For
Data analysts, data engineers, data scientists, machine learning engineers, and MLOps specialists
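A small sketch contrasting the pandas and Polars APIs the book compares, using a tiny inline dataset; it assumes a recent Polars version (which uses `group_by`).

```python
# A minimal sketch contrasting pandas and Polars on the same aggregation;
# the tiny inline dataset is illustrative. Assumes a recent Polars release
# (where the method is `group_by`).
import pandas as pd
import polars as pl

data = {"city": ["Oslo", "Oslo", "Lima"], "temp": [2.0, 4.0, 19.0]}

# pandas: eager evaluation
pdf = pd.DataFrame(data)
print(pdf.groupby("city")["temp"].mean())

# Polars: lazy query plan, optimized before execution
ldf = pl.LazyFrame(data)
print(
    ldf.group_by("city")
       .agg(pl.col("temp").mean().alias("avg_temp"))
       .collect()
)
```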

Full Stack FastAPI, React, and MongoDB - Second Edition

Full Stack FastAPI, React, and MongoDB guides you step-by-step through creating web applications using the FARM stack. This hands-on resource teaches you how to integrate FastAPI, a modern Python framework, React for front-end development, and MongoDB for data storage to build and deploy powerful, scalable web applications.

What this Book will help me do
- Master the essentials of MongoDB, including creating and managing document-based databases.
- Gain proficiency in building APIs using FastAPI and Python for robust backend systems.
- Develop dynamic frontends using React, integrating seamlessly with a FastAPI backend.
- Securely authenticate and authorize users using JSON Web Tokens in your applications.
- Explore advanced features like integrating AI models and building with Next.js for production-ready development.

Author(s)
Marko Aleksendrić, Shrey Batra, Rachelle Palmer, and Shubham Ranjan combine their expertise in web development and software engineering in this book. Together, they bring years of professional experience and a passion for teaching developers to create modern web applications effectively using cutting-edge tools.

Who is it for?
Intermediate web developers who possess foundational JavaScript and Python skills are the ideal audience for this book. If you want to advance your skills by mastering modern web application development with the FARM stack, this book will guide you comprehensively. With practical, real-world examples, it is designed for developers aiming to build production-grade applications.
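A minimal sketch of a FastAPI endpoint with a Pydantic model, in the spirit of the FARM-stack backend the book builds; the schema is hypothetical.

```python
# A minimal sketch of a FastAPI endpoint with a Pydantic request model;
# the Car schema is a hypothetical illustration.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class Car(BaseModel):
    brand: str
    year: int


@app.post("/cars")
async def create_car(car: Car) -> dict:
    # In a FARM-stack app this record would be persisted to MongoDB.
    return {"message": f"saved {car.brand} ({car.year})"}
```

Run locally with, for example, `uvicorn main:app --reload`.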

Big Data on Kubernetes

Big Data on Kubernetes is your comprehensive guide to leveraging Kubernetes for scalable and efficient big data solutions. You will learn key concepts of Kubernetes architecture and explore tools like Apache Spark, Airflow, and Kafka. Gain hands-on experience building complete data pipelines to tackle real-world data challenges.

What this Book will help me do
- Understand Kubernetes architecture and learn to deploy and manage clusters.
- Build and orchestrate big data pipelines using Spark, Airflow, and Kafka.
- Develop scalable and resilient data solutions with Docker and Kubernetes.
- Integrate and optimize data tools for real-time ingestion and processing.
- Apply concepts to hands-on projects addressing actual big data scenarios.

Author(s)
Neylson Crepalde is an experienced data specialist with extensive knowledge of Kubernetes and big data solutions. With deep practical experience, Neylson brings real-world insights to his writing. His approach emphasizes actionable guidance and relatable problem-solving with a strong foundation in scalable architecture.

Who is it for?
This book is ideal for data engineers, BI analysts, data team leaders, and tech managers familiar with Python, SQL, and YAML. Targeted at professionals seeking to develop or expand their expertise in scalable big data solutions, it provides practical insights into Docker, Kubernetes, and prominent big data tools.
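A minimal sketch of inspecting a cluster with the official Kubernetes Python client, assuming a local kubeconfig; the namespace is a placeholder.

```python
# A minimal sketch using the official Kubernetes Python client to inspect
# a cluster, assuming a local kubeconfig; the namespace is hypothetical.
from kubernetes import client, config

config.load_kube_config()  # reads ~/.kube/config

v1 = client.CoreV1Api()
for pod in v1.list_namespaced_pod(namespace="data-pipelines").items:
    print(pod.metadata.name, pod.status.phase)
```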

Databricks Certified Associate Developer for Apache Spark Using Python

This book serves as the ultimate preparation for aspiring Databricks Certified Associate Developers specializing in Apache Spark. Dive deep into Spark's components, its applications, and exam techniques to achieve certification and expand your practical skills in big data processing and real-time analytics using Python.

What this Book will help me do
- Deeply understand Apache Spark's core architecture for building big data applications.
- Write optimized SQL queries and leverage the Spark DataFrame API for efficient data manipulation.
- Apply advanced Spark functions, including UDFs, to solve complex data engineering tasks.
- Use Spark Streaming capabilities to implement real-time and near-real-time processing solutions.
- Get hands-on preparation for the certification exam with mock tests and practice questions.

Author(s)
Saba Shah is a seasoned data engineer with extensive experience working at Databricks and leading data science teams. With her in-depth knowledge of big data applications and Spark, she delivers clear, actionable insights in this book. Her approach emphasizes practical learning and real-world applications.

Who is it for?
This book is ideal for data professionals such as engineers and analysts aiming to achieve Databricks certification. It is particularly helpful for individuals with moderate Python proficiency who are keen to understand Spark from scratch. If you're transitioning into big data roles, this guide prepares you comprehensively.
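A minimal sketch of the DataFrame-API-plus-UDF material the certification covers; the data is illustrative.

```python
# A minimal sketch of a PySpark DataFrame transformation plus a Python UDF;
# the tiny dataset is illustrative.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("exam-prep").getOrCreate()

df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])


@F.udf(returnType=StringType())
def shout(s: str) -> str:
    return s.upper()


(df.filter(F.col("age") > 30)
   .withColumn("name_upper", shout(F.col("name")))
   .show())
```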

Data Engineering with Databricks Cookbook

In "Data Engineering with Databricks Cookbook," you'll learn how to efficiently build and manage data pipelines using Apache Spark, Delta Lake, and Databricks. This recipe-based guide offers techniques to transform, optimize, and orchestrate your data workflows. What this Book will help me do Master Apache Spark for data ingestion, transformation, and analysis. Learn to optimize data processing and improve query performance with Delta Lake. Manage streaming data processing with Spark Structured Streaming capabilities. Implement DataOps and DevOps workflows tailored for Databricks. Enforce data governance policies using Unity Catalog for scalable solutions. Author(s) Pulkit Chadha, the author of this book, is a Senior Solutions Architect at Databricks. With extensive experience in data engineering and big data applications, he brings practical insights into implementing modern data solutions. His educational writings focus on empowering data professionals with actionable knowledge. Who is it for? This book is ideal for data engineers, data scientists, and analysts who want to deepen their knowledge in managing and transforming large datasets. Readers should have an intermediate understanding of SQL, Python programming, and basic data architecture concepts. It is especially well-suited for professionals working with Databricks or similar cloud-based data platforms.

The Ultimate Guide to Snowpark

The Ultimate Guide to Snowpark serves as a comprehensive resource to help you master the Snowflake Snowpark framework using Python. You'll learn how to manage data engineering, data science, and data application workloads in Snowpark, coupled with practical implementations and examples. By following this guide, you'll gain the skills needed to efficiently process and analyze data in the Snowflake Data Cloud.

What this Book will help me do
- Master Snowpark with Python for data engineering, data science, and data application workloads.
- Develop and deploy robust data pipelines using Snowpark in Python.
- Design, implement, and productionize machine learning models using Snowpark.
- Learn to monetize and operationalize Snowflake-native applications.
- Effectively adopt Snowpark in production for scalable, efficient data solutions.

Author(s)
Shankar Narayanan SGS and Vivekanandan SS are experienced professionals in data engineering and Snowflake technologies. Shankar has extensive experience in using Snowflake Snowpark to manage and enhance data solutions. Vivekanandan brings expertise at the intersection of Python programming and cloud-based data processing. Together, their combined knowledge and approachable writing style make this book an invaluable resource.

Who is it for?
This book is designed for data engineers, data scientists, developers, and seasoned data practitioners. Ideal candidates are those looking to expand their skills in implementing Snowpark solutions using Python. A prior understanding of SQL, Python programming, and familiarity with Snowflake will help readers fully leverage the techniques presented.
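A minimal sketch of a Snowpark DataFrame transformation that is pushed down to Snowflake for execution; the connection parameters and table names are placeholders.

```python
# A minimal sketch of a Snowpark Python DataFrame pipeline; connection
# parameters and table names are hypothetical placeholders.
from snowflake.snowpark import Session
from snowflake.snowpark import functions as F

session = Session.builder.configs({
    "account": "myaccount", "user": "me", "password": "********",
    "warehouse": "COMPUTE_WH", "database": "DEMO", "schema": "PUBLIC",
}).create()

orders = session.table("orders")
summary = (orders.group_by("region")
                 .agg(F.sum("amount").alias("total_amount")))

# The aggregation runs inside Snowflake; persist the result as a table.
summary.write.save_as_table("region_totals", mode="overwrite")
```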