Big Data

Engineering Lakehouses with Open Table Formats

2025-12-26 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Dipankar Mazumdar , Vinoth Govindarajan (Apple)

Airflow Flink Data Lakehouse Data Management dbt Delta Hudi Iceberg Python Spark data data-engineering +2 more

Engineering Lakehouses with Open Table Formats introduces the architecture and capabilities of open table formats like Apache Iceberg, Apache Hudi, and Delta Lake. The book guides you through the design, implementation, and optimization of lakehouses that can handle modern data processing requirements effectively with real-world practical insights. What this Book will help me do Understand the fundamentals of open table formats and their benefits in lakehouse architecture. Learn how to implement performant data processing using tools like Apache Spark and Flink. Master advanced topics like indexing, partitioning, and interoperability between data formats. Explore data lifecycle management and integration with frameworks like Apache Airflow and dbt. Build secure lakehouses with regulatory compliance using best practices detailed in the book. Author(s) Dipankar Mazumdar and Vinoth Govindarajan are seasoned professionals with extensive experience in big data processing and software architecture. They bring their expertise from working with data lakehouses and are known for their ability to explain complex technical concepts clearly. Their collaborative approach brings valuable insights into the latest trends in data management. Who is it for? This book is ideal for data engineers, architects, and software professionals aiming to master modern lakehouse architectures. If you are familiar with data lakes or warehouses and wish to transition to an open data architectural design, this book is suited for you. Readers should have basic knowledge of databases, Python, and Apache Spark for the best experience.

Data Engineering for Beginners

2025-11-11 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Chisom Nwokwu

Cloud Computing Data Engineering Data Governance Data Quality NoSQL Cyber Security data data-engineering

A hands-on technical and industry roadmap for aspiring data engineers In Data Engineering for Beginners, big data expert Chisom Nwokwu delivers a beginner-friendly handbook for everyone interested in the fundamentals of data engineering. Whether you're interested in starting a rewarding, new career as a data analyst, data engineer, or data scientist, or seeking to expand your skillset in an existing engineering role, Nwokwu offers the technical and industry knowledge you need to succeed. The book explains: Database fundamentals, including relational and noSQL databases Data warehouses and data lakes Data pipelines, including info about batch and stream processing Data quality dimensions Data security principles, including data encryption Data governance principles and data framework Big data and distributed systems concepts Data engineering on the cloud Essential skills and tools for data engineering interviews and jobs Data Engineering for Beginners offers an easy-to-read roadmap on a seemingly complicated and intimidating subject. It addresses the topics most likely to cause a beginning data engineer to stumble, clearly explaining key concepts in an accessible way. You'll also find: A comprehensive glossary of data engineering terms Common and practical career paths in the data engineering industry An introduction to key cloud technologies and services you may encounter early in your data engineering career Perfect for practicing and aspiring data analysts, data scientists, and data engineers, Data Engineering for Beginners is an effective and reliable starting point for learning an in-demand skill. It's a powerful resource for everyone hoping to expand their data engineering Skillset and upskill in the big data era.

Advances in Artificial Intelligence Applications in Industrial and Systems Engineering

2025-09-23 · O'Reilly AI & ML Books O'Reilly Amazon

book

by Gavriel Salvendy , Vincent Duffy , Waldemar Karwowski

AI/ML Analytics Cloud Computing Computer Science Data Analytics NLP Cyber Security ai-ml artificial-intelligence-ai artificial intelligence (ai) data

Comprehensive guide offering actionable strategies for enhancing human-centered AI, efficiency, and productivity in industrial and systems engineering through the power of AI. Advances in Artificial Intelligence Applications in Industrial and Systems Engineering is the first book in the Advances in Industrial and Systems Engineering series, offering insights into AI techniques, challenges, and applications across various industrial and systems engineering (ISE) domains. Not only does the book chart current AI trends and tools for effective integration, but it also raises pivotal ethical concerns and explores the latest methodologies, tools, and real-world examples relevant to today’s dynamic ISE landscape. Readers will gain a practical toolkit for effective integration and utilization of AI in system design and operation. The book also presents the current state of AI across big data analytics, machine learning, artificial intelligence tools, cloud-based AI applications, neural-based technologies, modeling and simulation in the metaverse, intelligent systems engineering, and more, and discusses future trends. Written by renowned international contributors for an international audience, Advances in Artificial Intelligence Applications in Industrial and Systems Engineering includes information on: Reinforcement learning, computer vision and perception, and safety considerations for autonomous systems (AS) (NLP) topics including language understanding and generation, sentiment analysis and text classification, and machine translation AI in healthcare, covering medical imaging and diagnostics, drug discovery and personalized medicine, and patient monitoring and predictive analysis Cybersecurity, covering threat detection and intrusion prevention, fraud detection and risk management, and network security Social good applications including poverty alleviation and education, environmental sustainability, and disaster response and humanitarian aid. Advances in Artificial Intelligence Applications in Industrial and Systems Engineering is a timely, essential reference for engineering, computer science, and business professionals worldwide.

The Definitive Guide to OpenSearch

2025-09-02 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Soujanya Konka , Prashant Agrawal (AWS) , Jon Handler (AWS)

Analytics AWS data data-engineering search

Learn how to harness the power of OpenSearch effectively with 'The Definitive Guide to OpenSearch'. This book explores installation, configuration, query building, and visualization, guiding readers through practical use cases and real-world implementations. Whether you're building search experiences or analyzing data patterns, this guide equips you thoroughly. What this Book will help me do Understand core OpenSearch principles, architecture, and the mechanics of its search and analytics capabilities. Learn how to perform data ingestion, execute advanced queries, and produce insightful visualizations on OpenSearch Dashboards. Implement scaling strategies and optimum configurations for high-performance OpenSearch clusters. Explore real-world case studies that demonstrate OpenSearch applications in diverse industries. Gain hands-on experience through practical exercises and tutorials for mastering OpenSearch functionality. Author(s) Jon Handler, Soujanya Konka, and Prashant Agrawal, celebrated experts in search technologies and big data analysis, bring their years of experience at AWS and other domains to this book. Their collective expertise ensures that readers receive both core theoretical knowledge and practical applications to implement directly. Who is it for? This book is aimed at developers, data professionals, engineers, and systems operators who work with search systems or analytics platforms. It is especially suitable for individuals in roles handling large-scale data, who want to improve their skills or deploy OpenSearch in production environments. Early learners and seasoned experts alike will find valuable insights.

Handbook of Decision Analysis, 2nd Edition

2025-05-28 · O'Reilly Data Science Books O'Reilly Amazon

book

by Gregory S. Parnell , Steven N. Tani , Eric Specking , Eric R. Johnson , Terry A. Bresnick

Analytics Data Analytics Data Science Microsoft business-intelligence data data-science prescriptive-analytics

Qualitative and quantitative techniques to apply decision analysis to real-world decision problems, supported by sound mathematics, best practices, soft skills, and more With substantive illustrations based on the authors’ personal experiences throughout, Handbook of Decision Analysis describes the philosophy, knowledge, science, and art of decision analysis. Key insights from decision analysis applications and behavioral decision analysis research are presented, and numerous decision analysis textbooks, technical books, and research papers are referenced for comprehensive coverage. This book does not introduce new decision analysis mathematical theory, but rather ensures the reader can understand and use the most common mathematics and best practices, allowing them to apply rigorous decision analysis with confidence. The material is supported by examples and solution steps using Microsoft Excel and includes many challenging real-world problems. Given the increase in the availability of data due to the development of products that deliver huge amounts of data, and the development of data science techniques and academic programs, a new theme of this Second Edition is the use of decision analysis techniques with big data and data analytics. Written by a team of highly qualified professionals and academics, Handbook of Decision Analysis includes information on: Behavioral decision-making insights, decision framing opportunities, collaboration with stakeholders, information assessment, and decision analysis modeling techniques Principles of value creation through designing alternatives, clear value/risk tradeoffs, and decision implementation Qualitative and quantitative techniques for each key decision analysis task, as opposed to presenting one technique for all decisions. Stakeholder analysis, decision hierarchies, and influence diagrams to frame descriptive, predictive, and prescriptive analytics decision problems to ensure implementation success Handbook of Decision Analysis is a highly valuable textbook, reference, and/or refresher for students and decision professionals in business, management science, engineering, engineering management, operations management, mathematics, and statistics who want to increase the breadth and depth of their technical and soft skills for success when faced with a professional or personal decision.

Time Series Analysis with Spark

2025-03-28 · O'Reilly Data Science Books O'Reilly Amazon

book

by Yoni Ramaswami

AI/ML Analytics Data Analytics Data Engineering Databricks GenAI Spark data data-science data-science-tasks statistics time-series

Time Series Analysis with Spark provides a practical introduction to leveraging Apache Spark and Databricks for time series analysis. You'll learn to prepare, model, and deploy robust and scalable time series solutions for real-world applications. From data preparation to advanced generative AI techniques, this guide prepares you to excel in big data analytics. What this Book will help me do Understand the core concepts and architectures of Apache Spark for time series analysis. Learn to clean, organize, and prepare time series data for big data environments. Gain expertise in choosing, building, and training various time series models tailored to specific projects. Master techniques to scale your models in production using Spark and Databricks. Explore the integration of advanced technologies such as generative AI to enhance predictions and derive insights. Author(s) Yoni Ramaswami, a Senior Solutions Architect at Databricks, has extensive experience in data engineering and AI solutions. With a focus on creating innovative big data and AI strategies across industries, Yoni authored this book to empower professionals to efficiently handle time series data. Yoni's approachable style ensures that both foundational concepts and advanced techniques are accessible to readers. Who is it for? This book is ideal for data engineers, machine learning engineers, data scientists, and analysts interested in enhancing their expertise in time series analysis using Apache Spark and Databricks. Whether you're new to time series or looking to refine your skills, you'll find both foundational insights and advanced practices explained clearly. A basic understanding of Spark is helpful but not required.

Big Data, Data Mining and Data Science

2024-12-30 · O'Reilly Data Science Books O'Reilly Amazon

book

by George Dimitoglou , Hamid Arabnia , Leonidas Deligiannidis

Data Science data data-science

Through the application of cutting-edge techniques like Big Data, Data Mining, and Data Science, it is possible to extract insights from massive datasets. These methodologies are crucial in enabling informed decision-making and driving transformative advancements across many fields, industries, and domains. This book offers an overview of latest tools, methods and approaches while also highlighting their practical use through various applications and case studies.

Artificial Intelligence-Enabled Businesses

2024-12-24 · O'Reilly AI & ML Books O'Reilly Amazon

book

by Sweta Dixit , Vishal Jain (Microsoft) , Mohit Maurya , Geetha Subramaniam

AI/ML Marketing ai-ml artificial-intelligence-ai artificial intelligence (ai) data

This book has a multidimensional perspective on AI solutions for business innovation and real-life case studies to achieve competitive advantage and drive growth in the evolving digital landscape. Artificial Intelligence-Enabled Businesses demonstrates how AI is a catalyst for change in business functional areas. Though still in the experimental phase, AI is instrumental in redefining the workforce, predicting consumer behavior, solving real-life marketing dynamics and modifications, recommending products and content, foreseeing demand, analyzing costs, strategizing, managing big data, enabling collaboration of cross-entities, and sparking new ethical, social and regulatory implications for business. Thus, AI can effectively guide the future of financial services, trading, mobile banking, last-mile delivery, logistics, and supply chain with a solution-oriented focus on discrete business problems. Furthermore, it is expected to educate leaders to act in an ever more accurate, complex, and sophisticated business environment with the combination of human and machine intelligence. The book offers effective, efficient, and strategically competent suggestions for handling new challenges and responsibilities and is aimed at leaders who wish to be more innovative. It covers the early stages of AI adoption by organizations across their functional areas and provides insightful guidance for practitioners in the suitable and timely adoption of AI. This book will greatly help to scale up AI by leveraging interdisciplinary collaboration with cross-functional, skill-diverse teams and result in a competitive advantage. Audience This book is for marketing professionals, organizational leaders, and researchers to leverage AI and new technologies across various business functions. It also fits the needs of academics, students, and trainers, providing insights, case studies, and practical strategies for driving growth in the rapidly evolving digital landscape.

Data Engineering with AWS Cookbook

2024-11-29 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Viquar Khan , Trâm Ngọc Phạm , Gonzalo Herreros González , Huda Nofal

Analytics Athena AWS Amazon EMR AWS Glue Cloud Computing Data Engineering Data Lake ETL/ELT QuickSight Redshift data +1 more

Data Engineering with AWS Cookbook serves as a comprehensive practical guide for building scalable and efficient data engineering solutions using AWS. With this book, you will master implementing data lakes, orchestrating data pipelines, and creating serving layers using AWS's robust services, such as Glue, EMR, Redshift, and Athena. With hands-on exercises and practical recipes, you will enhance your AWS-based data engineering projects. What this Book will help me do Gain the skills to design centralized data lake solutions and manage them securely at scale. Develop expertise in crafting data pipelines with AWS's ETL technologies like Glue and EMR. Learn to implement and automate governance, orchestration, and monitoring for data platforms. Build high-performance data serving layers using AWS analytics tools like Redshift and QuickSight. Effectively plan and execute data migrations to AWS from on-premises infrastructure. Author(s) Trâm Ngọc Phạm, Gonzalo Herreros González, Viquar Khan, and Huda Nofal bring together years of collective experience in data engineering and AWS cloud solutions. Each author's deep knowledge and passion for cloud technology have shaped this book into a valuable resource, geared towards practical learning and real-world application. Their approach ensures readers are not just learning but building tangible, impactful solutions. Who is it for? This book is geared towards data engineers and big data professionals engaged in or transitioning to cloud-based environments, specifically on AWS. Ideal readers are those looking to optimize workflows and master AWS tools to create scalable, efficient solutions. The content assumes a basic familiarity with AWS concepts like IAM roles and a command-line interface, ensuring all examples are accessible yet meaningful for those seeking advancement in AWS data engineering.

Intelligent Data Analytics for Bioinformatics and Biomedical Systems

2024-11-05 · O'Reilly Data Science Books O'Reilly Amazon

book

by Prasenjit Chatterjee , Korhan Cengiz , Neha Sharma

Analytics Data Analytics bioinformatics data data-science data-science-domains

The book analyzes the combination of intelligent data analytics with the intricacies of biological data that has become a crucial factor for innovation and growth in the fast-changing field of bioinformatics and biomedical systems. Intelligent Data Analytics for Bioinformatics and Biomedical Systems delves into the transformative nature of data analytics for bioinformatics and biomedical research. It offers a thorough examination of advanced techniques, methodologies, and applications that utilize intelligence to improve results in the healthcare sector. With the exponential growth of data in these domains, the book explores how computational intelligence and advanced analytic techniques can be harnessed to extract insights, drive informed decisions, and unlock hidden patterns from vast datasets. From genomic analysis to disease diagnostics and personalized medicine, the book aims to showcase intelligent approaches that enable researchers, clinicians, and data scientists to unravel complex biological processes and make significant strides in understanding human health and diseases. This book is divided into three sections, each focusing on computational intelligence and data sets in biomedical systems. The first section discusses the fundamental concepts of computational intelligence and big data in the context of bioinformatics. This section emphasizes data mining, pattern recognition, and knowledge discovery for bioinformatics applications. The second part talks about computational intelligence and big data in biomedical systems. Based on how these advanced techniques are utilized in the system, this section discusses how personalized medicine and precision healthcare enable treatment based on individual data and genetic profiles. The last section investigates the challenges and future directions of computational intelligence and big data in bioinformatics and biomedical systems. This section concludes with discussions on the potential impact of computational intelligence on addressing global healthcare challenges. Audience Intelligent Data Analytics for Bioinformatics and Biomedical Systems is primarily targeted to professionals and researchers in bioinformatics, genetics, molecular biology, biomedical engineering, and healthcare. The book will also suit academicians, students, and professionals working in pharmaceuticals and interpreting biomedical data.

Apache Spark for Machine Learning

2024-11-01 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Deepak Gowda

AI/ML Computer Science Spark apache-spark data data-engineering

Dive into the power of Apache Spark as a tool for handling and processing big data required for machine learning. With this book, you will explore how to configure, execute, and deploy machine learning algorithms using Spark's scalable architecture and learn best practices for implementing real-world big data solutions. What this Book will help me do Understand the integration of Apache Spark with large-scale infrastructures for machine learning applications. Employ data processing techniques for preprocessing and feature engineering efficiently with Spark. Master the implementation of advanced supervised and unsupervised learning algorithms using Spark. Learn to deploy machine learning models within Spark ecosystems for optimized performance. Discover methods for analyzing big data trends and machine learning model tuning for improved accuracy. Author(s) The author, Deepak Gowda, is an experienced data scientist with over ten years of expertise in machine learning and big data. His career spans industries such as supply chain, cybersecurity, and more where he has utilized Apache Spark extensively. Deepak's teaching style is marked by clarity and practicality, making complex concepts approachable. Who is it for? Apache Spark for Machine Learning is tailored for data engineers, machine learning practitioners, and computer science students looking to advance their ability to process, analyze, and model using large datasets. If you're already familiar with basic machine learning and want to scale your solutions using Spark, this book is ideal for your studies and professional growth.

Data Engineering Best Practices

2024-10-11 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Richard J. Schiller , David Larochelle

Agile/Scrum AI/ML Analytics Cloud Computing Data Engineering ETL/ELT data data-engineering

Unlock the secrets to building scalable and efficient data architectures with 'Data Engineering Best Practices.' This book provides in-depth guidance on designing, implementing, and optimizing cloud-based data pipelines. You will gain valuable insights into best practices, agile workflows, and future-proof designs. What this Book will help me do Effectively plan and architect scalable data solutions leveraging cloud-first strategies. Master agile processes tailored to data engineering for improved project outcomes. Implement secure, efficient, and reliable data pipelines optimized for analytics and AI. Apply real-world design patterns and avoid common pitfalls in data flow and processing. Create future-ready data engineering solutions following industry-proven frameworks. Author(s) Richard J. Schiller and David Larochelle are seasoned data engineering experts with decades of experience crafting efficient and secure cloud-based infrastructures. Their collaborative writing distills years of real-world expertise into practical advice aimed at helping engineers succeed in a rapidly evolving field. Who is it for? This book is ideal for data engineers, ETL specialists, and big data professionals seeking to enhance their knowledge in cloud-based solutions. Some familiarity with data engineering, ETL pipelines, and big data technologies is helpful. It suits those keen on mastering advanced practices, improving agility, and developing efficient data pipelines. Perfect for anyone looking to future-proof their skills in data engineering.

Statistics for Data Science and Analytics

2024-09-04 · O'Reilly Data Science Books O'Reilly Amazon

book

by Janet Dobbins , Peter C. Bruce , Peter Gedeck

AI/ML Analytics Data Science NumPy Pandas Python Scikit-learn SciPy data data-science data-science-tasks statistics

Introductory statistics textbook with a focus on data science topics such as prediction, correlation, and data exploration Statistics for Data Science and Analytics is a comprehensive guide to statistical analysis using Python, presenting important topics useful for data science such as prediction, correlation, and data exploration. The authors provide an introduction to statistical science and big data, as well as an overview of Python data structures and operations. A range of statistical techniques are presented with their implementation in Python, including hypothesis testing, probability, exploratory data analysis, categorical variables, surveys and sampling, A/B testing, and correlation. The text introduces binary classification, a foundational element of machine learning, validation of statistical models by applying them to holdout data, and probability and inference via the easy-to-understand method of resampling and the bootstrap instead of using a myriad of “kitchen sink” formulas. Regression is taught both as a tool for explanation and for prediction. This book is informed by the authors’ experience designing and teaching both introductory statistics and machine learning at Statistics.com. Each chapter includes practical examples, explanations of the underlying concepts, and Python code snippets to help readers apply the techniques themselves. Statistics for Data Science and Analytics includes information on sample topics such as: Int, float, and string data types, numerical operations, manipulating strings, converting data types, and advanced data structures like lists, dictionaries, and sets Experiment design via randomizing, blinding, and before-after pairing, as well as proportions and percents when handling binary data Specialized Python packages like numpy, scipy, pandas, scikit-learn and statsmodels—the workhorses of data science—and how to get the most value from them Statistical versus practical significance, random number generators, functions for code reuse, and binomial and normal probability distributions Written by and for data science instructors, Statistics for Data Science and Analytics is an excellent learning resource for data science instructors prescribing a required intro stats course for their programs, as well as other students and professionals seeking to transition to the data science field.

Polars Cookbook

2024-08-23 · O'Reilly Data Science Books O'Reilly Amazon

book

by Yuki Kakegawa

Analytics Cloud Computing Data Analytics Microsoft NumPy Pandas Polars Python data data-science data-science-tools

Dive into the world of data analysis with the Polars Cookbook. This book, ideal for data professionals, covers practical recipes to manipulate, transform, and analyze data using the Python Polars library. You'll learn both the fundamentals and advanced techniques to build efficient and scalable data workflows. What this Book will help me do Master the basics of Python Polars including installation and setup. Perform complex data manipulation like pivoting, grouping, and joining. Handle large-scale time series data for accurate analysis. Understand data integration with libraries like pandas and numpy. Optimize workflows for both on-premise and cloud environments. Author(s) Yuki Kakegawa is an experienced data analytics consultant who has collaborated with companies such as Microsoft and Stanford Health Care. His passion for data led him to create this detailed guide on Polars. His expertise ensures you gain real-world, actionable insights from every chapter. Who is it for? This book is perfect for data analysts, engineers, and scientists eager to enhance their efficiency with Python Polars. If you are familiar with Python and tools like pandas but are new to Polars, this book will upskill you. Whether handling big data or optimizing code for performance, the Polars Cookbook has the guidance you need to succeed.

DuckDB in Action

2024-08-21 · O'Reilly Data Science Books O'Reilly Amazon

book

by Michael Simons , Mark Needham , Michael Hunger

Analytics API Cloud Computing CSV Data Analytics DuckDB DWH Java JSON Motherduck Neo4j Pandas +8 more

Dive into DuckDB and start processing gigabytes of data with ease—all with no data warehouse. DuckDB is a cutting-edge SQL database that makes it incredibly easy to analyze big data sets right from your laptop. In DuckDB in Action you’ll learn everything you need to know to get the most out of this awesome tool, keep your data secure on prem, and save you hundreds on your cloud bill. From data ingestion to advanced data pipelines, you’ll learn everything you need to get the most out of DuckDB—all through hands-on examples. Open up DuckDB in Action and learn how to: Read and process data from CSV, JSON and Parquet sources both locally and remote Write analytical SQL queries, including aggregations, common table expressions, window functions, special types of joins, and pivot tables Use DuckDB from Python, both with SQL and its "Relational"-API, interacting with databases but also data frames Prepare, ingest and query large datasets Build cloud data pipelines Extend DuckDB with custom functionality Pragmatic and comprehensive, DuckDB in Action introduces the DuckDB database and shows you how to use it to solve common data workflow problems. You won’t need to read through pages of documentation—you’ll learn as you work. Get to grips with DuckDB's unique SQL dialect, learning to seamlessly load, prepare, and analyze data using SQL queries. Extend DuckDB with both Python and built-in tools such as MotherDuck, and gain practical insights into building robust and automated data pipelines. About the Technology DuckDB makes data analytics fast and fun! You don’t need to set up a Spark or run a cloud data warehouse just to process a few hundred gigabytes of data. DuckDB is easily embeddable in any data analytics application, runs on a laptop, and processes data from almost any source, including JSON, CSV, Parquet, SQLite and Postgres. About the Book DuckDB in Action guides you example-by-example from setup, through your first SQL query, to advanced topics like building data pipelines and embedding DuckDB as a local data store for a Streamlit web app. You’ll explore DuckDB’s handy SQL extensions, get to grips with aggregation, analysis, and data without persistence, and use Python to customize DuckDB. A hands-on project accompanies each new topic, so you can see DuckDB in action. What's Inside Prepare, ingest and query large datasets Build cloud data pipelines Extend DuckDB with custom functionality Fast-paced SQL recap: From simple queries to advanced analytics About the Reader For data pros comfortable with Python and CLI tools. About the Authors Mark Needham is a blogger and video creator at @‌LearnDataWithMark. Michael Hunger leads product innovation for the Neo4j graph database. Michael Simons is a Java Champion, author, and Engineer at Neo4j. Quotes I use DuckDB every day, and I still learned a lot about how DuckDB makes things that are hard in most databases easy! - Jordan Tigani, Founder, MotherDuck An excellent resource! Unlocks possibilities for storing, processing, analyzing, and summarizing data at the edge using DuckDB. - Pramod Sadalage, Director, Thoughtworks Clear and accessible. A comprehensive resource for harnessing the power of DuckDB for both novices and experienced professionals. - Qiusheng Wu, Associate Professor, University of Tennessee Excellent! The book all we ducklings have been waiting for! - Gunnar Morling, Decodable

Big Data, 2nd Edition

2024-08-01 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Hassan A. Karimi

data data-engineering geographic-information-system-gis geographic information system (gis) location-data

This revised new edition provides up-to-date knowledge on the latest developments related to these three fields for solving geoinformatics problems. There are seven new chapters, and each of them focuses on a separate real-world problem to which deep learning is applied.

Big Data on Kubernetes

2024-07-19 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Neylson Crepalde

Airflow BI Docker Kafka Kubernetes Python Spark SQL YAML data data-engineering streaming-messaging

Big Data on Kubernetes is your comprehensive guide to leveraging Kubernetes for scalable and efficient big data solutions. You will learn key concepts of Kubernetes architecture and explore tools like Apache Spark, Airflow, and Kafka. Gain hands-on experience building complete data pipelines to tackle real-world data challenges. What this Book will help me do Understand Kubernetes architecture and learn to deploy and manage clusters. Build and orchestrate big data pipelines using Spark, Airflow, and Kafka. Develop scalable and resilient data solutions with Docker and Kubernetes. Integrate and optimize data tools for real-time ingestion and processing. Apply concepts to hands-on projects addressing actual big data scenarios. Author(s) Neylson Crepalde is an experienced data specialist with extensive knowledge of Kubernetes and big data solutions. With deep practical experience, Neylson brings real-world insights to his writing. His approach emphasizes actionable guidance and relatable problem-solving with a strong foundation in scalable architecture. Who is it for? This book is ideal for data engineers, BI analysts, data team leaders, and tech managers familiar with Python, SQL, and YAML. Targeted at professionals seeking to develop or expand their expertise in scalable big data solutions, it provides practical insights into Docker, Kubernetes, and prominent big data tools.

Databricks Certified Associate Developer for Apache Spark Using Python

2024-06-14 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Saba Shah

Analytics API Data Engineering Data Science Databricks Python Spark SQL Data Streaming apache-spark data data-engineering

This book serves as the ultimate preparation for aspiring Databricks Certified Associate Developers specializing in Apache Spark. Deep dive into Spark's components, its applications, and exam techniques to achieve certification and expand your practical skills in big data processing and real-time analytics using Python. What this Book will help me do Deeply understand Apache Spark's core architecture for building big data applications. Write optimized SQL queries and leverage Spark DataFrame API for efficient data manipulation. Apply advanced Spark functions, including UDFs, to solve complex data engineering tasks. Use Spark Streaming capabilities to implement real-time and near-real-time processing solutions. Get hands-on preparation for the certification exam with mock tests and practice questions. Author(s) Saba Shah is a seasoned data engineer with extensive experience working at Databricks and leading data science teams. With her in-depth knowledge of big data applications and Spark, she delivers clear, actionable insights in this book. Her approach emphasizes practical learning and real-world applications. Who is it for? This book is ideal for data professionals such as engineers and analysts aiming to achieve Databricks certification. It is particularly helpful for individuals with moderate Python proficiency who are keen to understand Spark from scratch. If you're transitioning into big data roles, this guide prepares you comprehensively.

Data Engineering with Databricks Cookbook

2024-05-31 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Pulkit Chadha

Cloud Computing Data Engineering Data Governance Databricks DataOps Delta DevOps Python Spark SQL Data Streaming data +1 more

In "Data Engineering with Databricks Cookbook," you'll learn how to efficiently build and manage data pipelines using Apache Spark, Delta Lake, and Databricks. This recipe-based guide offers techniques to transform, optimize, and orchestrate your data workflows. What this Book will help me do Master Apache Spark for data ingestion, transformation, and analysis. Learn to optimize data processing and improve query performance with Delta Lake. Manage streaming data processing with Spark Structured Streaming capabilities. Implement DataOps and DevOps workflows tailored for Databricks. Enforce data governance policies using Unity Catalog for scalable solutions. Author(s) Pulkit Chadha, the author of this book, is a Senior Solutions Architect at Databricks. With extensive experience in data engineering and big data applications, he brings practical insights into implementing modern data solutions. His educational writings focus on empowering data professionals with actionable knowledge. Who is it for? This book is ideal for data engineers, data scientists, and analysts who want to deepen their knowledge in managing and transforming large datasets. Readers should have an intermediate understanding of SQL, Python programming, and basic data architecture concepts. It is especially well-suited for professionals working with Databricks or similar cloud-based data platforms.

Nonparametric Statistical Methods Using R, 2nd Edition

2024-05-20 · O'Reilly Data Science Books O'Reilly Amazon

book

by John Kloke , Joseph McKean

data data-science data-science-tasks statistics

This thoroughly updated and expanded second edition covers traditional nonparametric methods and rank-based analyses. Two new chapters covering multivariate analyses and big data have been added. Core classical nonparametrics chapters on one- and two-sample problems have been expanded

talk-data.com

Activity Trend

Top Events

Top Speakers

Engineering Lakehouses with Open Table Formats

Data Engineering for Beginners

Advances in Artificial Intelligence Applications in Industrial and Systems Engineering

The Definitive Guide to OpenSearch

Handbook of Decision Analysis, 2nd Edition

Time Series Analysis with Spark

Big Data, Data Mining and Data Science

Artificial Intelligence-Enabled Businesses

Data Engineering with AWS Cookbook

Intelligent Data Analytics for Bioinformatics and Biomedical Systems

Apache Spark for Machine Learning

Data Engineering Best Practices

Statistics for Data Science and Analytics

Polars Cookbook

DuckDB in Action

Big Data, 2nd Edition

Big Data on Kubernetes

Databricks Certified Associate Developer for Apache Spark Using Python

Data Engineering with Databricks Cookbook

Nonparametric Statistical Methods Using R, 2nd Edition