talk-data.com talk-data.com

Event

O'Reilly Data Engineering Books

2001-10-19 – 2027-05-25 Oreilly Visit website ↗

Activities tracked

299

Collection of O'Reilly books on Data Engineering.

Filtering by: Big Data ×

Sessions & talks

Showing 101–125 of 299 · Newest first

Search within this event →
Big Data Analytics with Hadoop 3

Big Data Analytics with Hadoop 3 is your comprehensive guide to understanding and leveraging the power of Apache Hadoop for large-scale data processing and analytics. Through practical examples, it introduces the tools and techniques necessary to integrate Hadoop with other popular frameworks, enabling efficient data handling, processing, and visualization. What this Book will help me do Understand the foundational components and features of Apache Hadoop 3 such as HDFS, YARN, and MapReduce. Gain the ability to integrate Hadoop with programming languages like Python and R for data analysis. Learn the skills to utilize tools such as Apache Spark and Apache Flink for real-time data analytics within the Hadoop ecosystem. Develop expertise in setting up a Hadoop cluster and performing analytics in cloud environments such as AWS. Master the process of building practical big data analytics pipelines for end-to-end data processing. Author(s) Sridhar Alla is a seasoned big data professional with extensive industry experience in building and deploying scalable big data analytics solutions. Known for his expertise in Hadoop and related ecosystems, Sridhar combines technical depth with clear communication in his writing, providing practical insights and hands-on knowledge. Who is it for? This book is tailored for data professionals, software engineers, and data scientists looking to expand their expertise in big data analytics using Hadoop 3. Whether you're an experienced developer or new to the big data ecosystem, this book provides the step-by-step guidance and practical examples needed to advance your skills and achieve your analytical goals.

A Deep Dive into NoSQL Databases: The Use Cases and Applications

A Deep Dive into NoSQL Databases: The Use Cases and Applications, Volume 109, the latest release in the Advances in Computers series first published in 1960, presents detailed coverage of innovations in computer hardware, software, theory, design and applications. In addition, it provides contributors with a medium in which they can explore their subjects in greater depth and breadth. This update includes sections on NoSQL and NewSQL databases for big data analytics and distributed computing, NewSQL databases and scalable in-memory analytics, NoSQL web crawler application, NoSQL Security, a Comparative Study of different In-Memory (No/New)SQL Databases, NoSQL Hands On-4 NoSQLs, the Hadoop Ecosystem, and more. Provides a very comprehensive, yet compact, book on the popular domain of NoSQL databases for IT professionals, practitioners and professors Articulates and accentuates big data analytics and how it gets simplified and streamlined by NoSQL database systems Sets a stimulating foundation with all the relevant details for NoSQL database researchers, developers and administrators

Modern Big Data Processing with Hadoop

Delve into the world of big data with 'Modern Big Data Processing with Hadoop.' This comprehensive guide introduces you to the powerful capabilities of Apache Hadoop and its ecosystem to solve data processing and analytics challenges. By the end, you will have mastered the techniques necessary to architect innovative, scalable, and efficient big data solutions. What this Book will help me do Master the principles of building an enterprise-level big data strategy with Apache Hadoop. Learn to integrate Hadoop with tools such as Apache Spark, Elasticsearch, and more for comprehensive solutions. Set up and manage your big data architecture, including deployment on cloud platforms with Apache Ambari. Develop real-time data pipelines and enterprise search solutions. Leverage advanced visualization tools like Apache Superset to make sense of data insights. Author(s) None R. Patil, None Kumar, and None Shindgikar are experienced big data professionals and accomplished authors. With years of hands-on experience in implementing and managing Apache Hadoop systems, they bring a depth of expertise to their writing. Their dedication lies in making complex technical concepts accessible while demonstrating real-world best practices. Who is it for? This book is designed for data professionals aiming to advance their expertise in big data solutions using Apache Hadoop. Ideal readers include engineers and project managers involved in data architecture and those aspiring to become big data architects. Some prior exposure to big data systems is beneficial to fully benefit from this book's insights and tutorials.

Cleaning Up the Data Lake with an Operational Data Hub

The data lake was once heralded as the answer to the flood of big data that arrived in a variety of structured and unstructured formats. But, due to the ease of integration and the lack of governance, data lakes in many companies have devolved into unusable data swamps. This short ebook shows you how to solve this problem using an Operational Data Hub (ODH) to collect, store, index, cleanse, harmonize, and master data of all shapes and formats. Gerhard Ungerer—CTO and co-founder of Random Bit LLC—explains how the ODH supports transactional integrity so that the hub can serve as integration point for enterprise applications. You’ll also learn how the ODH helps you leverage the investment in your data lake (or swamp), so that the data trapped there can finally be ingested, processed, and provisioned. With this ebook, you’ll learn how an ODH: Allows you to focus on categorizing data for easy and fast retrieval Provides flexible storage models, indexing support, query capabilities, security, and a governance framework Delivers flexible storage models; support for indexing, scripting, and automation; query capabilities; transactional integrity; and security Includes a governance model to help you access, ingest, harmonize, materialize, provision, and consume data

Gaining Data Agility with Multi-Model Databases

Most organizations realize that their future depends on the ability to quickly adapt to constant changes brought on by variable and complex environments. It's become increasingly clear that the core source behind these innovative solutions is data. Polyglot persistence refers to systems that provide many different types of data storage technologies to deal with this vast variability of data. Applications that need to access data from more than one store have to navigate an array of databases in a complex—and ultimately unsustainable—maze. One solution to this problem is readily available. In this ebook, consultant Joel Ruisi explains how a multi-model database enables you to take advantage of many different types of data models (and multiple schemas) in a single backend. With a multi-model database, companies can easily centralize, manage, and search all the data the IT system collects. The result is data agility: the ability to adapt to changing environments and serve users what they need when they need it. Through several detailed use cases, this ebook explains how multi-model databases enable you to: Store and manage multiple heterogeneous data sources Consolidate your data by bringing everything in "as is" Invisibly extend model features from one model to another Take a hybrid approach to analytical and operational data Enhance user search experience, including big data search Conduct queries across data models Offer SQL without relational constraints

Spark: The Definitive Guide

Learn how to use, deploy, and maintain Apache Spark with this comprehensive guide, written by the creators of the open-source cluster-computing framework. With an emphasis on improvements and new features in Spark 2.0, authors Bill Chambers and Matei Zaharia break down Spark topics into distinct sections, each with unique goals. Youâ??ll explore the basic operations and common functions of Sparkâ??s structured APIs, as well as Structured Streaming, a new high-level API for building end-to-end streaming applications. Developers and system administrators will learn the fundamentals of monitoring, tuning, and debugging Spark, and explore machine learning techniques and scenarios for employing MLlib, Sparkâ??s scalable machine-learning library. Get a gentle overview of big data and Spark Learn about DataFrames, SQL, and Datasetsâ??Sparkâ??s core APIsâ??through worked examples Dive into Sparkâ??s low-level APIs, RDDs, and execution of SQL and DataFrames Understand how Spark runs on a cluster Debug, monitor, and tune Spark clusters and applications Learn the power of Structured Streaming, Sparkâ??s stream-processing engine Learn how you can apply MLlib to a variety of problems, including classification or recommendation

Big Data Demystified

The full text downloaded to your computer With eBooks you can: search for key concepts, words and phrases make highlights and notes as you study share your notes with friends eBooks are downloaded to your computer and accessible either offline through the Bookshelf (available as a free download), available online and also via the iPad and Android apps. Upon purchase, you will receive via email the code and instructions on how to access this product. Time limit The eBooks products do not have an expiry date. You will continue to access your digital ebook products whilst you have your Bookshelf installed. 'Big Data' refers to a new class of data, to which 'big' doesn't quite do it justice. Much like an ocean is more than simply a deeper swimming pool, big data is fundamentally different to traditional data and needs a whole new approach. Packed with examples and case studies, this clear, comprehensive book will show you how to accumulate and utilise 'big data' in order to develop your business strategy. Big Data Demystified is your practical guide to help you draw deeper insights from the vast information at your fingertips; you will be able to understand customer motivations, speed up production lines, and even offer personalised experiences to each and every customer. With 20 years of industry experience, David Stephenson shows how big data can give you the best competitive edge, and why it is integral to the future of your business.

Complete Guide to Open Source Big Data Stack

See a Mesos-based big data stack created and the components used. You will use currently available Apache full and incubating systems. The components are introduced by example and you learn how they work together. In the Complete Guide to Open Source Big Data Stack, the author begins by creating a private cloud and then installs and examines Apache Brooklyn. After that, he uses each chapter to introduce one piece of the big data stack—sharing how to source the software and how to install it. You learn by simple example, step by step and chapter by chapter, as a real big data stack is created. The book concentrates on Apache-based systems and shares detailed examples of cloud storage, release management, resource management, processing, queuing, frameworks, data visualization, and more. What You’ll Learn Install a private cloud onto the local cluster using Apache cloud stack Source, install, and configure Apache: Brooklyn, Mesos, Kafka, and Zeppelin See how Brooklyn can be used to install Mule ESB on a cluster and Cassandra in the cloud Install and use DCOS for big data processing Use Apache Spark for big data stack data processing Who This Book Is For Developers, architects, IT project managers, database administrators, and others charged with developing or supporting a big data system. It is also for anyone interested in Hadoop or big data, and those experiencing problems with data size.

Learning Google BigQuery

If you're ready to untap the potential of data analytics in the cloud, 'Learning Google BigQuery' will take you from understanding foundational concepts to mastering advanced techniques of this powerful platform. Through hands-on examples, you'll learn how to query and analyze massive datasets efficiently, develop custom applications, and integrate your results seamlessly with other tools. What this Book will help me do Understand the fundamentals of Google Cloud Platform and how BigQuery operates within it. Migrate enterprise-scale data seamlessly into BigQuery for further analytics. Master SQL techniques for querying large-scale datasets in BigQuery. Enable real-time data analytics and visualization with tools like Tableau and Python. Learn to create dynamic datasets, manage partition tables and use BigQuery APIs effectively. Author(s) None Berlyant, None Haridass, and None Brown are specialists with years of experience in data science, big data platforms, and cloud technologies. They bring their expertise in data analytics and teaching to make advanced concepts accessible. Their hands-on approach and real-world examples ensure readers can directly apply the skills they acquire to practical scenarios. Who is it for? This book is tailored for developers, analysts, and data scientists eager to leverage cloud-based tools for handling and analyzing large-scale datasets. If you seek to gain hands-on proficiency in working with BigQuery or want to enhance your organization's data capabilities, this book is a fit. No prior BigQuery knowledge is needed, just a willingness to learn.

Expert Apache Cassandra Administration

Follow this handbook to build, configure, tune, and secure Apache Cassandra databases. Start with the installation of Cassandra and move on to the creation of a single instance, and then a cluster of Cassandra databases. Cassandra is increasingly a key player in many big data environments, and this book shows you how to use Cassandra with Apache Spark, a popular big data processing framework. Also covered are day-to-day topics of importance such as the backup and recovery of Cassandra databases, using the right compression and compaction strategies, and loading and unloading data. Expert Apache Cassandra Administration provides numerous step-by-step examples starting with the basics of a Cassandra database, and going all the way through backup and recovery, performance optimization, and monitoring and securing the data. The book serves as an authoritative and comprehensive guide to the building and management of simpleto complex Cassandra databases. The book: Takes you through building a Cassandra database from installation of the software and creation of a single database, through to complex clusters and data centers Provides numerous examples of actual commands in a real-life Cassandra environment that show how to confidently configure, manage, troubleshoot, and tune Cassandra databases Shows how to use the Cassandra configuration properties to build a highly stable, available, and secure Cassandra database that always operates at peak efficiency What You'll Learn Install the Cassandra software and create your first database Understand the Cassandra data model, and the internal architecture of a Cassandra database Create your own Cassandra cluster, step-by-step Run a Cassandra cluster on Docker Work with Apache Spark by connecting to a Cassandra database Deploy Cassandra clusters in your data center, or on Amazon EC2 instances Back up and restore mission-critical Cassandra databases Monitor, troubleshoot, and tune production Cassandra databases, and cut your spending on resources such as memory, servers, and storage Who This Book Is For Database administrators, developers, and architects who are looking for an authoritative and comprehensive single volume for all their Cassandra administration needs. Also for administrators who are tasked with setting up and maintaining highly reliable and high-performing Cassandra databases. An excellent choice for big data administrators, database administrators, architects, and developers who use Cassandra as their key data store, to support high volume online transactions, or as a decentralized, elastic data store.

PySpark Recipes: A Problem-Solution Approach with PySpark2

Quickly find solutions to common programming problems encountered while processing big data. Content is presented in the popular problem-solution format. Look up the programming problem that you want to solve. Read the solution. Apply the solution directly in your own code. Problem solved! PySpark Recipes covers Hadoop and its shortcomings. The architecture of Spark, PySpark, and RDD are presented. You will learn to apply RDD to solve day-to-day big data problems. Python and NumPy are included and make it easy for new learners of PySpark to understand and adopt the model. What You Will Learn Understand the advanced features of PySpark2 and SparkSQL Optimize your code Program SparkSQL with Python Use Spark Streaming and Spark MLlib with Python Perform graph analysis with GraphFrames Who This Book Is For Data analysts, Python programmers, big data enthusiasts

Mastering MongoDB 3.x

"Mastering MongoDB 3.x" is your comprehensive guide to mastering the world of MongoDB, the leading NoSQL database. This book equips you with both foundational and advanced skills to effectively design, develop, and manage MongoDB-powered applications. Discover how to build fault-tolerant systems and dive deep into database internals, deployment strategies, and much more. What this Book will help me do Gain expertise in advanced querying using indexing and data expressions for efficient data retrieval. Master MongoDB administration for both on-premise and cloud-based environments efficiently. Learn data sharding and replication techniques to ensure scalability and fault tolerance. Understand the intricacies of MongoDB internals, including performance optimization techniques. Leverage MongoDB for big data processing by integrating with complex data pipelines. Author(s) Alex Giamas is a seasoned database developer and administrator with strong expertise in NoSQL technologies, particularly MongoDB. With years of experience guiding teams on creating and optimizing database structures, Alex ensures clear and practical methods for learning the essential aspects of MongoDB. His writing focuses on actionable knowledge and practical solutions for modern database challenges. Who is it for? This book is perfect for database developers, system architects, and administrators who are already familiar with database concepts and are looking to deepen their knowledge in NoSQL databases, specifically MongoDB. Whether you're working on building web applications, scaling data systems, or ensuring fault tolerance, this book provides the guidance to optimize your database management skill set.

Introduction to GPUs for Data Analytics

Moore’s law has finally run out of steam for CPUs. The number of x86 cores that can be placed cost-effectively on a single chip has reached a practical limit, making higher densities prohibitively expensive for most applications. Fortunately, for big data analytics, machine learning, and database applications, a more capable and cost-effective alternative for scaling compute performance is already available: the graphics processing unit, or GPU. In this report, executives at Kinetica and Sierra Communications explain how incorporating GPUs is ideal for keeping pace with the relentless growth in streaming, complex, and large data confronting organizations today. Technology professionals, business analysts, and data scientists will learn how their organizations can begin implementing GPU-accelerated solutions either on premise or in the cloud. This report explores: How GPUs supplement CPUs to enable continued price/performance gains The many database and data analytics applications that can benefit from GPU acceleration Why GPU databases with user-defined functions (UDFs) can simplify and unify the machine learning/deep learning pipeline How GPU-accelerated databases can process streaming data from the Internet of Things and other sources in real time The performance advantage of GPU databases in demanding geospatial analytics applications How cognitive computing—the most compute-intensive application currently imaginable—is now within reach, using GPUs

Apache Spark 2.x Machine Learning Cookbook

This book is your gateway to mastering machine learning with Apache Spark 2.x. Through detailed hands-on recipes, you'll delve into building scalable ML models, optimizing big data processes, and enhancing project efficiency. Gain practical knowledge and explore real-world applications of recommendations, clustering, analytics, and more with Spark's powerful capabilities. What this Book will help me do Understand how to integrate Scala and Spark for effective machine learning development. Learn to create scalable recommendation engines using Spark. Master the development of clustering systems to organize unlabelled data at scale. Explore Spark libraries to implement efficient text analytics and search engines. Optimize large-scale data operations, tackling high-dimensional issues with Spark. Author(s) The team of authors brings expertise in machine learning, data science, and Spark technologies. Their combined industry experience and academic knowledge ensure the book is grounded in practical applications while offering theoretical insights. With clear explanations and a step-by-step approach, they aim to simplify complex concepts for developers and data scientists. Who is it for? This book is crafted for Scala developers familiar with machine learning concepts but seeking practical applications with Spark. If you have been implementing models but want to scale them and leverage Spark's robust ecosystem, this guide will serve you well. It is ideal for professionals seeking to deepen their skills in Spark and data science.

Kafka: The Definitive Guide

Every enterprise application creates data, whether it’s log messages, metrics, user activity, outgoing messages, or something else. And how to move all of this data becomes nearly as important as the data itself. If you’re an application architect, developer, or production engineer new to Apache Kafka, this practical guide shows you how to use this open source streaming platform to handle real-time data feeds. Engineers from Confluent and LinkedIn who are responsible for developing Kafka explain how to deploy production Kafka clusters, write reliable event-driven microservices, and build scalable stream-processing applications with this platform. Through detailed examples, you’ll learn Kafka’s design principles, reliability guarantees, key APIs, and architecture details, including the replication protocol, the controller, and the storage layer. Understand publish-subscribe messaging and how it fits in the big data ecosystem. Explore Kafka producers and consumers for writing and reading messages Understand Kafka patterns and use-case requirements to ensure reliable data delivery Get best practices for building data pipelines and applications with Kafka Manage Kafka in production, and learn to perform monitoring, tuning, and maintenance tasks Learn the most critical metrics among Kafka’s operational measurements Explore how Kafka’s stream delivery capabilities make it a perfect source for stream processing systems

Mastering Apache Storm

Mastering Apache Storm is your step-by-step guide to mastering real-time data streaming with this robust framework. You'll learn how to process big data efficiently and integrate Apache Storm with popular technologies like Kafka, HBase, and Redis to maximize its potential. This book walks you through from basic concepts to advanced implementations of Apache Storm in real-world scenarios. What this Book will help me do Understand the core features and operation of Apache Storm for real-time data streaming. Integrate Apache Storm with other Big Data frameworks like Kafka, HBase, Redis, and Hadoop. Effectively deploy and manage multi-node Apache Storm clusters in real-world environments. Monitor and analyze your data streams and system health effectively using built-in and external tools. Learn to implement fault-tolerant, scalable, and distributed stream processing applications in Apache Storm. Author(s) None Jain is an experienced software developer and technical instructor specializing in distributed systems and real-time data processing. With years of experience working with Apache Storm and related technologies, their teachings focus on practical, hands-on learning to equip readers with actionable skills. Who is it for? This book is ideal for Java developers aspiring to build expertise in real-time data streaming and distributed processing applications using Apache Storm. Beginners can start with the fundamentals provided, while those with prior knowledge can delve into intermediate and advanced implementations.

Apache Spark 2.x for Java Developers

Delve into mastering big data processing with 'Apache Spark 2.x for Java Developers.' This book provides a practical guide to implementing Apache Spark using the Java APIs, offering a unique opportunity for Java developers to leverage Spark's powerful framework without transitioning to Scala. What this Book will help me do Learn how to process data from formats like XML, JSON, CSV using Spark Core. Implement real-time analytics using Spark Streaming and third-party tools like Kafka. Understand data querying with Spark SQL and master SQL schema processing. Apply machine learning techniques with Spark MLlib to real-world scenarios. Explore graph processing and analytics using Spark GraphX. Author(s) None Kumar and None Gulati, experienced professionals in Java development and big data, bring their wealth of practical experience and passion for teaching to this book. With a clear and concise writing style, they aim to simplify Spark for Java developers, making big data approachable. Who is it for? This book is perfect for Java developers who are eager to expand their skillset into big data processing with Apache Spark. Whether you are a seasoned Spark user or first diving into big data concepts, this book meets you at your level. With practical examples and straightforward explanations, you can unlock the potential of Spark in real-world scenarios.

Mastering Apache Spark 2.x - Second Edition

Mastering Apache Spark 2.x is the essential guide to harnessing the power of big data processing. Dive into real-time data analytics, machine learning, and cluster computing using Apache Spark's advanced features and modules like Spark SQL and MLlib. What this Book will help me do Gain proficiency in Spark's batch and real-time data processing with SparkSQL. Master techniques for machine learning and deep learning using SparkML and SystemML. Understand the principles of Spark's graph processing with GraphX and GraphFrames. Learn to deploy Apache Spark efficiently on platforms like Kubernetes and IBM Cloud. Optimize Spark cluster performance by configuring parameters effectively. Author(s) Romeo Kienzler is a seasoned professional in big data and machine learning technologies. With years of experience in cloud-based distributed systems, Romeo brings practical insights into leveraging Apache Spark. He combines his deep technical expertise with a clear and engaging writing style. Who is it for? This book is tailored for intermediate Apache Spark users eager to deepen their knowledge in Spark 2.x's advanced features. Ideal for data engineers and big data professionals seeking to enhance their analytics pipelines with Spark. A basic understanding of Spark and Scala is necessary. If you're aiming to optimize Spark for real-world applications, this book is crafted for you.

SQL Server 2016 High Availability Unleashed (includes Content Update Program)

Book + Content Update Program SQL Server 2016 High Availability Unleashed provides start-to-finish coverage of SQL Server’s powerful high availability (HA) solutions for your traditional on-premise databases, cloud-based databases (Azure or AWS), hybrid databases (on-premise coupled with the cloud), and your emerging Big Data solutions. This complete guide introduces an easy-to-follow, formal HA methodology that has been refined over the past several years and helps you identity the right HA solution for your needs. There is also additional coverage of both disaster recovery and business continuity architectures and considerations. You are provided with step-by-step guides, examples, and sample code to help you set up, manage, and administer these highly available solutions. All examples are based on existing production deployments at major Fortune 500 companies around the globe. This book is for all intermediate-to-advanced SQL Server and Big Data professionals, but is also organized so that the first few chapters are great foundation reading for CIOs, CTOs, and even some tech-savvy CFOs. Learn a formal, high availability methodology for understanding and selecting the right HA solution for your needs Deep dive into Microsoft Cluster Services Use selective data replication topologies Explore thorough details on AlwaysOn and availability groups Learn about HA options with log shipping and database mirroring/ snapshots Get details on Microsoft Azure for Big Data and Azure SQL Explore business continuity and disaster recovery Learn about on-premise, cloud, and hybrid deployments Provide all types of database needs, including online transaction processing, data warehouse and business intelligence, and Big Data Explore the future of HA and disaster recovery In addition, this book is part of InformIT’s exciting Content Update Program, which provides content updates for major technology improvements! As significant updates are made to SQL Server, sections of this book will be updated or new sections will be added to match the updates to the technologies. As updates become available, they will be delivered to you via a free Web Edition of this book, which can be accessed with any Internet connection. To learn more, visit informit.com/cup. How to access the Web Edition: Follow the instructions inside to learn how to register your book to access the FREE Web Edition. * The companion material is not available with the online edition on O'Reilly Learning

Frank Kane's Taming Big Data with Apache Spark and Python

This book introduces you to the world of Big Data processing using Apache Spark and Python. You will learn to set up and run Spark on different systems, process massive datasets, and create solutions to real-world Big Data challenges with over 15 hands-on examples included. What this Book will help me do Understand the basics of Apache Spark and its ecosystem. Learn how to process large datasets with Spark RDDs using Python. Implement machine learning models with Spark's MLlib library. Master real-time data processing with Spark Streaming modules. Deploy and run Spark jobs on cloud clusters using AWS EMR. Author(s) Frank Kane spent 9 years working at Amazon and IMDb, handling and solving real-world machine learning and Big Data problems. Today, as an instructional designer and educator, he brings his wealth of experience to learners around the globe by creating accessible, practical learning resources. His teaching is clear, engaging, and designed to prepare students for real-world applications. Who is it for? This book is ideal for data scientists or data analysts seeking to delve into Big Data processing with Apache Spark. Readers who have foundational knowledge of Python, as well as some understanding of data processing principles, will find this book useful to sharpen their skills further. It is designed for those eager to learn the practical applications of Big Data tools in today's industry environments. By the end of this book, you should feel confident tackling Big Data challenges using Spark and Python.

Apache Spark 2.x Cookbook

Discover how to harness the power of Apache Spark 2.x for your Big Data processing projects. In this book, you will explore over 70 cloud-ready recipes that will guide you to perform distributed data analytics, structured streaming, machine learning, and much more. What this Book will help me do Effectively install and configure Apache Spark with various cluster managers and platforms. Set up and utilize development environments tailored for Spark applications. Operate on schema-aware data using RDDs, DataFrames, and Datasets. Perform real-time streaming analytics with sources such as Apache Kafka. Leverage MLlib for supervised learning, unsupervised learning, and recommendation systems. Author(s) None Yadav is a seasoned data engineer with a deep understanding of Big Data tools and technologies, particularly Apache Spark. With years of experience in the field of distributed computing and data analysis, Yadav brings practical insights and techniques to enrich the learning experience of readers. Who is it for? This book is ideal for data engineers, data scientists, and Big Data professionals who are keen to enhance their Apache Spark 2.x skills. If you're working with distributed processing and want to solve complex data challenges, this book addresses practical problems. Note that a basic understanding of Scala is recommended to get the most out of this resource.

Data Lake for Enterprises

"Data Lake for Enterprises" is a comprehensive guide to building data lakes using the Lambda Architecture. It introduces big data technologies like Hadoop, Spark, and Flume, showing how to use them effectively to manage and leverage enterprise-scale data. You'll gain the skills to design and implement data systems that handle complex data challenges. What this Book will help me do Master the use of Lambda Architecture to create scalable and effective data management systems. Understand and implement technologies like Hadoop, Spark, Kafka, and Flume in an enterprise data lake. Integrate batch and stream processing techniques using big data tools for comprehensive data analysis. Optimize data lakes for performance and reliability with practical insights and techniques. Implement real-world use cases of data lakes and machine learning for predictive data insights. Author(s) None Mishra, None John, and Pankaj Misra are recognized experts in big data systems with a strong background in designing and deploying data solutions. With a clear and methodical teaching style, they bring years of experience to this book, providing readers with the tools and knowledge required to excel in enterprise big data initiatives. Who is it for? This book is ideal for software developers, data architects, and IT professionals looking to integrate a data lake strategy into their enterprises. It caters to readers with a foundational understanding of Java and big data concepts, aiming to advance their practical knowledge of building scalable data systems. If you're eager to delve into cutting-edge technologies and transform enterprise data management, this book is for you.

Machine Learning with Spark - Second Edition

Dive into the world of distributed machine learning with Apache Spark, a powerful framework for handling, processing, and analyzing big data. This book will take you through implementing popular machine learning algorithms using Spark ML, covering end-to-end workflows such as data preparation, model building, predictive analysis, and text processing. What this Book will help me do Learn to implement scalable machine learning solutions using Spark ML. Develop the skills to set up and configure Apache Spark environments. Master the application of machine learning techniques like clustering, classification, and regression with Spark. Efficiently handle and process large-scale datasets using Spark tools. Put Spark's capabilities to work in building real-world distributed data processing solutions. Author(s) None Dua and None Ghotra bring a wealth of experience in big data and machine learning to this book. They have been involved in building scalable data systems and implementing machine learning solutions in various industry scenarios. Their approach is hands-on and focused on teaching practical, actionable knowledge. Who is it for? This book is perfect for data enthusiasts, data engineers, and machine learning practitioners who are familiar with Python and Scala, eager to apply machine learning concepts in distributed environments. It's aimed at professionals looking to develop their skills in building scalable data systems and implementing advanced machine learning workflows in Spark.

Sams Teach Yourself Hadoop in 24 Hours

Apache Hadoop is the technology at the heart of the Big Data revolution, and Hadoop skills are in enormous demand. Now, in just 24 lessons of one hour or less, you can learn all the skills and techniques you'll need to deploy each key component of a Hadoop platform in your local environment or in the cloud, building a fully functional Hadoop cluster and using it with real programs and datasets. Each short, easy lesson builds on all that's come before, helping you master all of Hadoop's essentials, and extend it to meet your unique challenges. Apache Hadoop in 24 Hours, Sams Teach Yourself covers all this, and much more: Understanding Hadoop and the Hadoop Distributed File System (HDFS) Importing data into Hadoop, and process it there Mastering basic MapReduce Java programming, and using advanced MapReduce API concepts Making the most of Apache Pig and Apache Hive Implementing and administering YARN Taking advantage of the full Hadoop ecosystem Managing Hadoop clusters with Apache Ambari Working with the Hadoop User Environment (HUE) Scaling, securing, and troubleshooting Hadoop environments Integrating Hadoop into the enterprise Deploying Hadoop in the cloud Getting started with Apache Spark Step-by-step instructions walk you through common questions, issues, and tasks; Q-and-As, Quizzes, and Exercises build and test your knowledge; "Did You Know?" tips offer insider advice and shortcuts; and "Watch Out!" alerts help you avoid pitfalls. By the time you're finished, you'll be comfortable using Apache Hadoop to solve a wide spectrum of Big Data problems.