talk-data.com talk-data.com

Topic

data-engineering

3377

tagged

Activity Trend

1 peak/qtr
2020-Q1 2026-Q1

Activities

Showing filtered results

Filtering by: O'Reilly Data Engineering Books ×
Frank Kane's Taming Big Data with Apache Spark and Python

This book introduces you to the world of Big Data processing using Apache Spark and Python. You will learn to set up and run Spark on different systems, process massive datasets, and create solutions to real-world Big Data challenges with over 15 hands-on examples included. What this Book will help me do Understand the basics of Apache Spark and its ecosystem. Learn how to process large datasets with Spark RDDs using Python. Implement machine learning models with Spark's MLlib library. Master real-time data processing with Spark Streaming modules. Deploy and run Spark jobs on cloud clusters using AWS EMR. Author(s) Frank Kane spent 9 years working at Amazon and IMDb, handling and solving real-world machine learning and Big Data problems. Today, as an instructional designer and educator, he brings his wealth of experience to learners around the globe by creating accessible, practical learning resources. His teaching is clear, engaging, and designed to prepare students for real-world applications. Who is it for? This book is ideal for data scientists or data analysts seeking to delve into Big Data processing with Apache Spark. Readers who have foundational knowledge of Python, as well as some understanding of data processing principles, will find this book useful to sharpen their skills further. It is designed for those eager to learn the practical applications of Big Data tools in today's industry environments. By the end of this book, you should feel confident tackling Big Data challenges using Spark and Python.

Learning Elasticsearch

This comprehensive guide to Elasticsearch will teach you how to build robust and scalable search and analytics applications using Elasticsearch 5.x. You will learn the fundamentals of Elasticsearch, including its APIs and tools, and how to apply them to real-world problems. By the end of the book, you will have a solid grasp of Elasticsearch and be ready to implement your own solutions. What this Book will help me do Master the setup and configuration of Elasticsearch and Kibana. Learn to efficiently query and analyze both structured and unstructured data. Understand how to use Elasticsearch aggregations to perform advanced analytics. Gain knowledge of advanced search features including geospatial queries and autocomplete. Explore the Elastic Stack and learn deployment best practices and cloud hosting options. Author(s) None Andhavarapu is an expert in database technology and distributed systems, with years of experience in Elasticsearch. Their passion for search technologies is reflected in their clear and practical teaching style. They've written this guide to help developers of all levels get up to speed with Elasticsearch quickly and comprehensively. Who is it for? This book is perfect for software developers looking to implement effective search and analytics solutions. It's ideal for those who are new to Elasticsearch as well as for professionals familiar with other search tools like Lucene or Solr. The book assumes basic programming knowledge but no prior experience with Elasticsearch.

SQL Server 2017 Integration Services Cookbook

SQL Server 2017 Integration Services Cookbook is your key to mastering effective data integration and transformation solutions using SSIS 2017. Through clear, concise recipes, this book teaches the advanced ETL techniques necessary for creating efficient data workflows, leveraging both traditional and modern data platforms. What this Book will help me do Master the integration of diverse data sources into comprehensive data models. Develop optimized ETL workflows that improve operational efficiency. Leverage the new features introduced in SQL Server 2017 for enhanced data processing. Implement scalable data warehouse solutions suitable for modern analytics workloads. Customize and extend integration services to handle specific data transformation needs. Author(s) The authors are seasoned professionals in data integration and ETL technologies. They bring years of real-world experience using SQL Server Integration Services in various enterprise scenarios. Their combined expertise ensures practical insights and guidance, making complex concepts accessible to learners and practitioners alike. Who is it for? This book is ideal for data engineers and ETL developers who already understand the basics of SQL Server and want to master advanced data integration techniques. It is also suitable for database administrators and data analysts aiming to enhance their skill set with efficient ETL processes. Arm yourself with this guide to learn not just the how, but also the why, behind successful data transformations.

Implementing OpenStack SwiftHLM with IBM Spectrum Archive EE or IBM Spectrum Protect for Space Management

The Swift High Latency Media project seeks to create a high-latency storage back end that makes it easier for users to perform bulk operations of data tiering within a Swift data ring. In today's world, data is produced at significantly higher rates than a decade ago. The storage and data management solutions of the past can no longer keep up with the data demands of today. The policies and structures that decide and execute how that data is used, discarded, or retained determines how efficiently the data is used. The need for intelligent data management and storage is more critical now than ever before. Traditional management approaches hide cost-effective, high-latency media (HLM) storage, such as tape or optical disk archive back ends, underneath a traditional file system. The lack of HLM-aware file system interfaces and software makes it difficult for users to understand and control data access on HLM storage. Coupled with data-access latency, this lack of understanding results in slow responses and potential time-outs that affect the user experience. The Swift HLM project addresses this challenge. Running OpenStack Swift on top of HLM storage allows you to cheaply store and efficiently access large amounts of infrequently used object data. Data that is stored on tape storage can be easily adopted to an Object Storage data interface. This IBM® Redpaper™ publication describes the Swift High Latency Media project and provides guidance for installation and configuration.

Development Workflows for Data Scientists

Data science teams often borrow best practices from software development, but since the product of a data science project is insight, not code, software development workflows are not a perfect fit. How can data scientists create workflows tailored to their needs? Through interviews with several data-driven organizations, this practical report reveals how data science teams are improving the way they define, enforce, and automate a development workflow. Data science workflows differ from team to team because their tasks, goals, and skills vary so much. In this report, author Ciara Byrne talked to teams from BinaryEdge, Airbnb, GitHub, Scotiabank, Fast Forward Labs, Datascope, and others about their approaches to the data science process, including their procedures for: Defining team structure and roles Asking interesting questions Examining previous work Collecting, exploring, and modeling data Testing, documenting, and deploying code to production Communicating the results With this report, you’ll also examine a complete data science workflow developed by the team from Swiss cybersecurity firm BinaryEdge that includes steps for preliminary data analysis, exploratory data analysis, knowledge discovery, and visualization.

Understanding Message Brokers

Messaging is one of the more poorly understood areas of IT; most developers and architects have only a passing familiarity with how broker-based messaging technologies work. This practical report not only helps you get up to speed on the essentials of messaging, but also compares two of today’s most popular messaging technologies—Apache ActiveMQ and Apache Kafka. Author and consultant Jakub Korab describes use cases and design choices that lead developers to very different approaches for developing message-based systems. You’ll come away with a high-level understanding of both ActiveMQ and Kafka, including how they should and should not be used, how they handle concerns such as throughput and high-availability, and what to look out for when considering other messaging technologies in future. Understand the types of problems that messaging systems address Explore three primary messaging patterns: point-to-point, publish-subscribe, and a hybrid of both Dive into ActiveMQ, a classic broker-centric design implemented through Java libraries that works for a broad range of messaging use cases Examine Kafka, a distributed system that can be scaled to provide massive performance and fault tolerance through replication Learn the mechanical complexities that message-based systems need to address, and some patterns you can apply to deal with those complexities

Practical GIS

Practical GIS introduces you to the world of Geographic Information Systems (GIS) using accessible, open source tools. From setting up your GIS environment to creating and analyzing spatial data and publishing it online, this book covers everything you need to perform both beginner and advanced GIS tasks. What this Book will help me do Understand the fundamentals of GIS and use open source tools effectively. Be able to collect, store, query, and manage spatial data efficiently. Perform advanced spatial analyses and solve real-world GIS problems practically. Learn how to publish and share GIS data and results using QGIS Server and GeoServer. Create web maps using lightweight web mapping libraries like Leaflet. Author(s) The authors of Practical GIS bring years of professional experience in GIS and data analysis, combining technical know-how with a teaching approach accessible to a wide range of learners. They strive to convey complex GIS concepts simply and practically, fostering a hands-on learning experience. Who is it for? This book is ideal for IT professionals new to GIS or those considering entering the GIS field. If you're looking for a cost-effective way to learn GIS without investing in expensive commercial software or formal training, Practical GIS provides the knowledge and tools you need. Beginners and intermediate learners alike will find this book to be a helpful stepping stone in mastering GIS.

Advanced Analytics with Spark, 2nd Edition

In the second edition of this practical book, four Cloudera data scientists present a set of self-contained patterns for performing large-scale data analysis with Spark. The authors bring Spark, statistical methods, and real-world data sets together to teach you how to approach analytics problems by example. Updated for Spark 2.1, this edition acts as an introduction to these techniques and other best practices in Spark programming. You’ll start with an introduction to Spark and its ecosystem, and then dive into patterns that apply common techniques—including classification, clustering, collaborative filtering, and anomaly detection—to fields such as genomics, security, and finance. If you have an entry-level understanding of machine learning and statistics, and you program in Java, Python, or Scala, you’ll find the book’s patterns useful for working on your own data applications. With this book, you will: Familiarize yourself with the Spark programming model Become comfortable within the Spark ecosystem Learn general approaches in data science Examine complete implementations that analyze large public data sets Discover which machine learning tools make sense for particular problems Acquire code that can be adapted to many uses

Apache Spark 2.x Cookbook

Discover how to harness the power of Apache Spark 2.x for your Big Data processing projects. In this book, you will explore over 70 cloud-ready recipes that will guide you to perform distributed data analytics, structured streaming, machine learning, and much more. What this Book will help me do Effectively install and configure Apache Spark with various cluster managers and platforms. Set up and utilize development environments tailored for Spark applications. Operate on schema-aware data using RDDs, DataFrames, and Datasets. Perform real-time streaming analytics with sources such as Apache Kafka. Leverage MLlib for supervised learning, unsupervised learning, and recommendation systems. Author(s) None Yadav is a seasoned data engineer with a deep understanding of Big Data tools and technologies, particularly Apache Spark. With years of experience in the field of distributed computing and data analysis, Yadav brings practical insights and techniques to enrich the learning experience of readers. Who is it for? This book is ideal for data engineers, data scientists, and Big Data professionals who are keen to enhance their Apache Spark 2.x skills. If you're working with distributed processing and want to solve complex data challenges, this book addresses practical problems. Note that a basic understanding of Scala is recommended to get the most out of this resource.

Data Lake for Enterprises

"Data Lake for Enterprises" is a comprehensive guide to building data lakes using the Lambda Architecture. It introduces big data technologies like Hadoop, Spark, and Flume, showing how to use them effectively to manage and leverage enterprise-scale data. You'll gain the skills to design and implement data systems that handle complex data challenges. What this Book will help me do Master the use of Lambda Architecture to create scalable and effective data management systems. Understand and implement technologies like Hadoop, Spark, Kafka, and Flume in an enterprise data lake. Integrate batch and stream processing techniques using big data tools for comprehensive data analysis. Optimize data lakes for performance and reliability with practical insights and techniques. Implement real-world use cases of data lakes and machine learning for predictive data insights. Author(s) None Mishra, None John, and Pankaj Misra are recognized experts in big data systems with a strong background in designing and deploying data solutions. With a clear and methodical teaching style, they bring years of experience to this book, providing readers with the tools and knowledge required to excel in enterprise big data initiatives. Who is it for? This book is ideal for software developers, data architects, and IT professionals looking to integrate a data lake strategy into their enterprises. It caters to readers with a foundational understanding of Java and big data concepts, aiming to advance their practical knowledge of building scalable data systems. If you're eager to delve into cutting-edge technologies and transform enterprise data management, this book is for you.

Mastering PostGIS

"Mastering PostGIS" is your guide to unlocking the powerful capabilities of the PostGIS spatial database system. Across 328 pages, this book takes you through the essentials of spatial data handling, from importing, analyzing, and exporting datasets to building fully-functional GIS applications. You'll explore concepts such as spatial querying, data types, and integrating PostGIS with powerful tools like GeoServer and OpenLayers. What this Book will help me do Understand the fundamentals of PostGIS and its role in GIS workflows. Gain hands-on experience in SQL-based spatial queries and data manipulation. Develop the ability to integrate PostGIS with web platforms like Node.js, GeoServer, and OpenLayers. Discover strategies for spatial data ETL (Extract, Transform, Load) processes and live updates. Build scalable, efficient GIS applications leveraging PostGIS's capabilities. Author(s) George Silva, None Mikiewicz, and Michal Mackiewicz None are experts in GIS systems and database technologies with years of experience working with spatial databases such as PostGIS. Passionate about imparting practical knowledge, they provide clear, hands-on examples in every chapter to help you master spatial database solutions. Who is it for? This book is perfect for GIS developers and analysts looking to deepen their knowledge of PostGIS. If you aim to enhance your skills in designing GIS applications or performing spatial data analysis, this is your ideal resource. Prior experience with PostgreSQL and a basic installation of PostGIS are recommended for readers.

Mastering Ceph

Mastering Ceph offers a comprehensive guide to mastering the Ceph distributed storage system, empowering you to implement and manage scalable storage solutions effectively. As you delve into the chapters, you'll gain the practical experience needed to handle Ceph with confidence, achieve resource optimization, and ensure high availability for critical applications. What this Book will help me do Understand and utilize Ceph's advanced capabilities such as erasure coding and tiering for storage efficiency. Implement and manage scalable and resilient Ceph clusters effectively, easing resource allocation. Use tools like Ansible and Vagrant to deploy Ceph clusters quickly and reproducibly. Enhance your troubleshooting skills to resolve complex storage issues and ensure cluster stability. Develop applications to integrate with Ceph using Librados and distributed computation classes. Author(s) This book was authored by None Fisk, an experienced professional in cloud and distributed storage systems. Known for their expertise in Ceph, None Fisk shares practical insights developed over years of working as an administrator and developer. Through their accessible and systematic writing, they guide readers to overcome real-world storage challenges. Who is it for? This detailed guide is ideal for developers and system administrators familiar with deploying Ceph, who want to deepen their understanding of its advanced features. If you're aiming to optimize performance and design robust storage solutions, this is the book for you. Prior experience with Ceph is recommended to fully benefit from the book's insights.

Mastering PostgreSQL 9.6

This comprehensive guide, 'Mastering PostgreSQL 9.6,' delves into the advanced features of PostgreSQL, equipping you with the skills to optimize queries, manage replication, and ensure high availability. Whether you are implementing advanced administrative tasks or enhancing database performance, this book will provide the tools and knowledge you need. What this Book will help me do Master advanced database functionalities in PostgreSQL 9.6. Enhance your proficiency in optimizing queries and using indexes effectively. Gain expertise in managing replication and ensuring high availability. Develop skills in server maintenance, monitoring, and resilience. Learn effective troubleshooting strategies for PostgreSQL database challenges. Author(s) Hans-Jürgen Schönig is an experienced database professional specializing in PostgreSQL consulting and training. With decades of experience in developing robust solutions, he brings a pragmatic and insightful approach to database management. His emphasis on practical application and clear explanations makes his writing accessible to learners at all levels. Who is it for? This book is ideal for PostgreSQL data architects and administrators looking to deepen their understanding of PostgreSQL's advanced functionalities. It's tailored for readers with prior experience in PostgreSQL administration and a working knowledge of SQL. If you're keen to master complex database tasks and optimize your PostgreSQL usage, you'll find this book invaluable.

Hadoop 2.x Administration Cookbook

Gain mastery over managing and maintaining large Apache Hadoop clusters with the Hadoop 2.x Administration Cookbook. This book provides practical step-by-step recipes guiding you to efficiently set up, optimize, and troubleshoot Hadoop clusters, ensuring high availability, security, and optimal performance in your data operations. What this Book will help me do Successfully set up and deploy an operational Hadoop 2.x cluster suitable for large-scale data operations. Effectively monitor and maintain Hadoop's HDFS, YARN, and MapReduce systems for optimized performance. Plan, configure, and enhance cluster availability using Zookeeper and Journal Node strategies. Develop workflows and manage data ingestion processes with tools like Flume and Oozie. Secure, troubleshoot, and optimize Hadoop environments to meet enterprise and operational standards. Author(s) Aman Singh is an experienced Hadoop administrator with years of hands-on experience managing robust and efficient Hadoop clusters. Aman has a deep understanding of the practical challenges faced in this field and a talent for breaking down complex topics into actionable steps. Through clear, problem-oriented language, Aman helps readers achieve fluency in Hadoop administration. Who is it for? This book is ideal for system administrators or IT professionals who have a foundational understanding of Hadoop and aim to strengthen their administrative skills. It is especially beneficial for experienced Hadoop administrators looking for a quick and practical reference guide to master cluster management. Whether you're working in a large enterprise or exploring Hadoop ecosystems for personal development, you'll find this book invaluable.

High Performance Spark

Apache Spark is amazing when everything clicks. But if you haven’t seen the performance improvements you expected, or still don’t feel confident enough to use Spark in production, this practical book is for you. Authors Holden Karau and Rachel Warren demonstrate performance optimizations to help your Spark queries run faster and handle larger data sizes, while using fewer resources. Ideal for software engineers, data engineers, developers, and system administrators working with large-scale data applications, this book describes techniques that can reduce data infrastructure costs and developer hours. Not only will you gain a more comprehensive understanding of Spark, you’ll also learn how to make it sing. With this book, you’ll explore: How Spark SQL’s new interfaces improve performance over SQL’s RDD data structure The choice between data joins in Core Spark and Spark SQL Techniques for getting the most out of standard RDD transformations How to work around performance issues in Spark’s key/value pair paradigm Writing high-performance Spark code without Scala or the JVM How to test for functionality and performance when applying suggested improvements Using Spark MLlib and Spark ML machine learning libraries Spark’s Streaming components and external community packages

IBM PowerHA SystemMirror V7.2.1 for IBM AIX Updates

Abstract This IBM® Redbooks® publication helps strengthen the position of the IBM PowerHA® SystemMirror® solution with a well-defined and documented deployment models within an IBM Power Systems™ virtualized environment, which provides customers with a planned foundation for business resilience and disaster recovery for their IBM Power Systems infrastructure solutions. This publication addresses topics to help meet customers' complex high availability and disaster recovery requirements on IBM Power Systems servers to help maximize their systems' availability and resources, and provide technical documentation to transfer the how-to-skills to users and support teams. This book is targeted at technical professionals (consultants, technical support staff, IT Architects, and IT Specialists) who are responsible for providing high availability and disaster recovery solutions and support with IBM PowerHA SystemMirror Standard and Enterprise Editions on IBM Power Systems servers.

Oracle on IBM z Systems

Abstract Oracle Database 12c Release 1 running on Linux is available for deployment on IBM® z Systems®. The enterprise-grade Linux on IBM z Systems solution is designed to add value to Oracle Database solutions, including the new functions that are introduced in Oracle Database 12c. In this IBM Redbooks® publication, we explore the IBM and Oracle Alliance and describe how Oracle Database benefits from IBM z Systems®. We then explain how to set up Linux guests to install Oracle Database 12c. We also describe how to use the Oracle Enterprise Manager Cloud Control Agent to manage Oracle Database 12c Release 1. We also describe a successful consolidation project from sizing to migration, performance management topics, and high availability. Finally, we end with a chapter about surrounding Oracle with Open Source software. The audience for this publication includes database consultants, installers, administrators, and system programmers. This publication is not meant to replace Oracle documentation, but to supplement it with our experiences while installing and using Oracle products.

Implementing the IBM System Storage SAN Volume Controller with IBM Spectrum Virtualize V7.8

Abstract This IBM® Redbooks® publication is a detailed technical guide to the IBM System Storage® SAN Volume Controller, which is powered by IBM Spectrum Virtualize™ Version 7.8. IBM SAN Volume Controller is a virtualization appliance solution, which maps virtualized volumes that are visible to hosts and applications to physical volumes on storage devices. Each server within the storage area network (SAN) has its own set of virtual storage addresses that are mapped to physical addresses. If the physical addresses change, the server continues running by using the same virtual addresses that it had before. Therefore, volumes or storage can be added or moved while the server is still running. The IBM virtualization technology improves the management of information at the "block" level in a network, which enables applications and servers to share storage devices on a network.

POWER8 High-performance Computing Guide IBM Power System S822LC (8335-GTB) Edition

Abstract This IBM® Redbooks® publication documents and addresses topics to provide step-by-step customizable application and programming solutions to tune application and workloads to use IBM Power Systems™ hardware architecture. This publication explores, tests, and documents the solution to use the architectural technologies and the software solutions that are available from IBM to help solve challenging technical and business problems. This publication also demonstrates and documents that the combination of IBM high-performance computing (HPC) solutions (hardware and software) delivers significant value to technical computing clients who are in need of cost-effective, highly scalable, and robust solutions. First, the book provides a high-level overview of the HPC solution, including all of the components that makes the HPC cluster: IBM Power System S822LC (8335-GTB), software components, interconnect switches, and the IBM Spectrum™ Scale parallel file system. Then, the publication is divided in three parts: Part 1 focuses on the developers, Part 2 focuses on the administrators, and Part 3 focuses on the evaluators and planners of the solution. The IBM Redbooks publication is targeted toward technical professionals (consultants, technical support staff, IT Architects, and IT Specialists) who are responsible for delivering cost-effective HPC solutions that help uncover insights from vast amounts of client’s data so they can optimize business results, product development, and scientific discoveries.

IBM DB2 Web Query for i: The Nuts and Bolts

Abstract Business Intelligence (BI) is a broad term that relates to applications that analyze data to understand and act on the key metrics that drive profitability in an enterprise. Key to analyzing that data is providing fast, easy access to it while delivering it in formats or tools that best fit the needs of the user. At the core of any BI solution are user query and reporting tools that provide intuitive access to data supporting a spectrum of users from executives to “power users,” from spreadsheet aficionados to the external Internet consumer. IBM® DB2® Web Query for i offers a set of modernized tools for a more robust, extensible, and productive reporting solution than the popular IBM Query for System i® tool (also known as IBM Query/400). IBM DB2 Web Query for i preserves investments in the reports that are developed with Query/400 by offering a choice of importing definitions into the new technology or continuing to run existing Query/400 reports as is. But, it also offers significant productivity and performance enhancements by leveraging the latest in DB2 for i query optimization technology. The DB2 Web Query for i product is a web-based query and report writing product that offers enhanced capabilities over the IBM Query for iSeries product (also commonly known as Query/400). IBM DB2 Web Query for i includes Query for iSeries technology to assist customers in their transition to DB2 Web Query. It offers a more modernized, Java based solution for a more robust, extensible, and productive reporting solution. DB2 Web Query provides the ability to query or build reports against data that is stored in DB2 for i (or Microsoft SQL Server) databases through browser-based user interface technologies: Build reports with ease through the web-based, ribbon-like InfoAssist tool that leverages a common look and feel that can extend the number of personnel that can generate their own reports. Simplify the management of reports by significantly reducing the number of report definitions that are required through the use of parameter driven reports. Deliver data to users in many different formats, including directly into spreadsheets, or in boardroom-quality PDF format, or viewed from the browser in HTML. Leverage advanced reporting functions, such as matrix reporting, ranking, color coding, drill-down, and font customization to enhance the visualization of DB2 data. DB2 Web Query offers features to import Query/400 definitions and enhance their look and functions. By using it, you can add OLAP-like slicing and dicing to the reports or view reports in disconnected mode for users on the go. This IBM Redbooks® publication provides a broad understanding of what can be done with the DB2 Web Query product. This publication is a companion of DB2 Web Query Tutorials, SG24-8378, which has a group of self-explanatory tutorials to help you get up to speed quickly.