O'Reilly Data Engineering Books

Interactive Spark using PySpark

2016-08-15 O'Reilly Amazon

book

Benjamin Bengfort , Jenny Kim

data data-engineering apache-spark PySpark AI/ML Analytics

Apache Spark is an in-memory framework that allows data scientists to explore and interact with big data much more quickly than with Hadoop. Python users can work with Spark using an interactive shell called PySpark.
Why is it important? PySpark makes the large-scale data processing capabilities of Apache Spark accessible to data scientists who are more familiar with Python than Scala or Java. This also allows for reuse of a wide variety of Python libraries for machine learning, data visualization, numerical analysis, etc.
What you'll learn, and how you can apply it:
- Compare the different components provided by Spark, and what use cases they fit.
- Learn how to use RDDs (resilient distributed datasets) with PySpark.
- Write Spark applications in Python and submit them to the cluster as Spark jobs.
- Get an introduction to the Spark computing framework.
- Apply this approach to a worked example to determine the most frequent airline delays in a specific month and year.
This lesson is for you because:
- You're a data scientist, familiar with Python coding, who needs to get up and running with PySpark.
- You're a Python developer who needs to leverage the distributed computing resources available on a Hadoop cluster, without learning Java or Scala first.
Prerequisites:
- Familiarity with writing Python applications
- Some familiarity with bash command-line operations
- Basic understanding of how to use simple functional programming constructs in Python, such as closures, lambdas, maps, etc.
Materials or downloads needed in advance: Apache Spark
This lesson is taken from Data Analytics with Hadoop by Jenny Kim and Benjamin Bengfort.

Enabling Real-time Analytics on IBM z Systems Platform

2016-08-08 O'Reilly Amazon

book

Cedrine Madera , Ravi Kumar , Steven LaFalce , Sebastian Muszytowski , Oliver Benke , Lydia Parziale , Willie Favero

data data-engineering IBM AI/ML Analytics Data Modelling

Regarding online transaction processing (OLTP) workloads, IBM® z Systems™ platform, with IBM DB2®, data sharing, Workload Manager (WLM), geoplex, and other high-end features, is the widely acknowledged leader. Most customers now integrate business analytics with OLTP by running, for example, scoring functions from transactional context for real-time analytics or by applying machine-learning algorithms on enterprise data that is kept on the mainframe. As a result, IBM adds investment so clients can keep the complete lifecycle for data analysis, modeling, and scoring on z Systems control in a cost-efficient way, keeping the qualities of services in availability, security, reliability that z Systems solutions offer. Because of the changed architecture and tighter integration, IBM has shown, in a customer proof-of-concept, that a particular client was able to achieve an orders-of-magnitude improvement in performance, allowing that client’s data scientist to investigate the data in a more interactive process. Open technologies, such as Predictive Model Markup Language (PMML) can help customers update single components instead of being forced to replace everything at once. As a result, you have the possibility to combine your preferred tool for model generation (such as SAS Enterprise Miner or IBM SPSS® Modeler) with a different technology for model scoring (such as Zementis, a company focused on PMML scoring). IBM SPSS Modeler is a leading data mining workbench that can apply various algorithms in data preparation, cleansing, statistics, visualization, machine learning, and predictive analytics. It has over 20 years of experience and continued development, and is integrated with z Systems. With IBM DB2 Analytics Accelerator 5.1 and SPSS Modeler 17.1, the possibility exists to do the complete predictive model creation including data transformation within DB2 Analytics Accelerator. 
So, instead of moving the data to a distributed environment, algorithms can be pushed to the data, using cost-efficient DB2 Accelerator for the required resource-intensive operations. This IBM Redbooks® publication explains the overall z Systems architecture, how the components can be installed and customized, how the new IBM DB2 Analytics Accelerator loader can help with efficient data loading for z Systems data and external data, how in-database transformation, in-database modeling, and in-transactional real-time scoring can be used, and what other related technologies are available. This book is intended for technical specialists, architects, and data scientists who want to use the technology on the z Systems platform. Most of the technologies described in this book require IBM DB2 for z/OS®. For acceleration of the data investigation, data transformation, and data modeling process, DB2 Analytics Accelerator is required. Most value can be achieved if most of the data already resides on z Systems platforms, although adding external data (such as data from social sources) poses no problem at all.

Implementing an IBM High-Performance Computing Solution on IBM Power System S822LC

2016-07-25 O'Reilly Amazon

book

Dino Quintero , Wainer dos Santos Moschetta , Georgy E Pavlov , Mauricio Faria de Oliveira , Tsuyoshi Kamenoue , Luis Carlos Cruz Huertas , Alexander Pozdneev

data data-engineering IBM Analytics Data Analytics Linux

This IBM® Redbooks® publication demonstrates and documents that IBM Power Systems™ high-performance computing and technical computing solutions deliver faster time to value with powerful solutions. Configurable into highly scalable Linux clusters, Power Systems offer extreme performance for demanding workloads such as genomics, finance, computational chemistry, oil and gas exploration, and high-performance data analytics. This book delivers a high-performance computing solution implemented on the IBM Power System S822LC. The solution delivers high application performance and throughput based on its built-for-big-data architecture that incorporates IBM POWER8® processors, tightly coupled Field Programmable Gate Arrays (FPGAs) and accelerators, and faster I/O by using Coherent Accelerator Processor Interface (CAPI). This solution is ideal for clients that need more processing power while simultaneously increasing workload density and reducing datacenter floor space requirements. The Power S822LC offers a modular design to scale from a single rack to hundreds, simplicity of ordering, and a strong innovation roadmap for graphics processing units (GPUs). This publication is targeted toward technical professionals (consultants, technical support staff, IT Architects, and IT Specialists) responsible for delivering cost-effective high-performance computing (HPC) solutions that help uncover insights from their data so they can optimize business results, product development, and scientific discoveries.

IBM Netcool Operations Insight: A Scenarios Guide

2016-07-20 O'Reilly Amazon

book

Lanny Short , Manzoor Farid , Maciej Olejniczak , Vasfi Gucer , Ahmed A Saleh , Zane Bray , Steve Shuman , Jeff Ditto , Rob Clark

data data-engineering IBM Analytics Cloud Computing

IBM® Netcool® Operations Insight empowers your IT operations to use real-time and historical analytics to identify, isolate, and resolve problems before they affect your business. Powered by IBM Tivoli® Netcool/OMNIbus and the transformative capabilities of cognitive analytics, Netcool Operations Insight consolidates millions of alerts from across local, cloud, and hybrid environments into a few actionable problems. This IBM Redbooks® publication gives a broad understanding of Netcool Operations Insight and describes several scenarios that show the capabilities of this solution in a real-life environment. Each scenario features a different capability of Netcool Operations Insight. The scenarios are documented by using step-by-step figures with explanations to make them easier to implement in your own environment. The scenarios in this book are broken into the following categories:
- Network Management-related scenarios
- Network Event and cognitive-related scenarios
- Network Event-related scenarios
The target audience of this book is network specialists, network administrators, and network operators.

Perspectives on Data Science for Software Engineering

2016-07-14 O'Reilly Amazon

book

Laurie Williams , Thomas Zimmermann , Tim Menzies

data data-science Analytics Cloud Computing Data Collection Data Science

Perspectives on Data Science for Software Engineering presents the best practices of seasoned data miners in software engineering. The idea for this book arose during the 2014 conference at Dagstuhl, an invitation-only gathering of leading computer scientists who meet to identify and discuss cutting-edge informatics topics. At that conference, many discussions centered on how to transfer the knowledge of seasoned software engineers and data scientists to newcomers in the field. While there are many books covering data mining and software engineering basics, they present only the fundamentals and lack the perspective that comes from real-world experience. This book offers unique insights into the wisdom of the community’s leaders, gathered to share hard-won lessons from the trenches. Ideas are presented in digestible chapters designed to be applicable across many domains. Topics include data collection, data sharing, data mining, and how to use these techniques in successful software projects. Newcomers to software engineering data science will learn the tips and tricks of the trade, while more experienced data scientists will benefit from war stories that show what traps to avoid.
- Presents the wisdom of community experts, derived from a summit on software analytics
- Provides contributed chapters that share discrete ideas and techniques from the trenches
- Covers top areas of concern, including mining security and social data, data visualization, and cloud-based data
- Presented in clear chapters designed to be applicable across many domains

Introducing Microsoft SQL Server 2016: Mission-Critical Applications, Deeper Insights, Hyperscale Cloud

2016-06-28 O'Reilly Amazon

book

Joseph D'Antoni , Stacia Varga , Denny Cherry

data data-engineering relational-databases microsoft-sql-server Analytics Cloud Computing

With Microsoft SQL Server 2016, a variety of new features and enhancements to the data platform deliver breakthrough performance, advanced security, and richer, integrated reporting and analytics capabilities. In this ebook, we introduce new security features: Always Encrypted, Row-Level Security, and dynamic data masking; discuss enhancements that enable you to better manage performance and storage: TempDB configuration, query store, and Stretch Database; review several improvements to Reporting Services; and also describe AlwaysOn Availability Groups, tabular enhancements, and R integration.

Relevant Search

2016-06-20 O'Reilly Amazon

book

Doug Turnbull , John Berryman

data data-engineering search Analytics ELK

Relevant Search demystifies relevance work. Using Elasticsearch, it teaches you how to return engaging search results to your users, helping you understand and leverage the internals of Lucene-based search engines. About the Technology Users are accustomed to and expect instant, relevant search results. To achieve this, you must master the search engine. Yet for many developers, relevance ranking is mysterious or confusing. About the Book Relevant Search demystifies the subject and shows you that a search engine is a programmable relevance framework. You'll learn how to apply Elasticsearch or Solr to your business's unique ranking problems. The book demonstrates how to program relevance and how to incorporate secondary data sources, taxonomies, text analytics, and personalization. In practice, a relevance framework requires softer skills as well, such as collaborating with stakeholders to discover the right relevance requirements for your business. By the end, you'll be able to achieve a virtuous cycle of provable, measurable relevance improvements over a search product's lifetime. What's Inside Techniques for debugging relevance Applying search engine features to real problems Using the user interface to guide searchers A systematic approach to relevance A business culture focused on improving search About the Reader For developers trying to build smarter search with Elasticsearch or Solr. About the Authors Doug Turnbull is lead relevance consultant at OpenSource Connections, where he frequently speaks and blogs. John Berryman is a data engineer at Eventbrite, where he specializes in recommendations and search. Quotes One of the best and most engaging technical books I’ve ever read. - From the Foreword by Trey Grainger, Author of "Solr in Action" Will help you solve real-world search relevance problems for Lucene-based search engines. - Dimitrios Kouzis-Loukas, Bloomberg L.P. An inspiring book revealing the essence and mechanics of relevant search. 
- Ursin Stauss, Swiss Post Arms you with invaluable knowledge to temper the relevancy of search results and harness the powerful features provided by modern search engines. - Russ Cam, Elastic

Ambient Computing

2016-06-15 O'Reilly Amazon

book

Mike Barlow

data data-engineering data-security-privacy data security & privacy Analytics Cyber Security

Consider this scenario: You walk into a building and a sensor identifies you through your mobile phone. You then receive a welcoming text telling you when lunch will be served, or perhaps a health warning based on allergy information you’ve stored in your profile. Maybe you’ll be flagged as a security threat. How is that possible? This O’Reilly report explores ambient computing—hands-free, 24/7 wireless connectivity to hardware, data, and IT systems. Enabling that scenario requires a lot of work behind the scenes to determine network connectivity, device security, and personal privacy. With an ambient-computing technology stack already in the works, resolving those issues is only a matter of time. Through interviews with front-line tech pioneers—including Ari Gesher (Kairos Aerospace) and Matthew Gast (Aerohive Networks)—author Mike Barlow explores how real-time analytics can enable real-time decision making. How will simple beacons broadcast information to your phone as you pass businesses on your morning walk? How can emotional speech analysis monitor the emotional state of employees, students, or people in crowds? Pick up this report and find out.

IBM z13s Technical Guide

2016-06-14 O'Reilly Amazon

book

John P. Troy , Ewerson Palacio , Martin Soellig , Cecilia A. De Leon , Jin J. Yang , Octavian Lascu , Barbara Sannerud , Franco Pinto , Edzard Hoogerbrug

data data-engineering IBM Analytics Cloud Computing Cyber Security

Digital business has been driving the transformation of underlying information technology (IT) infrastructure to be more efficient, secure, adaptive, and integrated. IT must be able to handle the explosive growth of mobile clients and employees. It also must be able to process enormous amounts of data to provide deep and real-time insights to help achieve the greatest business impact. This IBM® Redbooks® publication addresses the new IBM z Systems™ single frame, the IBM z13s server. IBM z Systems servers are the trusted enterprise platform for integrating data, transactions, and insight. A data-centric infrastructure must always be available with a 99.999% or better availability, have flawless data integrity, and be secured from misuse. It needs to be an integrated infrastructure that can support new applications. It also needs to have integrated capabilities that can provide new mobile capabilities with real-time analytics delivered by a secure cloud infrastructure. IBM z13s servers are designed with improved scalability, performance, security, resiliency, availability, and virtualization. The superscalar design allows z13s servers to deliver a record level of capacity over the prior single frame z Systems server. In its maximum configuration, the z13s server is powered by up to 20 client characterizable microprocessors (cores) running at 4.3 GHz. This configuration can run more than 18,000 millions of instructions per second (MIPS) and up to 4 TB of client memory. The IBM z13s Model N20 is estimated to provide up to 100% more total system capacity than the IBM zEnterprise® BC12 Model H13. This book provides information about the IBM z13s server and its functions, features, and associated software support. Greater detail is offered in areas relevant to technical planning. It is intended for systems engineers, consultants, planners, and anyone who wants to understand the IBM z Systems™ functions and plan for their usage. 
It is not intended as an introduction to mainframes. Readers are expected to be generally familiar with existing IBM z Systems technology and terminology.

Pro Spark Streaming: The Zen of Real-Time Analytics Using Apache Spark

2016-06-13 O'Reilly Amazon

book

Zubair Nabi

data data-engineering apache-spark AI/ML Analytics AWS Lambda

Learn the right cutting-edge skills and knowledge to leverage Spark Streaming to implement a wide array of real-time, streaming applications. This book walks you through end-to-end real-time application development using real-world applications, data, and code. Taking an application-first approach, each chapter introduces use cases from a specific industry and uses publicly available datasets from that domain to unravel the intricacies of production-grade design and implementation. The domains covered in Pro Spark Streaming include social media, the sharing economy, finance, online advertising, telecommunication, and IoT. In the last few years, Spark has become synonymous with big data processing. DStreams enhance the underlying Spark processing engine to support streaming analysis with a novel micro-batch processing model. Pro Spark Streaming by Zubair Nabi will enable you to become a specialist in latency-sensitive applications by leveraging the key features of DStreams, micro-batch processing, and functional programming. To this end, the book includes ready-to-deploy examples and actual code. Pro Spark Streaming will act as the bible of Spark Streaming.
What You'll Learn Discover Spark Streaming application development and best practices Work with the low-level details of discretized streams Optimize production-grade deployments of Spark Streaming via configuration recipes and instrumentation using Graphite, collectd, and Nagios Ingest data from disparate sources including MQTT, Flume, Kafka, Twitter, and a custom HTTP receiver Integrate and couple with HBase, Cassandra, and Redis Take advantage of design patterns for side-effects and maintaining state across the Spark Streaming micro-batch model Implement real-time and scalable ETL using data frames, SparkSQL, Hive, and SparkR Use streaming machine learning, predictive analytics, and recommendations Mesh batch processing with stream processing via the Lambda architecture Who This Book Is For Data scientists, big data experts, BI analysts, and data architects.

Spark GraphX in Action

2016-06-13 O'Reilly Amazon

book

Michael Malak , Robin East

data data-engineering apache-spark AI/ML Analytics API

Spark GraphX in Action starts out with an overview of Apache Spark and the GraphX graph processing API. This example-based tutorial then teaches you how to configure GraphX and how to use it interactively. Along the way, you'll collect practical techniques for enhancing applications and applying machine learning algorithms to graph data. About the Technology GraphX is a powerful graph processing API for the Apache Spark analytics engine that lets you draw insights from large datasets. GraphX gives you unprecedented speed and capacity for running massively parallel and machine learning algorithms. About the Book Spark GraphX in Action begins with the big picture of what graphs can be used for. This example-based tutorial teaches you how to use GraphX interactively. You'll start with a crystal-clear introduction to building big data graphs from regular data, and then explore the problems and possibilities of implementing graph algorithms and architecting graph processing pipelines. Along the way, you'll collect practical techniques for enhancing applications and applying machine learning algorithms to graph data. What's Inside Understanding graph technology Using the GraphX API Developing algorithms for big graphs Machine learning with graphs Graph visualization About the Reader Readers should be comfortable writing code. Experience with Apache Spark and Scala is not required. About the Authors Michael Malak has worked on Spark applications for Fortune 500 companies since early 2013. Robin East has worked as a consultant to large organizations for over 15 years and is a data scientist at Worldpay. Quotes Learn complex graph processing from two experienced authors…A comprehensive guide. - Gaurav Bhardwaj, 3Pillar Global The best resource to go from GraphX novice to expert in the least amount of time. - Justin Fister, PaperRater A must-read for anyone serious about large-scale graph data mining! 
- Antonio Magnaghi, OpenMail Reveals the awesome and elegant capabilities of working with linked data for large-scale datasets. - Sumit Pal, Independent consultant

Manufacturing Performance Management using SAP OEE: Implementing and Configuring Overall Equipment Effectiveness

2016-06-07 O'Reilly Amazon

book

Dipankar Saha , Mahalakshmi Syamsunder , Sumanta Chakraborty

data data-engineering SAP Analytics Data Collection ERP

Learn how to configure, implement, enhance, and customize SAP OEE to address manufacturing performance management. Manufacturing Performance Management using SAP OEE will show you how to connect your business processes with your plant systems and how to integrate SAP OEE with ERP through standard workflows and shop floor systems for automated data collection. Manufacturing Performance Management using SAP OEE is a must-have comprehensive guide to implementing SAP OEE. It will ensure that SAP consultants and users understand how SAP OEE can offer solutions for manufacturing performance management in process industries. With this book in hand, managing shop floor execution effectively will become easier than ever. Authors Dipankar Saha and Mahalakshmi Syamsunder, both SAP manufacturing solution experts, and Sumanta Chakraborty, product owner of SAP OEE, will explain execution and processing related concepts, manual and automatic data collection through the OEE Worker UI, and how to enhance and customize interfaces and dashboards for your specific purposes. You'll learn how to capture and categorize production and loss data and use it effectively for root-cause analysis. In addition, this book will show you: Various down-time handling scenarios. How to monitor, calculate, and define standard as well as industry-specific KPIs. How to carry out standard operational analytics for continuous improvement on the shop floor, at local plant level using MII and SAP Lumira, and also global consolidated analytics at corporation level using SAP HANA. Steps to benchmark manufacturing performance to compare similar manufacturing plants' performance, leading to a more efficient and effective shop floor. Manufacturing Performance Management using SAP OEE will provide you with in-depth coverage of SAP OEE and how to effectively leverage its features.
This will allow you to efficiently manage the manufacturing process and to enhance the shop floor's overall performance, making you the sought-after SAP OEE expert in the organization. What You Will Learn Configure your ERP OEE add-on to build your plant and global hierarchy and relevant master data and KPIs Use the SAP OEE standard integration (SAP OEEINT) to integrate your ECC and OEE system to establish bi-directional integration between the enterprise and the shop floor Enable your shop floor operator on the OEE Worker UI to handle shop floor production execution Use SAP OEE as a tool for measuring manufacturing performance Enhance and customize SAP OEE to suit your specific requirements Create local plant-based reporting using SAP Lumira and MII Use standard SAP OEE HANA analytics Who This Book Is For SAP MII, ME, and OEE consultants and users who will implement and use the solution.

Implementing an Optimized Analytics Solution on IBM Power Systems

2016-06-01 O'Reilly Amazon

book

Dino Quintero , Robert Simon , Reinaldo Tetsuo Katahira , Kanako Harada , Brian Yaeger , Antonio Moreira de Oliveira Neto

data data-engineering IBM Analytics Big Data Cyber Security

This IBM® Redbooks® publication addresses how to use the virtualization strengths of the IBM POWER8® platform to solve clients' system resource utilization challenges and maximize systems' throughput and capacity. This book addresses performance tuning topics that will help answer clients' complex analytic workload requirements, help maximize systems' resources, and provide expert-level documentation to transfer the how-to skills to the worldwide teams. This book strengthens the position of IBM Analytics and Big Data solutions with a well-defined and documented deployment model within a POWER8 virtualized environment, offering clients a planned foundation for security, scaling, capacity, resilience, and optimization for analytics workloads. This book is targeted toward technical professionals (analytics consultants, technical support staff, IT Architects, and IT Specialists) who are responsible for providing analytics solutions and support on IBM Power Systems™.

Apache Spark Machine Learning Blueprints

2016-05-30 O'Reilly Amazon

book

Alex Liu

data data-engineering apache-spark AI/ML Analytics Big Data

In 'Apache Spark Machine Learning Blueprints', you'll explore how to create sophisticated and scalable machine learning projects using Apache Spark. This project-driven guide covers practical applications including fraud detection, customer analysis, and recommendation engines, helping you leverage Spark's capabilities for advanced data science tasks. What this Book will help me do Learn to set up Apache Spark efficiently for machine learning projects, unlocking its powerful processing capabilities. Integrate Apache Spark with R for detailed analytical insights, empowering your decision-making processes. Create predictive models for use cases including customer scoring, fraud detection, and risk assessment with practical implementations. Understand and utilize Spark's parallel computing architecture for large-scale machine learning tasks. Develop and refine recommendation systems capable of handling large user bases and datasets using Spark. Author(s) Alex Liu is a seasoned data scientist and software developer specializing in machine learning and big data technology. With extensive experience in using Apache Spark for predictive analytics, Alex has successfully built and deployed scalable solutions across industries. Their teaching approach combines theory and practical insights, making cutting-edge technologies accessible and actionable. Who is it for? This book is ideal for data analysts, data scientists, and developers with a foundation in machine learning who are eager to apply their knowledge in big data contexts. If you have a basic familiarity with Apache Spark and its ecosystem, and you're looking to enhance your ability to build machine learning applications, this resource is for you. It's particularly valuable for those aiming to utilize Spark for extensive data operations and gain practical, project-based insights.

IBM z13 Technical Guide

2016-05-27 O'Reilly Amazon

book

Ewerson Palacio , Martin Soellig , John Troy , Cecilia A. De Leon , Octavian Lascu , Jin Yang , Barbara Sannerud , Franco Pinto , Edzard Hoogerbrug

data data-engineering IBM Analytics Cloud Computing Cyber Security

Digital business has been driving the transformation of underlying IT infrastructure to be more efficient, secure, adaptive, and integrated. Information Technology (IT) must be able to handle the explosive growth of mobile clients and employees. IT also must be able to use enormous amounts of data to provide deep and real-time insights to help achieve the greatest business impact. This IBM® Redbooks® publication addresses the IBM Mainframe, the IBM z13™. The IBM z13 is the trusted enterprise platform for integrating data, transactions, and insight. A data-centric infrastructure must always be available with a 99.999% or better availability, have flawless data integrity, and be secured from misuse. It needs to be an integrated infrastructure that can support new applications. It needs to have integrated capabilities that can provide new mobile capabilities with real-time analytics delivered by a secure cloud infrastructure. IBM z13 is designed with improved scalability, performance, security, resiliency, availability, and virtualization. The superscalar design allows the z13 to deliver a record level of capacity over the prior IBM z Systems™. In its maximum configuration, z13 is powered by up to 141 client characterizable microprocessors (cores) running at 5 GHz. This configuration can run more than 110,000 millions of instructions per second (MIPS) and up to 10 TB of client memory. The IBM z13 Model NE1 is estimated to provide up to 40% more total system capacity than the IBM zEnterprise® EC12 (zEC12) Model HA1. This book provides information about the IBM z13 and its functions, features, and associated software support. Greater detail is offered in areas relevant to technical planning. It is intended for systems engineers, consultants, planners, and anyone who wants to understand the IBM z Systems functions and plan for their usage. It is not intended as an introduction to mainframes.
Readers are expected to be generally familiar with existing IBM z Systems technology and terminology.

Streaming Architecture

2016-05-25 O'Reilly Amazon

book

Ellen Friedman , Ted Dunning

data data-engineering streaming-messaging streaming-architecture Analytics Flink

More and more data-driven companies are looking to adopt stream processing and streaming analytics. With this concise ebook, you’ll learn best practices for designing a reliable architecture that supports this emerging big-data paradigm. Authors Ted Dunning and Ellen Friedman (Real World Hadoop) help you explore some of the best technologies to handle stream processing and analytics, with a focus on the upstream queuing or message-passing layer. To illustrate the effectiveness of these technologies, this book also includes specific use cases. Ideal for developers and non-technical people alike, this book describes: Key elements in good design for streaming analytics, focusing on the essential characteristics of the messaging layer New messaging technologies, including Apache Kafka and MapR Streams, with links to sample code Technology choices for streaming analytics: Apache Spark Streaming, Apache Flink, Apache Storm, and Apache Apex How stream-based architectures are helpful to support microservices Specific use cases such as fraud detection and geo-distributed data streams Ted Dunning is Chief Applications Architect at MapR Technologies, and active in the open source community. He currently serves as VP for Incubator at the Apache Foundation, as a champion and mentor for a large number of projects, and as committer and PMC member of the Apache ZooKeeper and Drill projects. Ted is on Twitter as @ted_dunning. Ellen Friedman, a committer for the Apache Drill and Apache Mahout projects, is a solutions consultant and well-known speaker and author, currently writing mainly about big data topics. With a PhD in Biochemistry, she has years of experience as a research scientist and has written about a variety of technical topics. Ellen is on Twitter as @Ellen_Friedman.

Big Data in Practice

2016-05-02 O'Reilly Amazon

book

Bernard Marr

data data-engineering Analytics Big Data Microsoft

The best-selling author of Big Data is back, this time with a unique and in-depth insight into how specific companies use big data. Big data is on the tip of everyone's tongue. Everyone understands its power and importance, but many fail to grasp the actionable steps and resources required to utilise it effectively. This book fills the knowledge gap by showing how major companies are using big data every day, from an up-close, on-the-ground perspective. From technology, media and retail, to sport teams, government agencies and financial institutions, learn the actual strategies and processes being used to learn about customers, improve manufacturing, spur innovation, improve safety and so much more. Organised for easy dip-in navigation, each chapter follows the same structure to give you the information you need quickly. For each company profiled, learn what data was used, what problem it solved and the processes put in place to make it practical, as well as the technical details, challenges and lessons learned from each unique scenario. Learn how predictive analytics helps Amazon, Target, John Deere and Apple understand their customers. Discover how big data is behind the success of Walmart, LinkedIn, Microsoft and more. Learn how big data is changing medicine, law enforcement, hospitality, fashion, science and banking. Develop your own big data strategy by accessing additional reading materials at the end of each chapter.

Apache Hive Cookbook

2016-04-29 O'Reilly Amazon

book

Saurabh Chauhan , Hanish Bansal , Shrey Mehrotra

data data-engineering Hadoop apache-hive Analytics Big Data

Apache Hive Cookbook is a comprehensive resource for mastering Apache Hive, a tool that bridges the gap between SQL and Big Data processing. Through guided recipes, you'll acquire essential skills in Hive query development, optimization, and integration with modern big data frameworks. What this Book will help me do Design efficient Hive query structures for big data analytics. Optimize data storage and query execution using partitions and buckets. Integrate Hive seamlessly with frameworks like Spark and Hadoop. Understand and utilize the HiveQL syntax to perform advanced analytical processing. Implement practical solutions to secure, maintain, and scale Hive environments. Author(s) Hanish Bansal, Saurabh Chauhan, and Shrey Mehrotra bring their extensive expertise in big data technologies and Hive to this cookbook. With years of practical experience and deep technical knowledge, they offer a collection of solutions and best practices that reflect real-world use cases. Their commitment to clarity and depth makes this book an invaluable resource for exploring Hive to its fullest potential. Who is it for? This book is perfect for data professionals, engineers, and developers looking to enhance their capabilities in big data analytics using Hive. It caters to those with a foundational understanding of big data frameworks and some familiarity with SQL. Whether you're planning to optimize data handling or integrate Hive with other data tools, this guide helps you achieve your goals. Step into the world of efficient data analytics with Apache Hive through structured learning paths.

IT Modernization using Catalogic ECX Copy Data Management and IBM Spectrum Storage

2016-04-05 O'Reilly Amazon

book

Jon Tate , Christian Burns , Peter Eicher , Kamlesh Lad , Prashant Jagannathan

data data-engineering IBM Analytics Cloud Computing Data Management

Data is the currency of the new economy, and organizations are increasingly tasked with finding better ways to protect, recover, access, share, and use data. Traditional storage technologies are being stretched to the breaking point. This challenge is not because of storage hardware performance, but because management tools and techniques have not kept pace with new requirements. Primary data growth rates of 35% to 50% annually only amplify the problem. Organizations of all sizes find themselves needing to modernize their IT processes to enable critical new use cases such as storage self-service, Development and Operations (DevOps), and integration of data centers with the Cloud. They are equally challenged with improving management efficiencies for long established IT processes such as data protection, disaster recovery, reporting, and business analytics. Access to copies of data is the one common feature of all these use cases. However, the slow, manual processes common to IT organizations, including a heavy reliance on labor-intensive scripting and disparate tool sets, are no longer able to deliver the speed and agility required in today's fast-paced world. Copy Data Management (CDM) is an IT modernization technology that focuses on using existing data in a manner that is efficient, automated, scalable, and easy to use, delivering the data access that is urgently needed to meet the new use cases. Catalogic ECX, with IBM® storage, provides in-place copy data management that modernizes IT processes, enables key use cases, and does it all within existing infrastructure. This IBM Redbooks® publication shows how Catalogic Software and IBM have partnered together to create an integrated solution that addresses today's IT environment.

Hadoop Real-World Solutions Cookbook - Second Edition

2016-03-31 O'Reilly Amazon

book

Tanmay Deshpande

data data-engineering Hadoop AI/ML Analytics Big Data

Master the full potential of big data processing using Hadoop with this comprehensive guide. Featuring over 90 practical recipes, this book helps you streamline data workflows and implement machine learning models with tools like Spark, Hive, and Pig. By the end, you'll confidently handle complex data problems and optimize big data solutions effectively. What this Book will help me do Install and manage a Hadoop 2.x cluster efficiently to suit your data processing needs. Explore and utilize advanced tools like Hive, Pig, and Flume for seamless big data analysis. Master data import/export processes with Sqoop and workflow automation using Oozie. Implement machine learning and analytics tasks using Mahout and Apache Spark. Store and process data flexibly across formats like Parquet, ORC, RC, and more. Author(s) Tanmay Deshpande is an expert in big data processing and analytics with years of hands-on experience in implementing Hadoop-based solutions for real-world problems. Known for a clear and pragmatic writing style, Deshpande brings actionable wisdom and best practices to the forefront, helping readers excel in managing and utilizing big data systems. Who is it for? Designed for technical enthusiasts and professionals, this book is ideal for those familiar with basic big data concepts. If you are looking to expand your expertise in Hadoop's ecosystem and implement data-driven solutions, this book will guide you through essential skills and advanced techniques to efficiently manage complex big data projects.

MongoDB in Action, Second Edition

2016-03-29 O'Reilly Amazon

book

Douglas Garrett , Shaun Verch , Kyle Banker , Tim Hawkins , Peter Bakkum

data data-engineering nosql-databases MongoDB Analytics Big Data

MongoDB in Action, Second Edition is a completely revised and updated version. It introduces MongoDB 3.0 and the document-oriented database model. This perfectly paced book gives you both the big picture you'll need as a developer and enough low-level detail to satisfy system engineers. About the Technology This document-oriented database was built for high availability, supports rich, dynamic schemas, and lets you easily distribute data across multiple servers. MongoDB 3.0 is flexible, scalable, and very fast, even with big data loads. About the Book Lots of examples will help you develop confidence in the crucial area of data modeling. You'll also love the deep explanations of each feature, including replication, auto-sharding, and deployment. What's Inside Indexes, queries, and standard DB operations Aggregation and text searching Map-reduce for custom aggregations and reporting Deploying for scale and high availability Updated for Mongo 3.0 About the Reader Written for developers. No previous MongoDB or NoSQL experience is assumed. About the Authors After working at MongoDB, Kyle Banker is now at a startup. Peter Bakkum is a developer with MongoDB expertise. Shaun Verch has worked on the core server team at MongoDB. A Genentech engineer, Doug Garrett is one of the winners of the MongoDB Innovation Award for Analytics. A software architect, Tim Hawkins has led search engineering at Yahoo Europe.
Technical Contributor: Wouter Thielen Technical Editor: Mihalis Tsoukalos Quotes A thorough manual for learning, practicing, and implementing MongoDB - Jeet Marwah, Acer Inc. A must-read to properly use MongoDB and model your data in the best possible way. - Hernan Garcia, Betterez Inc. Provides all the necessary details to get you jump-started with MongoDB. - Gregor Zurowski, Independent Software Development Consultant Awesome! MongoDB in a nutshell. - Hardy Ferentschik, Red Hat

Hadoop: What You Need to Know

2016-03-15 O'Reilly Amazon

book

Donald Miner

data data-engineering Hadoop Analytics Data Analytics Data Science

Hadoop has revolutionized data processing and enterprise data warehousing, but its explosive growth has come with a large amount of uncertainty, hype, and confusion. With this report, enterprise decision makers will receive a concise crash course on what Hadoop is and why it’s important. Hadoop represents a major shift from traditional enterprise data warehousing and data analytics, and its technology can be daunting at first. Donald Miner, founder of the data science firm Miner & Kasch, covers just enough ground so you can make intelligent decisions about Hadoop in your enterprise. By the end of this report, you’ll know the basics of technologies such as HDFS, MapReduce, and YARN, without becoming mired in the details. Not only will you learn the basics of how Hadoop works and why it’s such an important technology, you’ll get examples of how you should probably be using it.

Self-Service Analytics

2016-03-15 O'Reilly Amazon

book

Sandra Swanson

data data-engineering storage-repositories data-lake Analytics Data Governance

Organizations today are swimming in data, but most of them manage to analyze only a fraction of what they collect. To help build a stronger data-driven culture, many organizations are adopting a new approach called self-service analytics. This O’Reilly report examines how this approach provides data access to more people across a company, allowing business users to work with data themselves and create their own customized analyses. The result? More eyes looking at more data in more ways. Along with the perceived benefits, author Sandra Swanson also delves into the potential pitfalls of self-service analytics: balancing greater data access with concerns about security, data governance, and siloed data stores. Read this report and gain insights from enterprise tech (Yahoo), government (the City of Chicago), and disruptive retail (Warby Parker and Talend). Learn how these organizations are handling self-service analytics in practice. Sandra Swanson is a Chicago-based writer who’s covered technology, science, and business for dozens of publications, including ScientificAmerican.com. Connect with her on Twitter (@saswanson) or at www.saswanson.com.

IBM z13 and IBM z13s Technical Introduction

2016-03-08 O'Reilly Amazon

book

Ewerson Palacio , Martin Soellig , John Troy , Cecilia A. De Leon , Jin J. Yang , Bill White , Barbara Sannerud , Franco Pinto , Edzard Hoogerbrug

data data-engineering IBM Agile/Scrum Analytics API

This IBM® Redbooks® publication introduces the latest IBM z Systems™ platforms, the IBM z13™ and IBM z13s. It includes information about the z Systems environment and how it can help integrate data, transactions, and insight for faster and more accurate business decisions. The z13 and z13s are state-of-the-art data and transaction systems that deliver advanced capabilities that are vital to modern IT infrastructures. These capabilities include: accelerated data and transaction serving; integrated analytics; access to the API economy; agile development and operations; efficient, scalable, and secure cloud services; and end-to-end security for data and transactions. This book explains how these systems use both new innovations and traditional z Systems strengths to satisfy growing demand for cloud, analytics, and mobile applications. With one of these z Systems platforms as the base, applications can run in a trusted, reliable, and secure environment that both improves operations and lessens business risk.

IBM Spectrum Family: IBM Spectrum Control Standard Edition

2016-03-02 O'Reilly Amazon

book

Marion Hejny , Karen Orlando , Tiberiu Hajas , Lloyd Dean , Ruben Moreno , Hope Rodriguez , Johanna Hislop

data data-engineering IBM ibm-spectrum-control Analytics Data Management

IBM® Spectrum Control (Spectrum Control), a member of the IBM Spectrum™ Family of products, is the next-generation data management solution for software-defined environments (SDEs). With support for block, file, and object workloads, software-defined storage, predictive analytics, and automated, advanced monitoring that proactively identifies storage performance problems, Spectrum Control enables administrators to provide efficient management for heterogeneous storage environments. IBM Spectrum Control™ (formerly IBM Tivoli® Storage Productivity Center) delivers a complete set of functions to manage IBM Spectrum Virtualize™, IBM Spectrum Accelerate™, and IBM Spectrum Scale™ storage infrastructures, and traditional IBM and select third-party storage hardware systems. This IBM Redbooks® publication provides practical examples and use cases that can be deployed with IBM Spectrum Control Standard Edition, with an overview of IBM Spectrum Control Advanced Edition. This book complements the Spectrum Control IBM Knowledge Center, which is referenced throughout this book for product, installation, and implementation details. You can find this resource at the following website: IBM Spectrum Control Knowledge Center. Also provided are descriptions and an architectural overview of the IBM Spectrum Family, highlighting Spectrum Control, as integrated into software-defined storage environments. This publication is intended for storage administrators, clients who are responsible for maintaining IT and business infrastructures, and anyone who wants to learn more about employing Spectrum Control and Spectrum Control Standard Edition.

talk-data.com
