Analytics

Accelerating Data Transformation with IBM DB2 Analytics Accelerator for z/OS Understanding and Using Accelerator-only Tables

2015-12-10 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Steffen Knoll, Uwe Denneler, Patric Becker, Frank Neumann, Wolfgang Hengstler, Eberhard Hechler, Guenter Georg Schoellmann, Khadija Souissi, Timm Zimmermann, Ute Baumbach

API Data Modelling IBM RDBMS SPSS data data-engineering ibm-db2 relational-databases

Transforming data from operational data models to purpose-oriented data structures has been commonplace for decades. Data transformations are heavily used in all types of industries to provide information to various users at different levels. Depending on individual needs, the transformed data is stored in various systems. Sending operational data to other systems for further processing is then required, which adds considerable complexity to an existing information technology (IT) infrastructure. Maintenance of additional hardware and software is one cost; potential inconsistencies and individually managed refresh cycles are others. For decades, there was no simple and efficient way to perform data transformations on the source system of operational data. With IBM® DB2® Analytics Accelerator, DB2 for z/OS is now in a unique position to complete these transformations in an efficient and well-performing way. DB2 for z/OS completes these transformations while staying connected to the same platform used for operational transactions, helping you to minimize the effort of managing existing IT infrastructure. Real-time analytics on incoming operational transactions is another demand. Creating a comprehensive scoring model to detect specific patterns inside your data can easily require multiple iterations and multiple hours to complete. By enabling a first set of analytical functionality in DB2 Analytics Accelerator, those dedicated mining algorithms can now be run on an accelerator to efficiently perform these modeling tasks. Given the speed of query processing on an accelerator, these modeling tasks can now be performed much more quickly than with traditional relational database management systems. This speed enables you to keep your scoring algorithms more up-to-date, and ultimately adapt more quickly to constantly changing customer behaviors.
This IBM Redbooks® publication describes the new table type that is introduced with DB2 Analytics Accelerator V4.1 PTF5 that enables more efficient data transformations. These tables are called accelerator-only tables, and can exist on an accelerator only. The tables benefit from the accelerator performance characteristics, while maintaining access through existing DB2 for z/OS application programming interfaces (APIs). Additionally, we describe the newly introduced analytical capabilities with DB2 Analytics Accelerator V5.1, putting you in the position to efficiently perform data modeling for online analytical requirements in your DB2 for z/OS environment. This book is intended for technical decision-makers who want to get a broad understanding of the analytical capabilities and accelerator-only tables of DB2 Analytics Accelerator. In addition, you learn about how these capabilities can be used to accelerate in-database transformations and in-database analytics in various environments and scenarios, including the following: multi-step processing and reporting in IBM DB2 Query Management Facility™, IBM Campaign, or MicroStrategy environments; in-database transformations using IBM InfoSphere® DataStage®; ad hoc data analysis for data scientists; and in-database analytics using IBM SPSS® Modeler.

Systems of Insight for Digital Transformation: Using IBM Operational Decision Manager Advanced and Predictive Analytics

2015-12-03 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Matthew Roberts, Hector H. Diaz Lopez, Whei-Jen Chen, Rajeev Kamath, Alexander Kelly, Yee Pin Yheng

IBM data data-engineering

Systems of record (SORs) are engines that generate value for your business. Systems of engagement (SOEs) are always evolving and generating new customer-centric experiences and new opportunities to capitalize on the value in the systems of record. The highest value is gained when systems of record and systems of engagement are brought together to deliver insight. Systems of insight (SOIs) monitor and analyze what is going on with various behaviors in the systems of engagement and information being stored or transacted in the systems of record. SOIs seek new opportunities, risks, and operational behavior that need to be reported or acted on to optimize business outcomes. Systems of insight are at the core of the Digital Experience, which tries to derive insights from the enormous amount of data generated by automated processes and customer interactions. Systems of insight can also provide the ability to apply analytics and rules to real-time data as it flows within, throughout, and beyond the enterprise (applications, databases, mobile, social, Internet of Things) to gain the desired insight. Deriving this insight is a key step toward being able to make the best decisions and take the most appropriate actions. Examples of such actions are: improving the number of satisfied clients; identifying clients at risk of leaving and incentivizing them to stay loyal; identifying patterns of risk or fraudulent behavior and taking action to minimize it as early as possible; and detecting patterns of behavior in operational systems and transportation that lead to failures, delays, and maintenance, and taking early action to minimize risks and costs. IBM® Operational Decision Manager is a decision management platform that provides capabilities that support both event-driven insight patterns and business-rule-driven scenarios. It can also easily be used in combination with other IBM Analytics solutions, as the detailed examples will show.
IBM Operational Decision Manager Advanced, along with complementary IBM software offerings that also provide capability for systems of insight, provides a way to deliver the greatest value to your customers and your business. IBM Operational Decision Manager Advanced brings together data from different sources to recognize meaningful trends and patterns. It empowers business users to define, manage, and automate repeatable operational decisions. As a result, organizations can create and shape customer-centric business moments. This IBM Redbooks® publication explains the key concepts of systems of insight and how to implement a system of insight solution with examples. It is intended for IT architects and professionals who are responsible for implementing a system of insight solution requiring event-based context pattern detection and deterministic decision services to enhance other analytics solution components with IBM Operational Decision Manager Advanced.

Data Munging with Hadoop

2015-11-27 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Casey Stella, Ofer Mendelevitch

AI/ML Data Quality Data Science Hadoop Hive NLP Spark data data-engineering

The Example-Rich, Hands-On Guide to Data Munging with Apache Hadoop™
Data scientists spend much of their time “munging” data: handling day-to-day tasks such as data cleansing, normalization, aggregation, sampling, and transformation. These tasks are both critical and surprisingly interesting. Most important, they deepen your understanding of your data’s structure and limitations: crucial insight for improving accuracy and mitigating risk in any analytical project. Now, two leading Hortonworks data scientists, Ofer Mendelevitch and Casey Stella, bring together powerful, practical insights for effective Hadoop-based data munging of large datasets. Drawing on extensive experience with advanced analytics, the authors offer realistic examples that address the common issues you’re most likely to face. They describe each task in detail, presenting example code based on widely used tools such as Pig, Hive, and Spark. This concise, hands-on eBook is valuable for every data scientist, data engineer, and architect who wants to master data munging: not just in theory, but in practice with the field’s #1 platform, Hadoop. Coverage includes: a framework for understanding the various types of data quality checks, including cell-based rules, distribution validation, and outlier analysis; assessing tradeoffs in common approaches to imputing missing values; implementing quality checks with Pig or Hive UDFs; transforming raw data into “feature matrix” format for machine learning algorithms; choosing features and instances; implementing text features via “bag-of-words” and NLP techniques; handling time-series data via frequency- or time-domain methods; and manipulating feature values to prepare for modeling. Data Munging with Hadoop is part of a larger, forthcoming work entitled Data Science Using Hadoop.
To be notified when the larger work is available, register your purchase of Data Munging with Hadoop at informit.com/register and check the box “I would like to hear from InformIT and its family of brands about products and special offers.”

Learning ELK Stack

2015-11-26 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Saurabh Chhajed

Data Analytics DevOps ELK Kibana Logstash Unix data data-engineering elastic-stack-elk-stack elastic stack (elk stack) elasticsearch search

Dive into the ELK stack (Elasticsearch, Logstash, and Kibana) with this comprehensive guide. Designed to help you set up, configure, and utilize the stack to its fullest, this book provides you with the skills to manage data with precision, enrich logs, and create meaningful analytics. Develop an entire data pipeline and cultivate powerful visual insights from your data. What this Book will help me do: Install and configure Elasticsearch, Logstash, and Kibana to establish a robust ELK stack setup. Understand the role of each component in the stack and master the basics of log analysis. Create custom Logstash plugins to handle non-standard data processing requirements. Develop interactive and insightful data visualizations and dashboards using Kibana. Implement a complete data pipeline and gain expertise in data indexing, searching, and reporting. Author(s): Saurabh Chhajed brings depth of technical understanding and practical experience to the exploration of the ELK Stack. With a strong background in open-source technologies and data analytics, Chhajed has worked extensively with ELK stack implementations in real-world scenarios. Through this guide, the author offers clarity, detailed examples, and actionable insights for professionals seeking to improve their data systems. Who is it for? This book is targeted towards software developers, data analysts, and DevOps engineers seeking to harness the potential of the ELK stack for data analysis and logging. It is most suitable for intermediate-level professionals with basic knowledge of Unix or programming. If your aim is to gain insights and build metrics from diverse data formats utilizing open-source technologies, this book is crafted for you.

Elasticsearch in Action

2015-11-17 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Radu Gheorghe, Matthew Lee Hinman, Roy Russo

API Bash Cloud Computing Data Science ELK data data-engineering elasticsearch search

Elasticsearch in Action teaches you how to build scalable search applications using Elasticsearch. You'll ramp up fast, with an informative overview and an engaging introductory example. Within the first few chapters, you'll pick up the core concepts you need to implement basic searches and efficient indexing. With the fundamentals well in hand, you'll go on to gain an organized view of how to optimize your design. Perfect for developers and administrators building and managing search-oriented applications.
About the Technology: Modern search seems like magic: you type a few words and the search engine appears to know what you want. With the Elasticsearch real-time search and analytics engine, you can give your users this magical experience without having to do complex low-level programming or understand advanced data science algorithms. You just install it, tweak it, and get on with your work.
About the Book: Elasticsearch in Action teaches you how to write applications that deliver professional quality search. As you read, you'll learn to add basic search features to any application, enhance search results with predictive analysis and relevancy ranking, and use saved data from prior searches to give users a custom experience. This practical book focuses on Elasticsearch's REST API via HTTP. Code snippets are written mostly in bash using cURL, so they're easily translatable to other languages.
What's Inside: What is a great search application?; building scalable search solutions; using Elasticsearch with any language; configuration and tuning.
About the Reader: This book is for developers and administrators building and managing search-oriented applications.
About the Authors: Radu Gheorghe is a search consultant and software engineer. Matthew Lee Hinman develops highly available, cloud-based systems. Roy Russo is a specialist in predictive analytics.
Quotes:
To understand how a modern search infrastructure works is a daunting task. Radu, Matt, and Roy make it an engaging, hands-on experience. - Sen Xu, Twitter Inc.
An indispensable guide to the challenges of search of semi-structured data. - Artur Nowak, Evidence Prime
The best resource for a complex topic. Highly recommended. - Daniel Beck, juris GmbH
Took me from confused to confident in a week. - Alan McCann, Givsum.com

Streaming Analytics with IBM Streams: Analyze More, Act Faster, and Get Continuous Insights

2015-11-16 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Jacques Roy (IBM)

Big Data Data Analytics IBM Data Streaming data data-engineering streaming-architecture streaming-messaging

Gain a competitive edge with IBM Streams
Turn data-in-motion into solid business opportunities with IBM Streams, and let Streaming Analytics with IBM Streams show you how. This comprehensive guide starts out with a brief overview of different technologies used for big data processing and explanations of how data-in-motion can be utilized for business advantages. You will learn how to apply big data analytics and how they benefit from data-in-motion. Discover all about Streams, starting with the main components, then dive further into Streams installation, upgrade, and management capabilities, including tools used for production. Through a solid understanding of big data in motion, detailed illustrations, endnotes that provide additional learning resources, and end-of-chapter summaries with helpful insight, data analysts and professionals looking to get more from their data will benefit from expert insight on: data-in-motion processing and how it can be applied to generate new business opportunities; the three approaches to processing data in motion and the pros and cons of each; the main components of Streams, from runtime to installation and administration; multiple purposes of the Text Analytics toolkit; the evolving Streams ecosystem; and a detailed roadmap for programmers to quickly become fluent with Streams. Data-in-motion is rapidly becoming a business tool used to discover more about customers and opportunities; however, it is only valuable if you have the tools and knowledge to analyze and apply it. This is an expert guide to IBM Streams and how you can harness this powerful tool to gain a competitive business edge.

Building Real-Time Data Pipelines

2015-11-15 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Kevin White, Steven Camina, Gary Orenstein, Conor Doherty

ETL/ELT Marketing Data Streaming data data-engineering real-time-analytics streaming-messaging

Traditional data processing infrastructures, especially those that support applications, weren’t designed for our mobile, streaming, and online world. This O’Reilly report examines how today’s distributed, in-memory database management systems (IMDBMS) enable you to make quick decisions based on real-time data. In this report, executives from MemSQL Inc. provide options for using in-memory architectures to build real-time data pipelines. If you want to instantly track user behavior on websites or mobile apps, generate reports on a changing dataset, or detect anomalous activity in your system as it occurs, you’ll learn valuable lessons from some of the largest and most successful tech companies focused on in-memory databases. Explore the architectural principles of modern in-memory databases. Understand what’s involved in moving from data silos to real-time data pipelines. Run transactions and analytics in a single database, without ETL. Minimize complexity by architecting a multipurpose data infrastructure. Learn guiding principles for developing an optimally architected operational system. Provide persistence and high availability mechanisms for real-time data. Choose an in-memory architecture flexible enough to scale across a variety of deployment options.
Conor Doherty, Data Engineer at MemSQL, is responsible for creating content around database innovation, analytics, and distributed systems. Gary Orenstein, Chief Marketing Officer at MemSQL, leads marketing strategy, product management, communications, and customer engagement. Kevin White is the Director of Operations and a content contributor at MemSQL. Steven Camiña is a Principal Product Manager at MemSQL. His experience spans B2B enterprise solutions, including databases and middleware platforms.

Cassandra Design Patterns - Second Edition

2015-11-04 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Rajanarayanan Thottuvaikkatumana

Big Data Cassandra NoSQL RDBMS data data-engineering nosql-databases

Cassandra Design Patterns is your guide to harnessing the full potential of Apache Cassandra's distributed database capabilities through advanced design practices. Whether you're migrating from an RDBMS or implementing scalable storage for big data, this book provides clear strategies, practical examples, and real-world use cases demonstrating effective design patterns. What this Book will help me do: Learn to integrate Cassandra with existing RDBMS solutions, enabling hybrid data architecture. Understand and implement key design patterns for distributed, scalable databases. Master the transition from RDBMS or cache systems to Cassandra with minimal disruption. Dive into time-series and temporal data patterns unique to Cassandra's strengths. Apply learned design patterns directly to real-world big data scenarios for analytics. Author(s): Rajanarayanan Thottuvaikkatumana, the author of Cassandra Design Patterns, is an expert in distributed systems and holds extensive experience in designing and implementing big data solutions. His hands-on approach to Cassandra is evident throughout the book as he bridges theoretical knowledge with practical applications. Rajanarayanan's approachable writing style aims to make complex concepts accessible. Who is it for? This book is ideal for big data developers and system architects who are familiar with the basics of Cassandra and are looking to deepen their understanding of design patterns for robust applications. Readers should have experience with relational databases and desire to migrate or integrate these concepts with NoSQL systems. Whether you're building solutions for data scalability, high availability, or analytics, Cassandra Design Patterns positions itself as an essential resource.

Real Time Analytics with SAP Hana

2015-10-30 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Vinay Singh

Data Modelling SAP data data-engineering relational-databases

"Real Time Analytics with SAP HANA" offers a comprehensive, step-by-step guide to mastering analytics and data modeling in the powerful SAP HANA environment. This book covers everything from basic data modeling concepts to more advanced techniques like creating calculation views and leveraging SAP HANA artifacts. What this Book will help me do Understand and build analytics/data models in the SAP HANA environment. Create schemas, packages, and delivery units in SAP HANA Studio. Master real-time data replication using SLT and SAP HANA Studio. Learn about full-text search, fuzzy search, and other analytical capabilities in SAP HANA. Develop comprehensive use cases combining SAP HANA concepts and tools. Author(s) Vinay Singh, the author of this book, is a seasoned SAP HANA expert with extensive experience in analytics and data modeling. He has worked on multiple SAP HANA implementation and migration projects and brings this expertise into his writing. His practical examples and hands-on approach make SAP HANA concepts accessible to learners at all levels. Who is it for? This book is ideal for SAP HANA data modelers, developers, implementation or migration consultants, project managers, and architects. It is designed for individuals aiming to enhance their skill set in SAP HANA and master real-time analytics. Whether you are actively working with SAP HANA or just starting, this book will serve as a valuable guide.

Hadoop 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem

2015-10-26 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Douglas Eadline

Big Data Data Analytics Data Lake DevOps Hadoop Apache HBase HDFS Hive Linux RDBMS Spark data

Get Started Fast with Apache Hadoop® 2, YARN, and Today’s Hadoop Ecosystem
With Hadoop 2.x and YARN, Hadoop moves beyond MapReduce to become practical for virtually any type of data processing. Hadoop 2.x and the Data Lake concept represent a radical shift away from conventional approaches to data usage and storage. Hadoop 2.x installations offer unmatched scalability and breakthrough extensibility that supports new and existing Big Data analytics processing methods and models. Hadoop® 2 Quick-Start Guide is the first easy, accessible guide to Apache Hadoop 2.x, YARN, and the modern Hadoop ecosystem. Building on his unsurpassed experience teaching Hadoop and Big Data, author Douglas Eadline covers all the basics you need to know to install and use Hadoop 2 on personal computers or servers, and to navigate the powerful technologies that complement it. Eadline concisely introduces and explains every key Hadoop 2 concept, tool, and service, illustrating each with a simple “beginning-to-end” example and identifying trustworthy, up-to-date resources for learning more. This guide is ideal if you want to learn about Hadoop 2 without getting mired in technical details. Douglas Eadline will bring you up to speed quickly, whether you’re a user, admin, devops specialist, programmer, architect, analyst, or data scientist.
Coverage includes: understanding what Hadoop 2 and YARN do, and how they improve on Hadoop 1 with MapReduce; understanding Hadoop-based Data Lakes versus RDBMS Data Warehouses; installing Hadoop 2 and core services on Linux machines, virtualized sandboxes, or clusters; exploring the Hadoop Distributed File System (HDFS); understanding the essentials of MapReduce and YARN application programming; simplifying programming and data movement with Apache Pig, Hive, Sqoop, Flume, Oozie, and HBase; observing application progress, controlling jobs, and managing workflows; managing Hadoop efficiently with Apache Ambari, including recipes for HDFS to NFSv3 gateway, HDFS snapshots, and YARN configuration; and learning basic Hadoop 2 troubleshooting, and installing Apache Hue and Apache Spark.

VersaStack Solution by Cisco and IBM with SQL, Spectrum Control, and Spectrum Protect

2015-10-26 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Asher Pemberton, Sanjeev Naldurgkar, Vadi Bhatt, Jon Tate, Filip Van Den Neucker

Agile/Scrum Cloud Computing IBM Modern Data Stack Microsoft Fabric SQL data data-engineering ibm-spectrum-control

Dynamic organizations want to accelerate growth while reducing costs. To do so, they must speed the deployment of business applications and adapt quickly to any changes in priorities. Organizations today require an IT infrastructure to be easy, efficient, and versatile. The VersaStack solution by Cisco and IBM® can help you accelerate the deployment of your data centers. It reduces costs by more efficiently managing information and resources while maintaining your ability to adapt to business change. The VersaStack solution combines the innovation of Cisco UCS Integrated Infrastructure with the efficiency of the IBM Storwize® storage system. The Cisco UCS Integrated Infrastructure includes the Cisco Unified Computing System (Cisco UCS), Cisco Nexus and Cisco MDS switches, and Cisco UCS Director. The IBM Storwize V7000 enhances virtual environments with its Data Virtualization, IBM Real-time Compression™, and IBM Easy Tier® features. These features deliver extraordinary levels of performance and efficiency. The VersaStack solution is Cisco Application Centric Infrastructure (ACI) ready. Your IT team can build, deploy, secure, and maintain applications through a more agile framework. Cisco Intercloud Fabric capabilities help enable the creation of open and highly secure solutions for the hybrid cloud. These solutions accelerate your IT transformation while delivering dramatic improvements in operational efficiency and simplicity. Cisco and IBM are global leaders in the IT industry. The VersaStack solution gives you the opportunity to take advantage of integrated infrastructure solutions that are targeted at enterprise applications, analytics, and cloud solutions. The VersaStack solution is backed by Cisco Validated Designs (CVD) to provide faster delivery of applications, greater IT efficiency, and less risk. 
This IBM Redbooks® publication is aimed at experienced storage administrators who are tasked with deploying a VersaStack solution with Microsoft SQL Server, IBM Spectrum™ Protect, and IBM Spectrum Control™.

Sams Teach Yourself: Big Data Analytics with Microsoft HDInsight in 24 Hours

2015-10-21 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Arshad Ali, Manpreet Singh (Cognizant)

BI Big Data Cloud Computing Data Analytics Hadoop Apache HBase Hive Microsoft NoSQL Power BI Spark SSIS

In just 24 lessons of one hour or less, Sams Teach Yourself Big Data Analytics with Microsoft HDInsight in 24 Hours helps you leverage Hadoop’s power on a flexible, scalable cloud platform using Microsoft’s newest business intelligence, visualization, and productivity tools. This book’s straightforward, step-by-step approach shows you how to provision, configure, monitor, and troubleshoot HDInsight and use Hadoop cloud services to solve real analytics problems. You’ll gain more of Hadoop’s benefits, with less complexity, even if you’re completely new to Big Data analytics. Every lesson builds on what you’ve already learned, giving you a rock-solid foundation for real-world success. Practical, hands-on examples show you how to apply what you learn. Quizzes and exercises help you test your knowledge and stretch your skills. Notes and tips point out shortcuts and solutions.
Learn how to: master core Big Data and NoSQL concepts, value propositions, and use cases; work with key Hadoop features, such as HDFS2 and YARN; quickly install, configure, and monitor Hadoop (HDInsight) clusters in the cloud; automate provisioning, customize clusters, install additional Hadoop projects, and administer clusters; integrate, analyze, and report with Microsoft BI and Power BI; automate workflows for data transformation, integration, and other tasks; use Apache HBase on HDInsight; use Sqoop or SSIS to move data to or from HDInsight; perform R-based statistical computing on HDInsight datasets; accelerate analytics with Apache Spark; run real-time analytics on high-velocity data streams; and write MapReduce, Hive, and Pig programs.
Register your book at informit.com/register for convenient access to downloads, updates, and corrections as they become available.

Fast Data: Smart and at Scale

2015-10-15 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by John Hugg, Ryan Betts

IoT SaaS Data Streaming data data-engineering real-time-analytics streaming-messaging

The need for fast data applications is growing rapidly, driven by the IoT, the surge in machine-to-machine (M2M) data, global mobile device proliferation, and the monetization of SaaS platforms. So how do you combine real-time, streaming analytics with real-time decisions in an architecture that’s reliable, scalable, and simple? In this O’Reilly report, Ryan Betts and John Hugg from VoltDB examine ways to develop apps for fast data, using pre-defined patterns. These patterns are general enough to suit both the do-it-yourself, hybrid batch/streaming approach and the simpler, proven in-memory approach available with certain fast database offerings. Their goal is to create a collection of fast data app development recipes. We welcome your contributions, which will be tested and included in future editions of this report.

Hadoop with Python

2015-10-15 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Donald Miner, Zach Radtka

API Data Science Hadoop HDFS Java Luigi PySpark Python Spark data data-engineering

Hadoop is mostly written in Java, but that doesn't exclude the use of other programming languages with this distributed storage and processing framework, particularly Python. With this concise book, you’ll learn how to use Python with the Hadoop Distributed File System (HDFS), MapReduce, the Apache Pig platform and Pig Latin script, and the Apache Spark cluster-computing framework. Authors Zachary Radtka and Donald Miner from the data science firm Miner & Kasch take you through the basic concepts behind Hadoop, MapReduce, Pig, and Spark. Then, through multiple examples and use cases, you'll learn how to work with these technologies by applying various Python tools. Use the Python library Snakebite to access HDFS programmatically from within Python applications. Write MapReduce jobs in Python with mrjob, the Python MapReduce library. Extend Pig Latin with user-defined functions (UDFs) in Python. Use the Spark Python API (PySpark) to write Spark programs with Python. Learn how to use the Luigi Python workflow scheduler to manage MapReduce jobs and Pig scripts.
Zachary Radtka, a platform engineer at Miner & Kasch, has extensive experience creating custom analytics that run on petabyte-scale data sets.

IBM Software for SAP Solutions

2015-09-29 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Khirallah Birkler, Navneet Goyal, Peter Bahrs, Nick Norris, Michel Laaroussi, Michael Love, Bernd Eberhardt, Jörg Stolzenberg, Andrew Stalnecker, Derek Jennings, Stefan Momma, Manfred Oevers, Yaro Dunchych, Joe Kaczmarek, Martin Oberhofer, James Hunter, Paul Pacholski, Pierre Valiquette

BI Data Management DevOps IBM Master Data Management SAP Cyber Security data data-engineering

SAP is a market leader in enterprise business application software. SAP solutions provide a rich set of composable application modules, and configurable functional capabilities that are expected from a comprehensive enterprise business application software suite. In most cases, companies that adopt SAP software remain heterogeneous enterprises running both SAP and non-SAP systems to support their business processes. Regardless of the specific scenario, in heterogeneous enterprises most SAP implementations must be integrated with a variety of non-SAP enterprise systems: portals; messaging infrastructure; business process management (BPM) tools; Enterprise Content Management (ECM) methods and tools; business analytics (BA) and business intelligence (BI) technologies; security; systems of record; and systems of engagement. The tooling included with SAP software addresses many needs for creating SAP-centric environments. However, the classic approach to implementing SAP functionality generally leaves the business with a rigid solution that is difficult and expensive to change and enhance. When SAP software is used in a large, heterogeneous enterprise environment, SAP clients face the dilemma of selecting the correct set of tools and platforms to implement SAP functionality, and to integrate the SAP solutions with non-SAP systems. This IBM® Redbooks® publication explains the value of integrating IBM software with SAP solutions. It describes how to enhance and extend pre-built capabilities in SAP software with best-in-class IBM enterprise software, enabling clients to maximize return on investment (ROI) in their SAP investment and achieve a balanced enterprise architecture approach. This book describes IBM Reference Architecture for SAP, a prescriptive blueprint for using IBM software in SAP solutions. The reference architecture is focused on defining the use of IBM software with SAP, and is not intended to address the internal aspects of SAP components.
The chapters of this book provide a specific reference architecture for many of the architectural domains that are each important for a large enterprise to establish common strategy, efficiency, and balance. The majority of the most important architectural domain topics, such as integration, process optimization, master data management, mobile access, Enterprise Content Management, business intelligence, DevOps, security, and systems monitoring, are covered in the book. However, there are several other architectural domains that are not included in the book. This is not to imply that these other architectural domains are not important or are less important, or that IBM does not offer a solution to address them. It is only reflective of time constraints, available resources, and the complexity of assembling a book on an extremely broad topic. Although more content could have been added, the authors feel confident that the scope of architectural material that has been included should provide organizations with a fantastic head start in defining their own enterprise reference architecture for many of the important architectural domains, and it is hoped that this book provides great value to those reading it. This IBM Redbooks publication is targeted to the following audiences: client decision makers and solution architects who lead enterprise transformation projects and want to gain further insight so that they can benefit from the integration of IBM software in large-scale SAP projects; and IT architects and consultants integrating IBM technology with SAP solutions.

Managing Ever-Increasing Amounts of Data with IBM DB2 for z/OS: Using Temporal Data Management, Archive Transparency, and the DB2 Analytics Accelerator

2015-09-25 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Craig McKellar , Mehmet Cuneyt Goksu , Xiao Hui Wang , Claire McFeely

Data Management IBM data data-engineering ibm-db2 relational-databases

IBM® DB2® Version 11.1 for z/OS® (DB2 11 for z/OS or just DB2 11 throughout this book) is the fifteenth release of DB2 for IBM MVS™. The DB2 11 environment is available either for new installations of DB2 or for migrations from DB2 10 for z/OS subsystems only. This IBM Redbooks® publication describes enhancements that are available with DB2 11 for z/OS. The contents help database administrators to understand the new extensions and performance enhancements, to plan for ways to use the key new capabilities, and to justify the investment in installing or migrating to DB2 11. Businesses are faced with a global and increasingly competitive business environment, and they need to collect and analyze ever-increasing amounts of data (Figure 1). Governments also need to collect and analyze large amounts of data. The main focus of this book is to introduce recent DB2 capability that can be used to address the challenges that organizations face in storing and analyzing exploding amounts of business or organizational data, while managing risk and trying to meet new regulatory and compliance requirements. This book describes recent extensions to DB2 for z/OS in V10 and V11 that can help organizations address these challenges.

Getting Data Right

2015-09-15 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Shannon Cutt

Big Data Data Analytics Data Science DWH ETL/ELT data data-engineering

Over the last 20 years, companies have invested roughly $3-4 trillion in enterprise software. These investments have been primarily focused on the development and deployment of single systems, applications, functions, and geographies targeted at the automation and optimization of key business processes. Companies are now investing heavily in big data analytics ($44 billion alone in 2014) in an effort to begin analyzing all of the data being generated from their process automation systems. But companies are quickly realizing that one of their key bottlenecks is Data Variety—the silo’d nature of the data that is a natural result of internal and external source proliferation. The problem of big data variety has crept up from the bottom—and the cost of variety is only appreciated when companies attempt to ask simple questions across many business silos (divisions, geographies, functions, etc.). Current top-down, deterministic data unification approaches (such as ETL, ELT, and MDM) were simply not designed to scale to the variety of hundreds or thousands or even tens of thousands of data silos. Download this free eBook to learn about the fundamental challenges that Data Variety poses to enterprises looking to maximize the value of their existing investments—and how new approaches promise to help organizations embrace and leverage the fundamental diversity of data. Readers will also find best practices for designing bottom-up and probabilistic methods for finding and managing data; principles for doing data science at scale in the big data era; preparing and unifying data in ways that complement existing systems; optimizing data warehousing; and how to use “data ops” to automate large-scale integration.

Apache Spark Graph Processing

2015-09-10 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Rindra Ramamonjison

API Big Data Scala Spark apache-spark data data-engineering

Dive into the world of large-scale graph data processing with Apache Spark's GraphX API. This book introduces you to the core concepts of graph analytics and teaches you how to leverage Spark for handling and analyzing massive graphs. From building to analyzing, you'll acquire a comprehensive skillset to work with graph data efficiently.

What this Book will help me do
Learn to utilize Apache Spark GraphX API to process and analyze graph data.
Master transforming raw datasets into sophisticated graph structures.
Explore visualization and analysis techniques for understanding graphs.
Understand and build custom graph operations tailored to your needs.
Implement advanced graph algorithms like clustering and iterative processing.

Author(s)
Rindra Ramamonjison is a seasoned data engineer with vast experience in big data technologies and graph processing. With a passion for explaining complex concepts in simple terms, Rindra builds on his professional expertise to guide readers in mastering cutting-edge Spark tools.

Who is it for?
This book is tailored for data scientists and software developers looking to delve into graph data processing at scale. Ideal for those with basic knowledge of Scala and Apache Spark, it equips readers with the tools and techniques to derive insights from complex network datasets. Whether you're diving deeper into big data or exploring graph-specific analytics, this book is your guide.

Redis Essentials

2015-09-08 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Maxwell Dayvson da Silva

JavaScript Python Redis Cyber Security data data-engineering nosql-databases

Redis Essentials is your go-to guide for understanding and mastering Redis, the leading in-memory data structure store. In this book, you will explore the powerful features offered by Redis, such as real-time data processing, highly scalable architectures, and practical implementations for web applications. You'll complete the journey equipped to handle and optimize Redis for your development projects.

What this Book will help me do
Design analytics applications with advanced data structures like Bitmaps and HyperLogLogs.
Scale your application infrastructure using Redis Sentinel, Twemproxy, and Redis Cluster.
Develop custom Redis commands and extend its functionality with the Lua scripting language.
Implement robust security measures for Redis, including SSL encryption and firewall rules.
Master the usage of Redis client libraries in PHP, Python, Node.js, and Ruby for seamless development.

Author(s)
Maxwell Dayvson da Silva is an experienced software engineer and author with expertise in designing high-performance systems. With a strong focus on practical knowledge and hands-on solutions, Maxwell brings over a decade of experience using Redis to this book. His approachable teaching style ensures learners grasp complex topics easily while emphasizing their practical application to real-world challenges.

Who is it for?
Redis Essentials is aimed at developers looking to enhance their system's performance and scalability using Redis. Whether you're moderately familiar with key-value stores or new to Redis, this book will provide the explanations and hands-on examples you need. Recommended for developers with experience in data architectures, the book bridges the gap between understanding Redis features and their real-world application. Start here to bring high-performance in-memory data solutions to your projects.

The Architecture of Privacy

2015-09-01 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by John K Grant , Daniel Slate , Ari Gesher , Elissa Lerner , Courtney Bowman

Data Analytics Cyber Security data data-engineering data-security-privacy data security & privacy

Technology’s influence on privacy not only concerns consumers, political leaders, and advocacy groups, but also the software architects who design new products. In this practical guide, experts in data analytics, software engineering, security, and privacy policy describe how software teams can make privacy-protective features a core part of product functionality, rather than add them late in the development process. Ideal for software engineers new to privacy, this book helps you examine privacy-protective information management architectures and their foundational components—building blocks that you can combine in many ways. Policymakers, academics, students, and advocates unfamiliar with the technical terrain will learn how these tools can help drive policies to maximize privacy protection.
