talk-data.com

Topic: Analytics

Tags: data_analysis, insights, metrics

395 tagged activities

Activity Trend: peak of 398 per quarter (2020-Q1 to 2026-Q1)

Activities

Showing filtered results

Filtering by: O'Reilly Data Engineering Books
Spark in Action

Spark in Action teaches you the theory and skills you need to effectively handle batch and streaming data using Spark. Fully updated for Spark 2.0.

About the Technology: Big data systems distribute datasets across clusters of machines, making it a challenge to efficiently query, stream, and interpret them. Spark can help. It is a processing system designed specifically for distributed data. It provides easy-to-use interfaces, along with the performance you need for production-quality analytics and machine learning. Spark 2 also adds improved programming APIs, better performance, and countless other upgrades.

About the Book: Spark in Action teaches you the theory and skills you need to effectively handle batch and streaming data using Spark. You'll get comfortable with the Spark CLI as you work through a few introductory examples. Then, you'll start programming Spark using its core APIs. Along the way, you'll work with structured data using Spark SQL, process near-real-time streaming data, apply machine learning algorithms, and munge graph data using Spark GraphX. For a zero-effort startup, you can download the preconfigured virtual machine ready for you to try the book's code.

What's Inside:
- Updated for Spark 2.0
- Real-life case studies
- Spark DevOps with Docker
- Examples in Scala, and online in Java and Python

About the Reader: Written for experienced programmers with some background in big data or machine learning.

About the Authors: Petar Zečević and Marko Bonaći are seasoned developers heavily involved in the Spark community.

Quotes:
- "Dig in and get your hands dirty with one of the hottest data processing engines today. A great guide." - Jonathan Sharley, Pandora Media
- "Must-have! Speed up your learning of Spark as a distributed computing framework." - Robert Ormandi, Yahoo!
- "An easy-to-follow, step-by-step guide." - Gaurav Bhardwaj, 3Pillar Global
- "An ambitiously comprehensive overview of Spark and its diverse ecosystem." - Jonathan Miller, Optensity
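
The book's own examples are written in Scala (with Java and Python versions online). As a rough illustration of the batch-plus-SQL workflow the blurb describes, here is a minimal PySpark sketch; it is not taken from the book, and the sample data and column names are invented.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-sketch").getOrCreate()

# Hypothetical sample rows standing in for a larger batch dataset.
events = spark.createDataFrame(
    [("2016-01-01", "click", 3), ("2016-01-01", "view", 7), ("2016-01-02", "click", 5)],
    ["day", "kind", "hits"],
)

# Register the DataFrame as a temporary view and query it with Spark SQL.
events.createOrReplaceTempView("events")
spark.sql("SELECT day, SUM(hits) AS total FROM events GROUP BY day ORDER BY day").show()

spark.stop()
```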

Fast Data Processing with Spark 2 - Third Edition

Fast Data Processing with Spark 2 takes you through the essentials of leveraging Spark for big data analysis. You will learn how to install and set up Spark, handle data using its APIs, and apply advanced functionality like machine learning and graph processing. By the end of the book, you will be well-equipped to use Spark in real-world data processing tasks.

What this Book will help me do:
- Install and configure Apache Spark for optimal performance.
- Interact with distributed datasets using the resilient distributed dataset (RDD) API.
- Leverage the flexibility of the DataFrame API for efficient big data analytics.
- Apply machine learning models using Spark MLlib to solve complex problems.
- Perform graph analysis using GraphX to uncover structural insights in data.

Author(s): Krishna Sankar is an experienced data scientist and thought leader in big data technologies. With a deep understanding of machine learning, distributed systems, and Apache Spark, Krishna has guided numerous projects in data engineering and big data processing. Matei Zaharia, the co-author, is also widely recognized in the field of distributed systems and cloud computing, contributing to Apache Spark development.

Who is it for? This book is written for software developers and data engineers with a foundational understanding of Scala or Java programming. A beginner-to-intermediate understanding of big data processing concepts is recommended. If you aspire to solve big data problems using scalable distributed computing frameworks, this book is for you. By the end, you will be confident in building Spark-powered applications and analyzing data efficiently.
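
To make the contrast between the RDD and DataFrame APIs mentioned above concrete, here is a small hedged sketch in PySpark (the book itself targets Scala and Java readers); the data is made up and a local Spark 2.x session is assumed.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()
sc = spark.sparkContext

# RDD API: functional transformations over a distributed collection.
rdd = sc.parallelize([("spark", 2), ("hadoop", 1), ("spark", 3)])
totals_rdd = rdd.reduceByKey(lambda a, b: a + b).collect()

# DataFrame API: the same aggregation expressed declaratively.
df = spark.createDataFrame(rdd, ["word", "count"])
totals_df = df.groupBy("word").sum("count").collect()

print(totals_rdd, totals_df)
spark.stop()
```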

In-Place Analytics with Live Enterprise Data with IBM DB2 Query Management Facility

IBM® DB2® Query Management Facility™ for z/OS® provides a zero-footprint, mobile-enabled, highly secure business analytics solution. IBM QMF™ V11.2.1 offers many significant new features and functions in keeping with the ongoing effort to broaden its usage and value to a wider set of users and business areas. In this IBM Redbooks® publication, we explore several of the new features and options that are available within this new release. This publication introduces TSO enhancements for QMF Analytics for TSO and QMF Enhanced Editor. A chapter describes how the QMF Data Service component connects to multiple mainframe data sources to accomplish the consolidation and delivery of data. This publication describes how self-service business intelligence can be achieved by using QMF Vision to enable self-service dashboards and data exploration. A chapter is dedicated to JavaScript support, demonstrating how application developers can use JavaScript to extend the capabilities of QMF. Additionally, this book describes methods to take advantage of caching for reduced CPU consumption, wider access to information, and faster performance. This publication is of interest to anyone who wants to better understand how QMF can enable in-place analytics with live enterprise data.

VersaStack Solution by Cisco and IBM with Oracle RAC, IBM FlashSystem V9000, and IBM Spectrum Protect

Dynamic organizations want to accelerate growth while reducing costs. To do so, they must speed the deployment of business applications and adapt quickly to any changes in priorities. Organizations today require an IT infrastructure that is easy, efficient, and versatile. The VersaStack solution by Cisco and IBM® can help you accelerate the deployment of your data centers. It reduces costs by more efficiently managing information and resources while maintaining your ability to adapt to business change. The VersaStack solution combines the innovation of Cisco UCS Integrated Infrastructure with the efficiency of the IBM Storwize® storage system. The Cisco UCS Integrated Infrastructure includes the Cisco Unified Computing System (Cisco UCS), Cisco Nexus and Cisco MDS switches, and Cisco UCS Director. The IBM FlashSystem® V9000 enhances virtual environments with its Data Virtualization, IBM Real-time Compression™, and IBM Easy Tier® features. These features deliver extraordinary levels of performance and efficiency. The VersaStack solution is Cisco Application Centric Infrastructure (ACI) ready. Your IT team can build, deploy, secure, and maintain applications through a more agile framework. Cisco Intercloud Fabric capabilities help enable the creation of open and highly secure solutions for the hybrid cloud. These solutions accelerate your IT transformation while delivering dramatic improvements in operational efficiency and simplicity. Cisco and IBM are global leaders in the IT industry. The VersaStack solution gives you the opportunity to take advantage of integrated infrastructure solutions that are targeted at enterprise applications, analytics, and cloud solutions. The VersaStack solution is backed by Cisco Validated Designs (CVD) to provide faster delivery of applications, greater IT efficiency, and less risk. This IBM Redbooks® publication is aimed at experienced storage administrators who are tasked with deploying a VersaStack solution with Oracle Real Application Clusters (RAC) and IBM Spectrum™ Protect.

Spark for Data Science

Explore how to leverage Apache Spark for efficient big data analytics and machine learning solutions in "Spark for Data Science". This detailed guide provides you with the skills to process massive datasets, perform data analytics, and build predictive models using Spark's powerful tools like RDDs, DataFrames, and Datasets.

What this Book will help me do:
- Gain expertise in data processing and transformation with Spark.
- Perform advanced statistical analysis to uncover insights.
- Master machine learning techniques to create predictive models using Spark.
- Utilize Spark's APIs to process and visualize big data.
- Build scalable and efficient data science solutions.

Author(s): This book is co-authored by Singhal and Duvvuri, both accomplished data scientists with extensive experience in Apache Spark and big data technologies. They bring their practical industry expertise to explain complex topics in a straightforward manner. Their writing emphasizes real-world applications and step-by-step procedural guidance, making this a valuable resource for learners.

Who is it for? This book is ideally suited for technologists seeking to incorporate data science capabilities into their work with Apache Spark, data scientists interested in machine learning algorithms implemented in Spark, and beginners aiming to step into the field of big data analytics. Whether you are familiar with Spark or completely new, this book offers valuable insights and practical knowledge.
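
As a hedged illustration of the predictive-modelling workflow the blurb describes, the sketch below uses Spark's DataFrame-based ML API. The toy data, column names, and the choice of logistic regression are this summary's assumptions, not the book's own example.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("spark-ml-sketch").getOrCreate()

# Hypothetical labelled data with two numeric features.
df = spark.createDataFrame(
    [(1.0, 2.0, 0.0), (2.0, 1.0, 0.0), (5.0, 7.0, 1.0), (6.0, 8.0, 1.0)],
    ["x1", "x2", "label"],
)

# Assemble feature columns into a single vector, then fit a classifier.
features = VectorAssembler(inputCols=["x1", "x2"], outputCol="features").transform(df)
model = LogisticRegression(featuresCol="features", labelCol="label").fit(features)
model.transform(features).select("label", "prediction").show()

spark.stop()
```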

Big Data Analytics

Dive into the world of big data with "Big Data Analytics: Real Time Analytics Using Apache Spark and Hadoop." This comprehensive guide introduces readers to the fundamentals and practical applications of Apache Spark and Hadoop, covering essential topics like Spark SQL, DataFrames, structured streaming, and more. Learn how to harness the power of real-time analytics and big data tools effectively.

What this Book will help me do:
- Master the key components of the Apache Spark and Hadoop ecosystems, including Spark SQL and MapReduce.
- Gain an understanding of DataFrames, Datasets, and structured streaming for seamless data handling.
- Develop skills in real-time analytics using Spark Streaming and technologies like Kafka and HBase.
- Learn to implement machine learning models using Spark's MLlib and ML Pipelines.
- Explore graph analytics with GraphX and leverage data visualization tools like Jupyter and Zeppelin.

Author(s): Venkat Ankam, an expert in big data technologies, has years of experience working with Apache Hadoop and Spark. As an educator and technical consultant, Venkat has enabled numerous professionals to gain critical insights into big data ecosystems. With a pragmatic approach, his writings aim to guide readers through complex systems in a structured and easy-to-follow manner.

Who is it for? This book is perfect for data analysts, data scientists, software architects, and programmers aiming to expand their knowledge of big data analytics. Readers should ideally have a basic programming background in languages like Python, Scala, R, or SQL. Prior hands-on experience with big data environments is not necessary but is an added advantage. This guide is created to cater to a range of skill levels, from beginners to intermediate learners.
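
The structured streaming model mentioned above can be sketched in a few lines of PySpark. This is a minimal, hedged example assuming Spark 2.x and a local socket source (for instance `nc -lk 9999`); the Kafka and HBase integrations the book covers are omitted here.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("structured-streaming-sketch").getOrCreate()

# Read an unbounded stream of text lines from a local socket.
lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())

# Split each line into words and keep a running count per word.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Emit the updated counts to the console after every micro-batch.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```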

Big Data War

This book focuses mainly on why data analytics fails in business. It provides an objective analysis of the root causes of the phenomenon, rather than abstract criticism of the utility of data analytics. The author then explains in detail how companies can survive and win the global big data competition, based on actual company cases. Having established an execution- and performance-oriented big data methodology over more than 10 years of field experience as an authority on big data strategy, the author identifies core principles of data analytics through case analysis of the failures and successes of actual companies. Moreover, he endeavors to share with readers the principles behind how innovative global companies became successful through the use of big data. This book is a quintessential guide to big data analytics, in which the author's know-how from direct and indirect experience is condensed. How do we survive this big data war, in which Facebook in social networking, Amazon in e-commerce, and Google in search expand their platforms into other areas from their respective distinct markets? The answer can be found in this book.

IBM Data Engine for Hadoop and Spark

This IBM® Redbooks® publication provides topics to help the technical community take advantage of the resilience, scalability, and performance of the IBM Power Systems™ platform to implement or integrate an IBM Data Engine for Hadoop and Spark solution for analytics workloads that access, manage, and analyze data sets to improve business outcomes. This book documents topics that demonstrate and take advantage of the analytics strengths of the IBM POWER8® platform, the IBM analytics software portfolio, and selected third-party tools to help solve customers' data analytics workload requirements. This book describes how to plan, prepare, install, integrate, manage, and use the IBM Data Engine for Hadoop and Spark solution to run analytic workloads on IBM POWER8. In addition, this publication delivers documentation to complement available IBM analytics solutions and help meet your data analytics needs. This publication strengthens the position of IBM analytics and big data solutions with a well-defined and documented deployment model within an IBM POWER8 virtualized environment, giving customers a planned foundation for security, scaling, capacity, resilience, and optimization for analytics workloads. This book is targeted at technical professionals (analytics consultants, technical support staff, IT Architects, and IT Specialists) who are responsible for delivering analytics solutions and support on IBM Power Systems.

Real World SQL and PL/SQL: Advice from the Experts

Master the Underutilized Advanced Features of SQL and PL/SQL. This hands-on guide from Oracle Press shows how to fully exploit lesser known but extremely useful SQL and PL/SQL features, and how to effectively use both languages together. Written by a team of Oracle ACE Directors, Real-World SQL and PL/SQL: Advice from the Experts features best practices, detailed examples, and insider tips that clearly demonstrate how to write, troubleshoot, and implement code for a wide variety of practical applications. The book thoroughly explains underutilized SQL and PL/SQL functions and lays out essential development strategies. Data modeling, advanced analytics, database security, secure coding, and administration are covered in complete detail.

Learn how to:
• Apply advanced SQL and PL/SQL tools and techniques
• Understand SQL and PL/SQL functionality and determine when to use which language
• Develop accurate data models and implement business logic
• Run PL/SQL in SQL and integrate complex datasets
• Handle PL/SQL instrumenting and profiling
• Use Oracle Advanced Analytics and Oracle R Enterprise
• Build and execute predictive queries
• Secure your data using encryption, hashing, redaction, and masking
• Defend against SQL injection and other code-based attacks
• Work with Oracle Virtual Private Database

Code examples in the book are available for download at www.MHProfessional.com. For a complete list of Oracle Press titles, visit www.OraclePressBooks.com.

Architecting for Access

Fragmented, disparate backend data systems have become the norm in today’s enterprise, where you’ll find a mix of relational databases, Hadoop stores, and NoSQL engines, with access and analytics tools bolted on every which way. This mishmash of options presents a real challenge when it comes to choosing frontend analytics and visualization tools. How did we get here? In this O’Reilly report, IT veteran Rich Morrow takes you through the rapid changes to both backend storage and frontend analytics over the past decade, and provides a pragmatic list of requirements for an analytics stack that will centralize access to all of these data systems. You’ll examine current analytics platforms, including Looker—a new breed of analytics and visualization tools built specifically to handle our fragmented data space.

- Understand why and how data became so fractured so quickly
- Explore the tangled web of data and backend tools in today’s enterprises
- Learn the tool requirements for accessing and analyzing the full spectrum of data
- Examine the relative strengths of popular analytics and visualization tools, including Looker, Tableau, and MicroStrategy
- Inspect Looker’s unique focus on both the frontend and backend

Interactive Spark using PySpark

Apache Spark is an in-memory framework that allows data scientists to explore and interact with big data much more quickly than with Hadoop. Python users can work with Spark using an interactive shell called PySpark.

Why is it important? PySpark makes the large-scale data processing capabilities of Apache Spark accessible to data scientists who are more familiar with Python than Scala or Java. This also allows for reuse of a wide variety of Python libraries for machine learning, data visualization, numerical analysis, etc.

What you'll learn—and how you can apply it:
- Compare the different components provided by Spark, and what use cases they fit.
- Learn how to use RDDs (resilient distributed datasets) with PySpark.
- Write Spark applications in Python and submit them to the cluster as Spark jobs.
- Get an introduction to the Spark computing framework.
- Apply this approach to a worked example to determine the most frequent airline delays in a specific month and year.

This lesson is for you because…
- You're a data scientist, familiar with Python coding, who needs to get up and running with PySpark.
- You're a Python developer who needs to leverage the distributed computing resources available on a Hadoop cluster, without learning Java or Scala first.

Prerequisites:
- Familiarity with writing Python applications
- Some familiarity with bash command-line operations
- Basic understanding of how to use simple functional programming constructs in Python, such as closures, lambdas, maps, etc.

Materials or downloads needed in advance: Apache Spark

This lesson is taken from Data Analytics with Hadoop by Jenny Kim and Benjamin Bengfort.
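
Here is a hedged sketch of the kind of airline-delay analysis described above, using the RDD API from PySpark. The file path and column layout are assumptions made for illustration, not the lesson's actual dataset, and in the interactive PySpark shell `sc` is already provided for you.

```python
from pyspark import SparkContext

sc = SparkContext(appName="airline-delays")

# Hypothetical headerless CSV with columns: year, month, carrier, delay_minutes
lines = sc.textFile("hdfs:///data/flights.csv")
records = lines.map(lambda line: line.split(","))

# Keep only January 2016 flights that were actually delayed.
delayed = records.filter(lambda r: r[0] == "2016" and r[1] == "1" and float(r[3]) > 0)

# Count delays per carrier and take the most frequent offenders.
top = (delayed.map(lambda r: (r[2], 1))
              .reduceByKey(lambda a, b: a + b)
              .takeOrdered(10, key=lambda kv: -kv[1]))
print(top)
```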

Enabling Real-time Analytics on IBM z Systems Platform

Regarding online transaction processing (OLTP) workloads, the IBM® z Systems™ platform, with IBM DB2®, data sharing, Workload Manager (WLM), geoplex, and other high-end features, is the widely acknowledged leader. Most customers now integrate business analytics with OLTP by running, for example, scoring functions from transactional context for real-time analytics or by applying machine-learning algorithms on enterprise data that is kept on the mainframe. As a result, IBM adds investment so clients can keep the complete lifecycle for data analysis, modeling, and scoring under z Systems control in a cost-efficient way, preserving the qualities of service in availability, security, and reliability that z Systems solutions offer. Because of the changed architecture and tighter integration, IBM has shown, in a customer proof-of-concept, that a particular client was able to achieve an orders-of-magnitude improvement in performance, allowing that client’s data scientist to investigate the data in a more interactive process. Open technologies, such as the Predictive Model Markup Language (PMML), can help customers update single components instead of being forced to replace everything at once. As a result, you can combine your preferred tool for model generation (such as SAS Enterprise Miner or IBM SPSS® Modeler) with a different technology for model scoring (such as Zementis, a company focused on PMML scoring). IBM SPSS Modeler is a leading data mining workbench that can apply various algorithms in data preparation, cleansing, statistics, visualization, machine learning, and predictive analytics. It draws on over 20 years of experience and continued development, and is integrated with z Systems. With IBM DB2 Analytics Accelerator 5.1 and SPSS Modeler 17.1, it is possible to perform complete predictive model creation, including data transformation, within DB2 Analytics Accelerator. So, instead of moving the data to a distributed environment, algorithms can be pushed to the data, using the cost-efficient DB2 Analytics Accelerator for the required resource-intensive operations. This IBM Redbooks® publication explains the overall z Systems architecture, how the components can be installed and customized, how the new IBM DB2 Analytics Accelerator loader can enable efficient data loading for z Systems data and external data, how in-database transformation, in-database modeling, and in-transactional real-time scoring can be used, and what other related technologies are available. This book is intended for technical specialists, architects, and data scientists who want to use the technology on the z Systems platform. Most of the technologies described in this book require IBM DB2 for z/OS®. For acceleration of the data investigation, data transformation, and data modeling process, DB2 Analytics Accelerator is required. Most value can be achieved if most of the data already resides on z Systems platforms, although adding external data (for example, from social sources) poses no problem at all.

Implementing an IBM High-Performance Computing Solution on IBM Power System S822LC

This IBM® Redbooks® publication demonstrates and documents that IBM Power Systems™ high-performance computing and technical computing solutions deliver faster time to value with powerful solutions. Configurable into highly scalable Linux clusters, Power Systems offer extreme performance for demanding workloads such as genomics, finance, computational chemistry, oil and gas exploration, and high-performance data analytics. This book delivers a high-performance computing solution implemented on the IBM Power System S822LC. The solution delivers high application performance and throughput based on its built-for-big-data architecture that incorporates IBM POWER8® processors, tightly coupled Field Programmable Gate Arrays (FPGAs) and accelerators, and faster I/O by using Coherent Accelerator Processor Interface (CAPI). This solution is ideal for clients that need more processing power while simultaneously increasing workload density and reducing datacenter floor space requirements. The Power S822LC offers a modular design to scale from a single rack to hundreds, simplicity of ordering, and a strong innovation roadmap for graphics processing units (GPUs). This publication is targeted toward technical professionals (consultants, technical support staff, IT Architects, and IT Specialists) responsible for delivering cost effective high-performance computing (HPC) solutions that help uncover insights from their data so they can optimize business results, product development, and scientific discoveries

IBM Netcool Operations Insight: A Scenarios Guide

IBM® Netcool® Operations Insight empowers your IT operations to use real-time and historical analytics to identify, isolate, and resolve problems before they affect your business. Powered by IBM Tivoli® Netcool/OMNIbus and the transformative capabilities of cognitive analytics, Netcool Operations Insight consolidates millions of alerts from across local, cloud, and hybrid environments into a few actionable problems. This IBM Redbooks® publication gives a broad understanding of Netcool Operations Insight and describes several scenarios that show the capabilities of this solution in a real-life environment. Each scenario features a different capability of Netcool Operations Insight. The scenarios are documented by using step-by-step figures with explanations to make them easier to implement in your own environment. The scenarios in this book are broken into the following categories: - Network Management-related scenarios - Network Event and cognitive-related scenarios - Network Event-related scenarios The target audience of this book is network specialists, network administrators, and network operators.

Perspectives on Data Science for Software Engineering

Perspectives on Data Science for Software Engineering presents the best practices of seasoned data miners in software engineering. The idea for this book was created during the 2014 conference at Dagstuhl, an invitation-only gathering of leading computer scientists who meet to identify and discuss cutting-edge informatics topics. At the 2014 conference, the question of how to transfer the knowledge of seasoned software engineers and data scientists to newcomers in the field dominated many discussions. While there are many books covering data mining and software engineering basics, they present only the fundamentals and lack the perspective that comes from real-world experience. This book offers unique insights into the wisdom of the community’s leaders gathered to share hard-won lessons from the trenches. Ideas are presented in digestible chapters designed to be applicable across many domains. Topics covered include data collection, data sharing, data mining, and how to utilize these techniques in successful software projects. Newcomers to software engineering data science will learn the tips and tricks of the trade, while more experienced data scientists will benefit from war stories that show what traps to avoid.

- Presents the wisdom of community experts, derived from a summit on software analytics
- Provides contributed chapters that share discrete ideas and techniques from the trenches
- Covers top areas of concern, including mining security and social data, data visualization, and cloud-based data
- Presented in clear chapters designed to be applicable across many domains

Introducing Microsoft SQL Server 2016: Mission-Critical Applications, Deeper Insights, Hyperscale Cloud

With Microsoft SQL Server 2016, a variety of new features and enhancements to the data platform deliver breakthrough performance, advanced security, and richer, integrated reporting and analytics capabilities. In this ebook, we introduce new security features: Always Encrypted, Row-Level Security, and dynamic data masking; discuss enhancements that enable you to better manage performance and storage: TempDB configuration, Query Store, and Stretch Database; review several improvements to Reporting Services; and also describe AlwaysOn Availability Groups, tabular enhancements, and R integration.

Relevant Search

Relevant Search demystifies relevance work. Using Elasticsearch, it teaches you how to return engaging search results to your users, helping you understand and leverage the internals of Lucene-based search engines.

About the Technology: Users are accustomed to and expect instant, relevant search results. To achieve this, you must master the search engine. Yet for many developers, relevance ranking is mysterious or confusing.

About the Book: Relevant Search demystifies the subject and shows you that a search engine is a programmable relevance framework. You'll learn how to apply Elasticsearch or Solr to your business's unique ranking problems. The book demonstrates how to program relevance and how to incorporate secondary data sources, taxonomies, text analytics, and personalization. In practice, a relevance framework requires softer skills as well, such as collaborating with stakeholders to discover the right relevance requirements for your business. By the end, you'll be able to achieve a virtuous cycle of provable, measurable relevance improvements over a search product's lifetime.

What's Inside:
- Techniques for debugging relevance
- Applying search engine features to real problems
- Using the user interface to guide searchers
- A systematic approach to relevance
- A business culture focused on improving search

About the Reader: For developers trying to build smarter search with Elasticsearch or Solr.

About the Authors: Doug Turnbull is lead relevance consultant at OpenSource Connections, where he frequently speaks and blogs. John Berryman is a data engineer at Eventbrite, where he specializes in recommendations and search.

Quotes:
- "One of the best and most engaging technical books I’ve ever read." - From the Foreword by Trey Grainger, Author of "Solr in Action"
- "Will help you solve real-world search relevance problems for Lucene-based search engines." - Dimitrios Kouzis-Loukas, Bloomberg L.P.
- "An inspiring book revealing the essence and mechanics of relevant search." - Ursin Stauss, Swiss Post
- "Arms you with invaluable knowledge to temper the relevancy of search results and harness the powerful features provided by modern search engines." - Russ Cam, Elastic
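
As a small illustration of treating the search engine as a programmable relevance framework, here is a hedged sketch using the elasticsearch-py client: a multi_match query that boosts title matches over description matches. The index name, fields, boost value, and local node address are assumptions for illustration only, not examples from the book.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

query = {
    "query": {
        "multi_match": {
            "query": "portable espresso maker",
            # "title^3" weights title matches three times as heavily as description matches.
            "fields": ["title^3", "description"],
        }
    }
}

results = es.search(index="products", body=query)
for hit in results["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))
```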

Ambient Computing

Consider this scenario: You walk into a building and a sensor identifies you through your mobile phone. You then receive a welcoming text telling you when lunch will be served, or perhaps a health warning based on allergy information you’ve stored in your profile. Maybe you’ll be flagged as a security threat. How is that possible? This O’Reilly report explores ambient computing—hands-free, 24/7 wireless connectivity to hardware, data, and IT systems. Enabling that scenario requires a lot of work behind the scenes to determine network connectivity, device security, and personal privacy. With an ambient-computing technology stack already in the works, resolving those issues is only a matter of time. Through interviews with front-line tech pioneers—including Ari Gesher (Kairos Aerospace) and Matthew Gast (Aerohive Networks)—author Mike Barlow explores how real-time analytics can enable real-time decision making. How will simple beacons broadcast information to your phone as you pass businesses on your morning walk? How can emotional speech analysis monitor the emotional state of employees, students, or people in crowds? Pick up this report and find out.

IBM z13s Technical Guide

Digital business has been driving the transformation of underlying information technology (IT) infrastructure to be more efficient, secure, adaptive, and integrated. IT must be able to handle the explosive growth of mobile clients and employees. It also must be able to process enormous amounts of data to provide deep and real-time insights to help achieve the greatest business impact. This IBM® Redbooks® publication addresses the new IBM z Systems™ single frame, the IBM z13s server. IBM z Systems servers are the trusted enterprise platform for integrating data, transactions, and insight. A data-centric infrastructure must always be available with a 99.999% or better availability, have flawless data integrity, and be secured from misuse. It needs to be an integrated infrastructure that can support new applications. It also needs to have integrated capabilities that can provide new mobile capabilities with real-time analytics delivered by a secure cloud infrastructure. IBM z13s servers are designed with improved scalability, performance, security, resiliency, availability, and virtualization. The superscalar design allows z13s servers to deliver a record level of capacity over the prior single frame z Systems server. In its maximum configuration, the z13s server is powered by up to 20 client characterizable microprocessors (cores) running at 4.3 GHz. This configuration can run more than 18,000 millions of instructions per second (MIPS) and up to 4 TB of client memory. The IBM z13s Model N20 is estimated to provide up to 100% more total system capacity than the IBM zEnterprise® BC12 Model H13. This book provides information about the IBM z13s server and its functions, features, and associated software support. Greater detail is offered in areas relevant to technical planning. It is intended for systems engineers, consultants, planners, and anyone who wants to understand the IBM z Systems™ functions and plan for their usage. It is not intended as an introduction to mainframes. Readers are expected to be generally familiar with existing IBM z Systems technology and terminology.

Pro Spark Streaming: The Zen of Real-Time Analytics Using Apache Spark

Learn the right cutting-edge skills and knowledge to leverage Spark Streaming to implement a wide array of real-time, streaming applications. This book walks you through end-to-end real-time application development using real-world applications, data, and code. Taking an application-first approach, each chapter introduces use cases from a specific industry and uses publicly available datasets from that domain to unravel the intricacies of production-grade design and implementation. The domains covered in Pro Spark Streaming include social media, the sharing economy, finance, online advertising, telecommunication, and IoT. In the last few years, Spark has become synonymous with big data processing. DStreams enhance the underlying Spark processing engine to support streaming analysis with a novel micro-batch processing model. Pro Spark Streaming by Zubair Nabi will enable you to become a specialist in latency-sensitive applications by leveraging the key features of DStreams, micro-batch processing, and functional programming. To this end, the book includes ready-to-deploy examples and actual code. Pro Spark Streaming will act as the bible of Spark Streaming.

What You'll Learn:
- Discover Spark Streaming application development and best practices
- Work with the low-level details of discretized streams
- Optimize production-grade deployments of Spark Streaming via configuration recipes and instrumentation using Graphite, collectd, and Nagios
- Ingest data from disparate sources including MQTT, Flume, Kafka, Twitter, and a custom HTTP receiver
- Integrate and couple with HBase, Cassandra, and Redis
- Take advantage of design patterns for side-effects and maintaining state across the Spark Streaming micro-batch model
- Implement real-time and scalable ETL using data frames, SparkSQL, Hive, and SparkR
- Use streaming machine learning, predictive analytics, and recommendations
- Mesh batch processing with stream processing via the Lambda architecture

Who This Book Is For: Data scientists, big data experts, BI analysts, and data architects.
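
To make the DStream micro-batch model concrete, here is a minimal, hedged PySpark sketch (the book's own examples are more elaborate and industry-specific): a socket source stands in for Kafka or Flume, and updateStateByKey maintains running counts across micro-batches. The batch interval, checkpoint path, and port are arbitrary choices for this sketch.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="dstream-sketch")
ssc = StreamingContext(sc, 10)             # 10-second micro-batches
ssc.checkpoint("/tmp/dstream-checkpoint")  # required for stateful operations

lines = ssc.socketTextStream("localhost", 9999)
pairs = lines.flatMap(lambda line: line.split(" ")).map(lambda w: (w, 1))

def update(new_values, running_count):
    # Fold this batch's counts into the state carried over from earlier batches.
    return sum(new_values) + (running_count or 0)

running_counts = pairs.updateStateByKey(update)
running_counts.pprint()

ssc.start()
ssc.awaitTermination()
```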