talk-data.com

Topic: Big Data

Tags: data_processing, analytics, large_datasets

Activity Trend: peak of 28 activities per quarter (2020-Q1 to 2026-Q1)

Activities

1217 activities · Newest first

Sams Teach Yourself Apache Spark™ in 24 Hours

Apache Spark is a fast, scalable, and flexible open source distributed processing engine for big data systems and is one of the most active open source big data projects to date. In just 24 lessons of one hour or less, Sams Teach Yourself Apache Spark in 24 Hours helps you build practical Big Data solutions that leverage Spark's amazing speed, scalability, simplicity, and versatility. This book's straightforward, step-by-step approach shows you how to deploy, program, optimize, manage, integrate, and extend Spark, now and for years to come. You'll discover how to create powerful solutions encompassing cloud computing, real-time stream processing, machine learning, and more. Every lesson builds on what you've already learned, giving you a rock-solid foundation for real-world success. Whether you are a data analyst, data engineer, data scientist, or data steward, learning Spark will help you to advance your career or embark on a new career in the booming area of Big Data.

Learn how to:
• Discover what Apache Spark does and how it fits into the Big Data landscape
• Deploy and run Spark locally or in the cloud
• Interact with Spark from the shell
• Make the most of the Spark Cluster Architecture
• Develop Spark applications with Scala and functional Python
• Program with the Spark API, including transformations and actions
• Apply practical data engineering/analysis approaches designed for Spark
• Use Resilient Distributed Datasets (RDDs) for caching, persistence, and output
• Optimize Spark solution performance
• Use Spark with SQL (via Spark SQL) and with NoSQL (via Cassandra)
• Leverage cutting-edge functional programming techniques
• Extend Spark with streaming, R, and Sparkling Water
• Start building Spark-based machine learning and graph-processing applications
• Explore advanced messaging technologies, including Kafka
• Preview and prepare for Spark's next generation of innovations

Instructions walk you through common questions, issues, and tasks; Q-and-As, quizzes, and exercises build and test your knowledge; "Did You Know?" tips offer insider advice and shortcuts; and "Watch Out!" alerts help you avoid pitfalls. By the time you're finished, you'll be comfortable using Apache Spark to solve a wide spectrum of Big Data problems.
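To make the transformations-and-actions idea above concrete, here is a minimal PySpark sketch (not from the book): transformations such as map and filter are lazy descriptions of work, and nothing executes until an action such as collect or count is called. The local master and toy data are illustrative assumptions.

```python
# A minimal sketch of Spark's transformation/action model, assuming a
# local PySpark installation; the data is a toy range, not a real dataset.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-basics").getOrCreate()
sc = spark.sparkContext

# Transformations (map, filter) are lazy: they only describe the computation.
numbers = sc.parallelize(range(1, 11))
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# An action (collect, count) triggers actual execution.
print(evens.collect())   # [4, 16, 36, 64, 100]
print(evens.count())     # 5

spark.stop()
```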

In Search of Database Nirvana

The database pendulum is in full swing. Ten years ago, web-scale companies began moving away from proprietary relational databases to handle big data use cases with NoSQL and Hadoop. Now, for a variety of reasons, the pendulum is swinging back toward SQL-based solutions. What many companies really want is a system that can handle all of their operational, OLTP, BI, and analytic workloads. Could such an all-in-one database exist? This O'Reilly report examines this quest for database nirvana, or what Gartner recently dubbed Hybrid Transaction/Analytical Processing (HTAP). Author Rohit Jain takes an in-depth look at the possibilities and the challenges for companies that long for a single query engine to rule them all.

With this report, you'll explore:
• The challenges of having one query engine support operational, BI, and analytical workloads
• Efforts to produce a query engine that supports multiple storage engines
• Attempts to support multiple data models with the same query engine
• Why an HTAP database engine needs to provide enterprise-caliber capabilities, including high availability, security, and manageability
• How to assess various options for meeting workload requirements with one database engine, or a combination of query and storage engines

Interactive Spark using PySpark

Apache Spark is an in-memory framework that allows data scientists to explore and interact with big data much more quickly than with Hadoop. Python users can work with Spark using an interactive shell called PySpark.

Why is it important? PySpark makes the large-scale data processing capabilities of Apache Spark accessible to data scientists who are more familiar with Python than Scala or Java. This also allows for reuse of a wide variety of Python libraries for machine learning, data visualization, numerical analysis, etc.

What you'll learn, and how you can apply it:
• Compare the different components provided by Spark, and the use cases they fit.
• Learn how to use RDDs (resilient distributed datasets) with PySpark.
• Write Spark applications in Python and submit them to the cluster as Spark jobs.
• Get an introduction to the Spark computing framework.
• Apply this approach to a worked example to determine the most frequent airline delays in a specific month and year.

This lesson is for you because:
• You're a data scientist, familiar with Python coding, who needs to get up and running with PySpark.
• You're a Python developer who needs to leverage the distributed computing resources available on a Hadoop cluster, without learning Java or Scala first.

Prerequisites:
• Familiarity with writing Python applications
• Some familiarity with bash command-line operations
• Basic understanding of how to use simple functional programming constructs in Python, such as closures, lambdas, maps, etc.

Materials or downloads needed in advance: Apache Spark

This lesson is taken from Data Analytics with Hadoop by Jenny Kim and Benjamin Bengfort.
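As a rough illustration of the kind of worked example the lesson describes, here is a hedged PySpark RDD sketch that counts delayed flights per carrier in one month. The file name flights.csv and its column layout (carrier, year, month, delay minutes) are invented for illustration and are not the lesson's actual dataset schema.

```python
# A hedged sketch of the lesson's style of analysis: which airlines delay
# most often in a given month/year. The CSV layout is an assumption.
from pyspark import SparkContext

sc = SparkContext("local[*]", "airline-delays")

lines = sc.textFile("flights.csv")  # hypothetical input path
records = lines.map(lambda line: line.split(","))

delayed = (records
           .filter(lambda r: r[1] == "2015" and r[2] == "1")  # specific year/month
           .filter(lambda r: float(r[3]) > 0)                 # positive delay
           .map(lambda r: (r[0], 1))                          # (carrier, 1)
           .reduceByKey(lambda a, b: a + b))

# Action: bring the top five carriers by delay count back to the driver.
for carrier, count in delayed.takeOrdered(5, key=lambda kv: -kv[1]):
    print(carrier, count)

sc.stop()
```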

The Data and Analytics Playbook

The Data and Analytics Playbook: Proven Methods for Governed Data and Analytic Quality explores the way in which data continues to dominate budgets, along with the varying efforts made across a variety of business enablement projects, including applications, web and mobile computing, big data analytics, and traditional data integration. The book teaches readers how to use proven methods and accelerators to break through data obstacles to provide faster, higher quality delivery of mission-critical programs. Drawing upon years of practical experience, and using numerous examples and an easy-to-understand playbook, Lowell Fryman, Gregory Lampshire, and Dan Meers discuss a simple, proven approach to the execution of multiple data-oriented activities. In addition, they present a clear set of methods to provide reliable governance, controls, risk, and exposure management for enterprise data and the programs that rely upon it. They also discuss a cost-effective approach to providing sustainable governance and quality outcomes that enhance project delivery, while also ensuring ongoing controls. Example activities, templates, outputs, resources, and roles are explored, along with different organizational models in common use today and the ways they can be mapped to leverage playbook data governance throughout the organization.

• Provides a mature and proven playbook approach (methodology) to enabling data governance that supports agile implementation
• Features specific examples of current industry challenges in enterprise risk management, including anti-money laundering and fraud prevention
• Describes business benefit measures and funding approaches using exposure-based cost models that augment risk models for cost avoidance analysis, and accelerated delivery approaches using data integration sprints for application, integration, and information delivery success

In this session, Eloy Sasot, Head of Analytics at News Corp, sat down with Vishal Kumar, CEO of AnalyticsWeek, and shared his journey as an analytics executive, best practices, hacks for upcoming executives, and some of the challenges and opportunities he's observing as a Chief Analytics Officer.

Timeline:

0:29 Eloy's journey
4:43 Why work in a publishing house?
7:16 Non-tech industry doing tech stuff
10:18 Tips for a small business to get started with data science
13:46 Creating a culture of data science in a company
17:23 Convincing leaders towards data science
22:05 Initial days for a leader in creating a data science practice
27:20 Putting together a data science team
29:18 Choosing the right tool
33:00 Keeping oneself tool agnostic
35:20 CDO, CAO, and CTO
38:58 Defining a data scientist at News Corp
42:12 Future of data analytics
46:37 Blaming everything on Big Data

Podcast Link: https://futureofdata.org/563533-2/

Here's Eloy's Bio: Eloy is the CAO at News Corp, a worldwide network of leading companies in diversified media, news, education, and information services, including The Wall Street Journal, Dow Jones, the New York Post, The Times, The Sun, The Australian, HarperCollins, Move, Storyful, and Unruly.

Prior to this, Eloy led Pricing, Data Science, and Data Analytics for HarperCollins Publishers, the second-largest consumer book publisher in the world, with operations in 18 countries, nearly 200 years of history, and more than 65 unique imprints. Since joining HarperCollins in 2011, Eloy pioneered the creation and development of the pricing function, first in the UK, and then extended it internationally for the global company. He worked with his teams and each division around the world to drive data-driven decision-making, with a particular focus on pricing. Besides his global role, he was a board-level director of HarperCollins UK.

He holds an MBA from INSEAD and a Master’s in Mathematical Engineering from INSA Toulouse.

Follow @eloysasot

The podcast is sponsored by TAO.ai (https://tao.ai), an Artificial Intelligence-driven career coach.

About #Podcast:

The FutureOfData podcast is a conversation starter that brings leaders, influencers, and leading practitioners together to discuss their journeys toward creating a data-driven future.

Want to join? If you or anyone you know wants to join in, register your interest at http://play.analyticsweek.com/guest/

Want to sponsor? Email us at [email protected]

Keywords:

#FutureOfData #DataAnalytics #Leadership #Podcast #BigData #Strategy

Practical Hadoop Migration: How to Integrate Your RDBMS with the Hadoop Ecosystem and Re-Architect Relational Applications to NoSQL

Re-architect relational applications to NoSQL, integrate relational database management systems with the Hadoop ecosystem, and transform and migrate relational data to and from Hadoop components. This book covers the best-practice design approaches to re-architecting your relational applications and transforming your relational data to optimize concurrency, security, denormalization, and performance.

Winner of IBM's 2012 Gerstner Award for his implementation of big data and data warehouse initiatives, and author of Practical Hadoop Security, Bhushan Lakhe walks you through the entire transition process. First, he lays out the criteria for deciding what blend of re-architecting, migration, and integration between RDBMS and HDFS best meets your transition objectives. Then he demonstrates how to design your transition model. Lakhe proceeds to cover the selection criteria for ETL tools, the implementation steps for migration with Sqoop- and Flume-based data transfers, and transition optimization techniques for tuning partitions, scheduling aggregations, and redesigning ETL. Finally, he assesses the pros and cons of data lakes and Lambda architecture as integrative solutions and illustrates their implementation with real-world case studies.

Hadoop/NoSQL solutions do not offer by default certain relational technology features such as role-based access control, locking for concurrent updates, and various tools for measuring and enhancing performance. Practical Hadoop Migration shows how to use open-source tools to emulate such relational functionalities in Hadoop ecosystem components.

What You'll Learn:
• Decide whether you should migrate your relational applications to big data technologies or integrate them
• Transition your relational applications to Hadoop/NoSQL platforms in terms of logical design and physical implementation
• Discover RDBMS-to-HDFS integration, data transformation, and optimization techniques
• Consider when to use Lambda architecture and data lake solutions
• Select and implement Hadoop-based components and applications to speed transition, optimize integrated performance, and emulate relational functionalities

Who This Book Is For: Database developers, database administrators, enterprise architects, Hadoop/NoSQL developers, and IT leaders. Its secondary readership is project and program managers and advanced students of database and management information systems.
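The book's migration tooling centers on Sqoop and Flume; as a Python-expressible analogue of the same RDBMS-to-HDFS movement, here is a hedged PySpark sketch that reads a relational table over JDBC and lands it on HDFS as Parquet. The connection URL, table name, credentials, and paths are all hypothetical.

```python
# A minimal sketch of one RDBMS-to-Hadoop transfer path: pull a relational
# table into Spark over JDBC, then land it on HDFS as Parquet. Every
# connection detail below is a placeholder, not a real endpoint.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdbms-to-hdfs").getOrCreate()

orders = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://dbhost:3306/sales")  # hypothetical source
          .option("dbtable", "orders")
          .option("user", "etl_user")
          .option("password", "***")
          .option("numPartitions", 8)            # parallelize the extract
          .option("partitionColumn", "order_id") # numeric split column
          .option("lowerBound", 1)
          .option("upperBound", 1000000)
          .load())

# Write a columnar copy to HDFS for downstream Hadoop jobs.
orders.write.mode("overwrite").parquet("hdfs:///warehouse/orders")
```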

Big Data Analytics with R

Unlock the potential of big data analytics by mastering R programming with this comprehensive guide. This book takes you step-by-step through real-world scenarios where R's capabilities shine, providing you with practical skills to handle, process, and analyze large and complex datasets effectively.

What this book will help me do:
• Understand the latest big data processing methods and how R can enhance their application.
• Set up and use big data platforms such as Hadoop and Spark in conjunction with R.
• Utilize R for practical big data problems, such as analyzing consumption and behavioral datasets.
• Integrate R with SQL and NoSQL databases to maximize its versatility in data management.
• Discover advanced machine learning implementations using R and Spark MLlib for predictive analytics.

Author(s): Walkowiak is an experienced data analyst and R programming expert with a passion for data engineering and machine learning. With deep knowledge of big data platforms and extensive teaching experience, they bring a clear and approachable writing style to help learners excel.

Who is it for? Ideal for data analysts, scientists, and engineers with fundamental data analysis knowledge looking to enhance their big data capabilities using R. If you aim to adapt R for large-scale data management and analysis workflows, this book is your ideal companion to bridge the gap.

In this session, Michael O'Connell, Chief Analytics Officer at TIBCO Software, sat down with Vishal Kumar, CEO of AnalyticsWeek, and shared his journey as a chief analytics executive, best practices and cultural hacks for upcoming executives, his perspective on the changing BI landscape and how businesses can leverage it, and some of the challenges and opportunities he's observing across various industries.

Timeline:

0:28 Michael's journey
4:12 CDO, CAO, and CTO
7:30 Adoption of data analytics capabilities
9:55 The BI industry dealing with the latest in data analytics
12:10 Future of stats
14:58 Creating a center of excellence with data
18:00 Evolution of data in BI
21:40 Small businesses getting started with data analytics
24:35 First steps in the process of becoming a data-driven company
26:28 Convincing leaders towards data science
28:20 Shortest route to become a data scientist
29:49 A typical day in Michael's life

Podcast Link: https://futureofdata.org/analyticsweek-leadership-podcast-with-michael-oconnell-tibco-software/

Here's Michael's Bio: Michael O'Connell is Chief Analytics Officer at TIBCO Software, developing analytic solutions across a number of industries, including Financial Services, Energy, Life Sciences, Consumer Goods & Retail, and Telco, Media & Networks. Michael has been working on analytics software applications for the past 20 years and has published more than 50 papers and several software packages on analytics methodology and applications. He did his Ph.D. work in Statistics at North Carolina State University and is an Adjunct Professor of Statistics in that department.

Follow @michoconnell

The podcast is sponsored by TAO.ai (https://tao.ai), an Artificial Intelligence-driven career coach.

About #Podcast:

The FutureOfData podcast is a conversation starter that brings leaders, influencers, and leading practitioners together to discuss their journeys toward creating a data-driven future.

Want to join? If you or anyone you know wants to join in, register your interest at http://play.analyticsweek.com/guest/

Want to sponsor? Email us at [email protected]

Keywords:

#FutureOfData #DataAnalytics #Leadership #Podcast #BigData #Strategy

The Big Data Market

Which companies have adopted technologies such as Hadoop and Spark, as well as data science in general? And which industries are lagging behind? This O'Reilly report provides the results of a unique, data-driven analysis of the market for big data products and technologies. Using eye-catching charts and visualizations, Spiderbook cofounder Aman Naimat highlights some surprising results from the analysis, such as:
• The relatively small number of companies using big data in production
• Industries that have embraced big data the most, and the least
• The amount of money spent on various big data use cases
• How many companies actually use "fast data"

The results also reveal the geographical locations where companies have been quick to adopt big data, as well as the types of teams that use big data technology. In addition, Naimat takes you through the analysis process with Spiderbook's graph-based machine-learning model. The company analyzed billions of publicly available documents, canvassed more than 500,000 companies, and searched the entire business internet to compile the most comprehensive results possible.

R: Data Analysis and Visualization

Master the art of building analytical models using R.

About This Book:
• Load, wrangle, and analyze your data using the world's most powerful statistical programming language
• Build and customize publication-quality visualizations of powerful and stunning R graphs
• Develop key skills and techniques with R to create and customize data mining algorithms
• Use R to optimize your trading strategy and build up your own risk management system
• Discover how to build machine learning algorithms, prepare data, and dig deep into data prediction techniques with R

Who This Book Is For: This course is for data scientists or quantitative analysts who want to learn R and take advantage of its powerful analytical design framework. It's a seamless journey in becoming a full-stack R developer.

What You Will Learn:
• Describe and visualize the behavior of data and relationships between data
• Gain a thorough understanding of statistical reasoning and sampling
• Handle missing data gracefully using multiple imputation
• Create diverse types of bar charts using the default R functions
• Familiarize yourself with algorithms written in R for spatial data mining, text mining, and so on
• Understand relationships between market factors and their impact on your portfolio
• Harness the power of R to build machine learning algorithms with real-world data science applications
• Learn specialized machine learning techniques for text mining, big data, and more

In Detail: The R learning path created for you has five connected modules, each a mini-course in its own right. As you complete each one, you'll have gained key skills and be ready for the material in the next module! This course begins with the Data Analysis with R module, which will help you navigate the R environment. You'll gain a thorough understanding of statistical reasoning and sampling, and you'll be able to put best practices into effect to make your job easier and facilitate reproducibility. The second module, R Graphs, will help you leverage powerful default R graphics and utilize advanced graphics systems such as lattice and ggplot2, the grammar of graphics. You'll learn how to produce, customize, and publish advanced visualizations using this popular and powerful framework. With the third module, Learning Data Mining with R, you will learn how to manipulate data with R using code snippets and be introduced to mining frequent patterns, associations, and correlations while working with R programs. The Mastering R for Quantitative Finance module pragmatically introduces both quantitative finance concepts and their modeling in R, enabling you to build a tailor-made trading system on your own. By the end of the module, you will be well-versed in various financial techniques using R and will be able to place good bets when making financial decisions. Finally, the Machine Learning with R module gives you the analytical tools you need to gain insights from complex data and teaches you how to choose the correct algorithm for your specific needs. You'll also learn to apply machine learning methods to common tasks, including classification, prediction, and forecasting.

Style and approach: Learn data analysis, data visualization techniques, data mining, and machine learning, all using R, and learn to build models in quantitative finance using this powerful language.

Pro Spark Streaming: The Zen of Real-Time Analytics Using Apache Spark

Learn the right cutting-edge skills and knowledge to leverage Spark Streaming to implement a wide array of real-time, streaming applications. This book walks you through end-to-end real-time application development using real-world applications, data, and code. Taking an application-first approach, each chapter introduces use cases from a specific industry and uses publicly available datasets from that domain to unravel the intricacies of production-grade design and implementation. The domains covered in Pro Spark Streaming include social media, the sharing economy, finance, online advertising, telecommunication, and IoT.

In the last few years, Spark has become synonymous with big data processing. DStreams enhance the underlying Spark processing engine to support streaming analysis with a novel micro-batch processing model. Pro Spark Streaming by Zubair Nabi will enable you to become a specialist in latency-sensitive applications by leveraging the key features of DStreams, micro-batch processing, and functional programming. To this end, the book includes ready-to-deploy examples and actual code. Pro Spark Streaming will act as the bible of Spark Streaming.

What You'll Learn:
• Discover Spark Streaming application development and best practices
• Work with the low-level details of discretized streams
• Optimize production-grade deployments of Spark Streaming via configuration recipes and instrumentation using Graphite, collectd, and Nagios
• Ingest data from disparate sources including MQTT, Flume, Kafka, Twitter, and a custom HTTP receiver
• Integrate and couple with HBase, Cassandra, and Redis
• Take advantage of design patterns for side effects and maintaining state across the Spark Streaming micro-batch model
• Implement real-time and scalable ETL using data frames, SparkSQL, Hive, and SparkR
• Use streaming machine learning, predictive analytics, and recommendations
• Mesh batch processing with stream processing via the Lambda architecture

Who This Book Is For: Data scientists, big data experts, BI analysts, and data architects.
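For readers new to DStreams, here is a minimal sketch of the micro-batch model the book is built on, assuming Spark's streaming module and a plain text socket source (for example, `nc -lk 9999`); it is an illustration, not code from the book.

```python
# A minimal DStream sketch: each 10-second batch of socket text is processed
# as a small RDD job, which is the micro-batch model in a nutshell.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "dstream-sketch")  # >=2 threads: receiver + work
ssc = StreamingContext(sc, batchDuration=10)     # 10-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                  # print each batch's counts

ssc.start()
ssc.awaitTermination()
```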

Spark GraphX in Action

Spark GraphX in Action starts out with an overview of Apache Spark and the GraphX graph processing API. This example-based tutorial then teaches you how to configure GraphX and how to use it interactively. Along the way, you'll collect practical techniques for enhancing applications and applying machine learning algorithms to graph data.

About the Technology: GraphX is a powerful graph processing API for the Apache Spark analytics engine that lets you draw insights from large datasets. GraphX gives you unprecedented speed and capacity for running massively parallel and machine learning algorithms.

About the Book: Spark GraphX in Action begins with the big picture of what graphs can be used for. This example-based tutorial teaches you how to use GraphX interactively. You'll start with a crystal-clear introduction to building big data graphs from regular data, and then explore the problems and possibilities of implementing graph algorithms and architecting graph processing pipelines.

What's Inside:
• Understanding graph technology
• Using the GraphX API
• Developing algorithms for big graphs
• Machine learning with graphs
• Graph visualization

About the Reader: Readers should be comfortable writing code. Experience with Apache Spark and Scala is not required.

About the Authors: Michael Malak has worked on Spark applications for Fortune 500 companies since early 2013. Robin East has worked as a consultant to large organizations for over 15 years and is a data scientist at Worldpay.

Quotes:
• "Learn complex graph processing from two experienced authors... A comprehensive guide." - Gaurav Bhardwaj, 3Pillar Global
• "The best resource to go from GraphX novice to expert in the least amount of time." - Justin Fister, PaperRater
• "A must-read for anyone serious about large-scale graph data mining!" - Antonio Magnaghi, OpenMail
• "Reveals the awesome and elegant capabilities of working with linked data for large-scale datasets." - Sumit Pal, independent consultant
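GraphX itself is a Scala/JVM API, so a Python reader would typically reach for GraphFrames, its DataFrame-based analogue. The sketch below, with invented vertices and edges, shows the flavor of graph work the book covers; it assumes the graphframes package is on the Spark classpath and is not the book's own code.

```python
# A hedged GraphFrames sketch (Python analogue of GraphX): build a tiny
# follow-graph and run PageRank, one of the stock graph algorithms.
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("graph-sketch").getOrCreate()

# Invented toy graph: three people following each other in a cycle.
vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b", "follows"), ("b", "c", "follows"), ("c", "a", "follows")],
    ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)

ranks = g.pageRank(resetProbability=0.15, maxIter=10)
ranks.vertices.select("id", "pagerank").show()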

The Data Industry

Provides an introduction of the data industry to the field of economics. This book bridges the gap between economics and data science to help data scientists understand the economics of big data, and to enable economists to analyze the data industry. It begins by explaining data resources and introduces the data asset. The book defines a data industry chain, enumerates data enterprises' business models versus operating models, and proposes a mode of industrial development for the data industry. The author describes five types of enterprise agglomerations and multiple industrial cluster effects. A discussion on the establishment and development of data industry related laws and regulations is provided. In addition, this book discusses several scenarios on how to convert data driving forces into productivity that can then serve society. This book is designed to serve as a reference and training guide for data scientists, data-oriented managers and executives, entrepreneurs, scholars, and government employees.

• Defines and develops the concept of a "Data Industry," and explains the economics of data to data scientists and statisticians
• Includes numerous case studies and examples from a variety of industries and disciplines
• Serves as a useful guide for practitioners and entrepreneurs in the business of data technology

The Data Industry: The Business and Economics of Information and Big Data is a resource for practitioners in the data science industry, government, and students in economics, business, and statistics.

CHUNLEI TANG, Ph.D., is a research fellow at Harvard University. She is the co-founder of Fudan's Institute for Data Industry and proposed the concept of the "data industry." She received a Ph.D. in Computer and Software Theory in 2012 and a Master of Software Engineering in 2006 from Fudan University, Shanghai, China.

Python: Real-World Data Science

Unleash the power of Python and its robust data science capabilities.

About This Book:
• Unleash the power of Python 3 objects
• Learn to use powerful Python libraries for effective data processing and analysis
• Harness the power of Python to analyze data and create insightful predictive models
• Unlock deeper insights into machine learning with this vital guide to cutting-edge predictive analytics

Who This Book Is For: Entry-level analysts who want to enter the data science world will find this course very useful for getting acquainted with Python's data science capabilities and doing real-world data analysis.

What You Will Learn:
• Install and set up Python
• Implement objects in Python by creating classes and defining methods
• Get acquainted with NumPy to use it with arrays and array-oriented computing in data analysis
• Create effective visualizations for presenting your data using Matplotlib
• Process and analyze data using the time series capabilities of pandas
• Interact with different kinds of database systems, such as file, disk format, Mongo, and Redis
• Apply data mining concepts to real-world problems
• Compute on big data, including real-time data from the Internet
• Explore how to use different machine learning models to ask different questions of your data

In Detail: The Python: Real-World Data Science course will take you on a journey to become an efficient data science practitioner by thoroughly understanding the key concepts of Python. This learning path is divided into four modules, each a mini-course in its own right; as you complete each one, you'll have gained key skills and be ready for the material in the next module. The course begins by getting your Python fundamentals nailed down. After you're familiar with Python's core concepts, it's time to dive into the field of data science. In the second module, you'll learn how to perform data analysis using Python in a practical, example-driven way. The third module will teach you how to design and develop data mining applications using a variety of datasets, starting with basic classification and affinity analysis and moving to more complex data types, including text, images, and graphs. Machine learning and predictive analytics have become the most important approaches to uncovering data gold mines. In the final module, we'll discuss the necessary details of machine learning concepts, offering intuitive yet informative explanations of how machine learning algorithms work, how to use them, and, most importantly, how to avoid their common pitfalls.

Style and approach: This course includes all the resources that will help you jump into the data science field with Python and learn how to make sense of data. The aim is to create a smooth learning path that will teach you how to get started with powerful Python libraries and perform various data science techniques in depth.
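As a taste of the pandas time-series material mentioned in the second module, here is a small self-contained sketch; the synthetic readings and window sizes are invented for illustration.

```python
# A small pandas time-series sketch: resample hourly data to daily means,
# then smooth with a rolling window. The data is synthetic.
import numpy as np
import pandas as pd

# One month of hourly readings (e.g. a sensor), generated at random.
idx = pd.date_range("2016-01-01", periods=24 * 31, freq="h")
readings = pd.Series(np.random.default_rng(0).normal(20, 5, len(idx)), index=idx)

daily = readings.resample("D").mean()     # downsample to daily averages
rolling = daily.rolling(window=7).mean()  # 7-day smoothing

print(daily.head())
print(rolling.dropna().head())
```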

Big Data

Big Data: Principles and Paradigms captures the state-of-the-art research on the architectural aspects, technologies, and applications of Big Data. The book identifies potential future directions and technologies that facilitate insight into numerous scientific, business, and consumer applications. To help realize Big Data's full potential, the book addresses numerous challenges, offering the conceptual and technological solutions for tackling them. These challenges include life-cycle data management, large-scale storage, flexible processing infrastructure, data modeling, scalable machine learning, data analysis algorithms, sampling techniques, and privacy and ethical issues.

• Covers computational platforms supporting Big Data applications
• Addresses key principles underlying Big Data computing
• Examines key developments supporting next-generation Big Data platforms
• Explores the challenges in Big Data computing and ways to overcome them
• Contains expert contributors from both academia and industry

Implementing an Optimized Analytics Solution on IBM Power Systems

This IBM® Redbooks® publication addresses how to use the virtualization strengths of the IBM POWER8® platform to solve clients' system resource utilization challenges and maximize systems' throughput and capacity. This book addresses performance tuning topics that will help answer clients' complex analytic workload requirements, help maximize systems' resources, and provide expert-level documentation to transfer the how-to skills to worldwide teams. This book strengthens the position of IBM Analytics and Big Data solutions with a well-defined and documented deployment model within a POWER8 virtualized environment, offering clients a planned foundation for security, scaling, capacity, resilience, and optimization for analytics workloads. This book is targeted toward technical professionals (analytics consultants, technical support staff, IT Architects, and IT Specialists) who are responsible for providing analytics solutions and support on IBM Power Systems™.

Apache Spark Machine Learning Blueprints

In Apache Spark Machine Learning Blueprints, you'll explore how to create sophisticated and scalable machine learning projects using Apache Spark. This project-driven guide covers practical applications including fraud detection, customer analysis, and recommendation engines, helping you leverage Spark's capabilities for advanced data science tasks.

What this book will help me do:
• Learn to set up Apache Spark efficiently for machine learning projects, unlocking its powerful processing capabilities.
• Integrate Apache Spark with R for detailed analytical insights, empowering your decision-making processes.
• Create predictive models for use cases including customer scoring, fraud detection, and risk assessment, with practical implementations.
• Understand and utilize Spark's parallel computing architecture for large-scale machine learning tasks.
• Develop and refine recommendation systems capable of handling large user bases and datasets using Spark.

Author(s): Alex Liu is a seasoned data scientist and software developer specializing in machine learning and big data technology. With extensive experience in using Apache Spark for predictive analytics, Alex has successfully built and deployed scalable solutions across industries. Their teaching approach combines theory and practical insights, making cutting-edge technologies accessible and actionable.

Who is it for? This book is ideal for data analysts, data scientists, and developers with a foundation in machine learning who are eager to apply their knowledge in big data contexts. If you have a basic familiarity with Apache Spark and its ecosystem, and you're looking to enhance your ability to build machine learning applications, this resource is for you. It's particularly valuable for those aiming to utilize Spark for extensive data operations and gain practical, project-based insights.
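To show the shape of a Spark MLlib model of the kind the book builds for scoring and fraud detection, here is a hedged PySpark pipeline sketch; the feature columns and toy rows are invented for illustration and are not the book's datasets.

```python
# A hedged Spark MLlib sketch: assemble features and fit a logistic
# regression classifier, the basic shape of a scoring/fraud model.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Invented rows: transaction amount, transactions per day, fraud label.
df = spark.createDataFrame(
    [(120.0, 3, 0.0), (950.0, 14, 1.0), (45.0, 1, 0.0), (1300.0, 22, 1.0)],
    ["amount", "txn_per_day", "label"])

assembler = VectorAssembler(inputCols=["amount", "txn_per_day"],
                            outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

model = Pipeline(stages=[assembler, lr]).fit(df)
model.transform(df).select("amount", "probability", "prediction").show()
```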

Streaming Architecture

More and more data-driven companies are looking to adopt stream processing and streaming analytics. With this concise ebook, you'll learn best practices for designing a reliable architecture that supports this emerging big-data paradigm. Authors Ted Dunning and Ellen Friedman (Real World Hadoop) help you explore some of the best technologies to handle stream processing and analytics, with a focus on the upstream queuing or message-passing layer. To illustrate the effectiveness of these technologies, this book also includes specific use cases.

Ideal for developers and non-technical people alike, this book describes:
• Key elements in good design for streaming analytics, focusing on the essential characteristics of the messaging layer
• New messaging technologies, including Apache Kafka and MapR Streams, with links to sample code
• Technology choices for streaming analytics: Apache Spark Streaming, Apache Flink, Apache Storm, and Apache Apex
• How stream-based architectures are helpful to support microservices
• Specific use cases such as fraud detection and geo-distributed data streams

Ted Dunning is Chief Applications Architect at MapR Technologies and active in the open source community. He currently serves as VP for Incubator at the Apache Foundation, as a champion and mentor for a large number of projects, and as committer and PMC member of the Apache ZooKeeper and Drill projects. Ted is on Twitter as @ted_dunning.

Ellen Friedman, a committer for the Apache Drill and Apache Mahout projects, is a solutions consultant and well-known speaker and author, currently writing mainly about big data topics. With a PhD in Biochemistry, she has years of experience as a research scientist and has written about a variety of technical topics. Ellen is on Twitter as @Ellen_Friedman.
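For a concrete feel of the message-passing layer the authors focus on, here is a minimal sketch using the kafka-python client against a hypothetical local broker and an invented card-transactions topic (MapR Streams is not shown).

```python
# A minimal Kafka sketch: a producer appends JSON events to a topic and a
# consumer reads them back. Broker address and topic name are placeholders.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"))

# Producers append to the log; consumers read independently, which is what
# decouples event producers from downstream analytics.
producer.send("card-transactions", {"card": "4111", "amount": 42.0})
producer.flush()

consumer = KafkaConsumer(
    "card-transactions",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating after 5s of silence
    value_deserializer=lambda b: json.loads(b.decode("utf-8")))

for message in consumer:
    print(message.value)
```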

Professional Hadoop

The professional's one-stop guide to this open-source, Java-based big data framework. Professional Hadoop is the complete reference and resource for experienced developers looking to employ Apache Hadoop in real-world settings. Written by an expert team of certified Hadoop developers, committers, and Summit speakers, this book details every key aspect of Hadoop technology to enable optimal processing of large data sets. Designed expressly for the professional developer, this book skips over the basics of database development to get you acquainted with the framework's processes and capabilities right away. The discussion covers each key Hadoop component individually, culminating in a sample application that brings all of the pieces together to illustrate the cooperation and interplay that make Hadoop a major big data solution. Coverage includes everything from storage and security to computing and user experience, with expert guidance on integrating other software and more.

Hadoop is quickly reaching significant market usage, and more and more developers are being called upon to develop big data solutions using the Hadoop framework. This book covers the process from beginning to end, providing a crash course for professionals needing to learn and apply Hadoop quickly.
• Configure storage, user experience, and in-memory computing
• Integrate Hadoop with other programs, including Kafka and Storm
• Master the fundamentals of Apache Bigtop and Ignite
• Build robust data security with expert tips and advice

Hadoop's popularity is largely due to its accessibility. Open-source and written in Java, the framework offers almost no barrier to entry for experienced database developers already familiar with the skills and requirements real-world programming entails. Professional Hadoop gives you the practical information and framework-specific skills you need quickly.

Threat Forecasting

Drawing upon years of practical experience and using numerous examples and illustrative case studies, Threat Forecasting: Leveraging Big Data for Predictive Analysis discusses important topics, including the danger of using historic data as the basis for predicting future breaches, how to use security intelligence as a tool to develop threat forecasting techniques, and how to use threat data visualization techniques and threat simulation tools. Readers will gain valuable security insights into unstructured big data, along with tactics on how to use the data to their advantage to reduce risk.
• Presents case studies and actual data to demonstrate threat data visualization techniques and threat simulation tools
• Explores the usage of kill chain modelling to inform actionable security intelligence
• Demonstrates a methodology that can be used to create a full threat forecast analysis for enterprise networks of any size