talk-data.com

Topic

Big Data

data_processing analytics large_datasets

1217 tagged

Activity Trend

Peak 28 activities/quarter · 2020-Q1 to 2026-Q1

Activities

1217 activities · Newest first

Learning YARN

"Learning YARN" is your comprehensive guide to master YARN, the resource management layer in the Hadoop ecosystem. Through the book, you'll leverage YARN's capabilities for big data processing, learning to deploy, manage, and scale Hadoop-YARN clusters. What this Book will help me do Understand the main features and benefits of the YARN framework. Gain experience managing Hadoop clusters of varying sizes. Learn to integrate YARN with domain-specific big data tools like Spark. Become skilled at administration and configuration of YARN. Develop and run your own YARN-based applications for distributed computing. Author(s) Akhil Arora and Shrey Mehrotra bring with them years of experience working in big data frameworks and technologies. With expertise in YARN specifically, they aim to bridge the gap for developers and administrators to learn and implement scalable big data solutions. Their extensive knowledge in cluster management and distributed data processing shines through in how this book is structured and detailed. Who is it for? This book is ideal for software developers, big data engineers, and system administrators interested in advancing their knowledge in resource management in Hadoop systems. If you have basic familiarity with Hadoop and need a deeper understanding or feature knowledge of YARN for professional growth, this book is tailored for you. It is also suitable for learners seeking to integrate big data platforms like Spark into YARN clusters.

Structured Search for Big Data

The WWW era made billions of people dramatically dependent on the progress of data technologies, of which Internet search and Big Data are arguably the most notable. The Structured Search paradigm connects them via the fundamental concept of key-objects, which evolve out of keywords as the units of search. The key-object data model and KeySQL revamp the data independence principle, making it applicable to Big Data, and complement NoSQL with full-blown structured querying functionality. The ultimate goal is extracting Big Information from Big Data. As a Big Data Consultant, Mikhail Gilula combines an academic background with 20 years of industry experience in database and data warehousing technologies, working as a Sr. Data Architect for Teradata, Alcatel-Lucent, and PayPal, among others. He has authored three books, including The Set Model for Database and Information Systems, and holds four US Patents in Structured Search and Data Integration.
• Conceptualizes structured search as a technology for querying multiple data sources in an independent and scalable manner.
• Explains how NoSQL and KeySQL complement each other and serve different needs with respect to big data.
• Shows the place of structured search in the evolution of the internet and describes its implementations, including real-time structured internet search.

Emerging Trends in Computational Biology, Bioinformatics, and Systems Biology

Emerging Trends in Computational Biology, Bioinformatics, and Systems Biology discusses the latest developments in all aspects of computational biology, bioinformatics, and systems biology, and the application of data analytics and algorithms, mathematical modeling, and simulation techniques.
• Discusses the development and application of data-analytical and theoretical methods, mathematical modeling, and computational simulation techniques to the study of biological and behavioral systems, including applications in cancer research, computational intelligence and drug design, high-performance computing, and biology, as well as cloud and grid computing for the storage and access of big data sets.
• Presents a systematic approach for storing, retrieving, organizing, and analyzing biological data using software tools, with applications to general principles of DNA/RNA structure, bioinformatics and applications, genomes, protein structure, modeling and classification, and microarray analysis.
• Provides a systems biology perspective, including general guidelines and techniques for obtaining, integrating, and analyzing complex data sets from multiple experimental sources using computational tools and software. Topics covered include phenomics, genomics, epigenomics/epigenetics, metabolomics, cell cycle and checkpoint control, and systems biology and vaccination research.
• Explains how to effectively harness the power of Big Data tools when data sets are so large and complex that it is difficult to process them using conventional database management systems or traditional data processing applications.

Pro Couchbase Development: A NoSQL Platform for the Enterprise

Pro Couchbase Development: A NoSQL Platform for the Enterprise discusses programming for Couchbase using Java and scripting languages, querying and searching, handling migration, and integrating Couchbase with Hadoop, HDFS, and JSON. It also discusses migration from other NoSQL databases such as MongoDB. This book is for big data developers who use the Couchbase NoSQL database or want to use Couchbase for their web applications, as well as for those migrating from other NoSQL databases like MongoDB and Cassandra. For example, one reason to migrate from Cassandra is that, unlike Couchbase, it is not based on the JSON document model, which supports a flexible schema without requiring you to define columns and supercolumns. The target audience is largely Java developers, but the book also supports PHP and Ruby developers who want to learn about Couchbase. The author supplies examples in Java, PHP, Ruby, and JavaScript. After reading and using this hands-on guide to developing with Couchbase, you'll be able to build complex enterprise, database, and cloud applications that leverage this powerful platform.

Machine Learning with R - Second Edition

Machine Learning with R (Second Edition) provides a thorough introduction to machine learning techniques and their application using the R programming language. You'll gain hands-on experience implementing various algorithms and solving real-world data challenges, making it an invaluable resource for aspiring data scientists and analysts.

What this Book will help me do
• Understand the fundamentals of machine learning and its applications in data analysis.
• Master the use of R for cleaning, exploring, and visualizing data to prepare it for modeling.
• Build and apply machine learning models for classification, prediction, and clustering tasks.
• Evaluate and fine-tune model performance to ensure accurate predictions.
• Explore advanced topics like text mining, handling social network data, and big data analytics.

Author(s)
Brett Lantz is a data scientist with significant experience as both a practitioner and communicator in the machine learning field. With a focus on accessibility, he aims to demystify complex concepts for readers interested in data science. His blend of hands-on methods and theoretical insight has made his work a favorite for both beginners and experienced professionals.

Who is it for?
Ideal for data analysts and aspiring data scientists who have intermediate programming skills and are exploring machine learning. Perfect for R users ready to expand their skill set to include predictive modeling techniques. Also fits those with some experience in machine learning but new to the R environment. Provides insightful guidance for anyone looking to apply machine learning in practical, real-world scenarios.
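The book's own examples are written in R; as a language-neutral illustration of the classification workflow it walks through (split, train, evaluate), here is a rough scikit-learn sketch of the k-nearest-neighbors approach covered early in the book. The dataset and parameters are illustrative, not the book's.

```python
# Split data, fit a k-NN classifier, and score it on held-out data.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))  # held-out accuracy
```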

Podcast episode
by Kyle Polich, Benjamin Uminsky (Los Angeles County Registrar-Recorder/County Clerk)

In this episode, Benjamin Uminsky enlightens us about some of the ways the Los Angeles County Registrar-Recorder/County Clerk leverages data science and analysis to be more effective and efficient in the services it provides to citizens. Our topics range from forecasting to predicting the likelihood that people will volunteer to be poll workers. Benjamin recently spoke at Big Data Day LA. Videos have not yet been posted, but you can see the slides from his talk, Data Mining Forecasting and BI at the RRCC, if this episode has left you hungry to learn more. During the show, Benjamin encouraged any Los Angeles residents who have some time to serve their community to consider becoming a poll worker.

Spark Cookbook

Spark Cookbook is your practical guide to mastering Apache Spark, encompassing a comprehensive set of patterns and examples. Through its more than 60 recipes, you will gain actionable insights into using Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX effectively for your big data needs.

What this Book will help me do
• Understand how to install and configure Apache Spark in various environments.
• Build data pipelines and perform real-time analytics with Spark Streaming.
• Utilize Spark SQL for interactive data querying and reporting.
• Apply machine learning workflows using MLlib, including supervised and unsupervised models.
• Develop optimized big data solutions and integrate them into enterprise platforms.

Author(s)
Yadav, the author of Spark Cookbook, is an experienced data engineer and technical expert with deep insights into big data processing frameworks. Yadav has spent years working with Spark and its ecosystem, providing practical guidance to developers and data scientists alike. This book reflects that commitment to sharing actionable knowledge.

Who is it for?
This book is designed for data engineers, developers, and data scientists who work with big data systems and wish to utilize Apache Spark effectively. Whether you're looking to optimize existing Spark applications or explore its libraries for new use cases, this book will provide the guidance you need. A basic familiarity with big data concepts and programming in languages like Java or Python is recommended to make the most of this book.
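In the spirit of the book's recipes, here is a minimal PySpark sketch showing rows loaded into a DataFrame, registered for Spark SQL, and queried with an aggregation; it assumes a local Spark installation, and the table and column names are invented for the demo.

```python
# Create a DataFrame, expose it to Spark SQL, and run a GROUP BY aggregation.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-cookbook-demo").getOrCreate()

rows = [("web", 120), ("mobile", 85), ("web", 60)]
df = spark.createDataFrame(rows, ["channel", "visits"])

df.createOrReplaceTempView("traffic")
spark.sql(
    "SELECT channel, SUM(visits) AS total FROM traffic GROUP BY channel"
).show()

spark.stop()
```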

Virtualizing Hadoop: How to Install, Deploy, and Optimize Hadoop in a Virtualized Architecture

Plan and Implement Hadoop Virtualization for Maximum Performance, Scalability, and Business Agility Enterprises running Hadoop must absorb rapid changes in big data ecosystems, frameworks, products, and workloads. Virtualized approaches can offer important advantages in speed, flexibility, and elasticity. Now, a world-class team of enterprise virtualization and big data experts guide you through the choices, considerations, and tradeoffs surrounding Hadoop virtualization. The authors help you decide whether to virtualize Hadoop, deploy Hadoop in the cloud, or integrate conventional and virtualized approaches in a blended solution. First, Virtualizing Hadoop reviews big data and Hadoop from the standpoint of the virtualization specialist. The authors demystify MapReduce, YARN, and HDFS and guide you through each stage of Hadoop data management. Next, they turn the tables, introducing big data experts to modern virtualization concepts and best practices. Finally, they bring Hadoop and virtualization together, guiding you through the decisions you’ll face in planning, deploying, provisioning, and managing virtualized Hadoop. From security to multitenancy to day-to-day management, you’ll find reliable answers for choosing your best Hadoop strategy and executing it. Coverage includes the following:
• Reviewing the frameworks, products, distributions, use cases, and roles associated with Hadoop
• Understanding YARN resource management, HDFS storage, and I/O
• Designing data ingestion, movement, and organization for modern enterprise data platforms
• Defining SQL engine strategies to meet strict SLAs
• Considering security, data isolation, and scheduling for multitenant environments
• Deploying Hadoop as a service in the cloud
• Reviewing the essential concepts, capabilities, and terminology of virtualization
• Applying current best practices, guidelines, and key metrics for Hadoop virtualization
• Managing multiple Hadoop frameworks and products as one unified system
• Virtualizing master and worker nodes to maximize availability and performance
• Installing and configuring Linux for a Hadoop environment

This mini-episode is a high-level explanation of the basic idea behind MapReduce, a fundamental concept in big data. The idea originates from a Google paper titled MapReduce: Simplified Data Processing on Large Clusters. The episode uses the analogy of tabulating paper voting ballots to help explain how and why MapReduce is an important concept.
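To make the episode's ballot-counting analogy concrete, here is a toy single-process Python sketch of the map, shuffle, and reduce phases; a real framework would distribute these same steps across many machines.

```python
# Toy MapReduce: count votes per candidate.

from collections import defaultdict

ballots = ["alice", "bob", "alice", "carol", "alice", "bob"]

# Map: each ballot becomes a (candidate, 1) key-value pair.
mapped = [(candidate, 1) for candidate in ballots]

# Shuffle: group values by key, as the framework would across machines.
groups = defaultdict(list)
for candidate, count in mapped:
    groups[candidate].append(count)

# Reduce: sum each group independently (the parallelizable step).
totals = {candidate: sum(counts) for candidate, counts in groups.items()}
print(totals)  # {'alice': 3, 'bob': 2, 'carol': 1}
```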

IBM Software Defined Infrastructure for Big Data Analytics Workloads

This IBM® Redbooks® publication documents how IBM Platform Computing, with its IBM Platform Symphony® MapReduce framework, IBM Spectrum Scale (based upon IBM GPFS™), IBM Platform LSF®, and the Advanced Service Controller for Platform Symphony work together as an infrastructure to manage not just Hadoop-related offerings but many popular industry offerings, such as Apache Spark, Storm, MongoDB, Cassandra, and so on. It describes the different ways to run Hadoop in a big data environment and demonstrates how IBM Platform Computing solutions, such as Platform Symphony and Platform LSF with its MapReduce Accelerator, can improve the performance and agility of running Hadoop on distributed workload managers offered by IBM. This information is for technical professionals (consultants, technical support staff, IT architects, and IT specialists) who are responsible for delivering cost-effective cloud services and big data solutions on IBM Power Systems™ to help uncover insights in clients' data so they can optimize product development and business results.

Mastering Matplotlib

Mastering Matplotlib provides readers with the tools to not just create visualizations but to fully harness the capabilities of the Matplotlib library. You will explore advanced features, work on interactive visualizations, and learn to optimize plots for various platforms and datasets. By the end, you will be adept at using Matplotlib in complex projects involving data analysis and visualization.

What this Book will help me do
• Understand the architecture and internals of Matplotlib to better utilize and extend its features.
• Develop visually dynamic and interactive plots that update in real time with changes in the user interface.
• Leverage third-party libraries to visualize complex datasets and relationships efficiently.
• Create tailored styling for visualizations, meeting publication and presentation standards.
• Deploy and integrate Matplotlib-based visualizations into cloud environments and big data workflows seamlessly.

Author(s)
Duncan M. McGreggor is a seasoned software engineer with years of hands-on experience in data visualization and scientific computing. He specializes in utilizing Matplotlib for dynamic charting and advanced plotting use cases. His approach to writing focuses on empowering readers to apply and integrate visualization solutions in real-world scenarios.

Who is it for?
This book is ideal for scientists, software engineers, programmers, and students who have a foundational understanding of Matplotlib and are looking to take their skills to an advanced level. If you're aiming to leverage Matplotlib to handle intricate datasets or to create sophisticated visual representations, this book is for you. It caters to learners seeking practical guidance for professional or academic projects. Expand your visualization toolkit with this insightful guide.
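As a small illustration of the object-oriented Matplotlib API the book builds on, here is a short, self-contained example; the data and output filename are invented for the demo.

```python
# Explicit Figure/Axes construction with styled, exportable output.

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot([0, 1, 2, 3], [0, 1, 4, 9], marker="o", label="y = x^2")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Object-oriented Matplotlib API")
ax.legend()
fig.savefig("demo.png", dpi=150)  # high-resolution export for publication
```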

Bioinformatics with Python Cookbook

Dive into the intersection of biology and data science with 'Bioinformatics with Python Cookbook.' This book equips you to leverage Python and its ecosystem of libraries to tackle complex challenges in computational biology, covering topics like genomics, phylogenetics, and big data bioinformatics.

What this Book will help me do
• Understand the Python ecosystem specifically tailored for computational biology applications.
• Analyze and visualize next-generation sequencing data effectively.
• Explore and simulate population genetics for robust biological research.
• Utilize the Protein Data Bank to extract critical insights about proteins.
• Handle big genomics datasets with Python tools for large-scale bioinformatics studies.

Author(s)
Tiago Antao is an established bioinformatician with expertise in Python programming. With years of practical experience in computational biology, he has tailored this cookbook with detailed and actionable examples. Tiago's mission is to make bioinformatic techniques using Python accessible to researchers of varying skill levels.

Who is it for?
This book is ideal for researchers, biologists, and data scientists with intermediate Python skills looking to expand their expertise in bioinformatics. It caters to professionals wanting to utilize computational tools for solving biological problems. If you're involved in work or study related to genomics, phylogenetics, or large-scale biology datasets, this guide offers practical solutions. Make the most of Python in your research journey.
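As a flavor of the kind of task the cookbook addresses, here is a short sketch using Biopython (one library in the Python bioinformatics ecosystem it draws on) to summarize sequences in a FASTA file; the filename is illustrative.

```python
# Iterate over FASTA records and report length and GC content per sequence.

from Bio import SeqIO

for record in SeqIO.parse("reads.fasta", "fasta"):
    seq = str(record.seq).upper()
    gc = (seq.count("G") + seq.count("C")) / max(len(seq), 1)
    print(record.id, len(seq), f"GC={gc:.2%}")
```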

Implementing an IBM InfoSphere BigInsights Cluster using Linux on Power

This IBM® Redbooks® publication demonstrates and documents how to implement and manage an IBM PowerLinux™ cluster for big data, focusing on hardware management, operating system provisioning, application provisioning, cluster readiness checks, IBM InfoSphere® BigInsights™, IBM Platform Symphony®, IBM Spectrum™ Scale (formerly IBM GPFS™), application monitoring, and performance tuning. This publication shows that IBM PowerLinux clustering solutions (hardware and software) deliver significant value to clients that need cost-effective, highly scalable, and robust solutions for big data and analytics workloads. This book addresses how to use IBM Platform Cluster Manager to manage PowerLinux big data clusters through IBM InfoSphere BigInsights, Spectrum Scale, and Platform Symphony; how to set up and manage a big data cluster on PowerLinux servers to customize application and programming solutions; and how to tune applications to take advantage of IBM hardware architectures. It uses the architectural technologies and software solutions available from IBM to help solve challenging technical and business problems. This book is targeted at technical professionals (consultants, technical support staff, IT architects, and IT specialists) who are responsible for delivering cost-effective Linux on IBM Power Systems™ solutions that help uncover insights in clients' data so they can act to optimize business results, product development, and scientific discoveries.

Designing and Operating a Data Reservoir

Together, big data and analytics have tremendous potential to improve the way we use precious resources, to provide more personalized services, and to protect ourselves from unexpected and ill-intentioned activities. To fully use big data and analytics, an organization needs a system of insight. This is an ecosystem where individuals can locate and access data, and build visualizations and new analytical models that can be deployed into its IT systems to improve the organization's operations. The data that is most valuable for analytics is also valuable in its own right and typically contains personal and private information about key people in the organization, such as customers, employees, and suppliers. Although universal access to data is desirable, safeguards are necessary to protect people's privacy, prevent data leakage, and detect suspicious activity. The data reservoir is a reference architecture that balances the desire for easy access to data with information governance and security. The data reservoir reference architecture describes the technical capabilities necessary for a system of insight, while being independent of specific technologies. Being technology independent is important, because most organizations already have investments in data platforms that they want to incorporate in their solution. In addition, technology is continually improving, and the choice of technology is often dictated by the volume, variety, and velocity of the data being managed. A system of insight needs more than technology to succeed. The data reservoir reference architecture includes descriptions of the governance and management processes and definitions needed to ensure that the human and business systems around the technology support a collaborative, self-service, and safe environment for data use. The data reservoir reference architecture was first introduced in Governing and Managing Big Data for Analytics and Decision Makers, REDP-5120, which is available at: http://www.redbooks.ibm.com/redpieces/abstracts/redp5120.html. This IBM® Redbooks publication, Designing and Operating a Data Reservoir, builds on that material to provide more detail on the capabilities and internal workings of a data reservoir.

IBM Spectrum Scale (formerly GPFS)

This IBM® Redbooks® publication updates and complements the previous publication, Implementing the IBM General Parallel File System in a Cross Platform Environment, SG24-7844, with additional updates made since that version was released for IBM General Parallel File System (GPFS™). Since then, two releases have become available, up to the latest version, IBM Spectrum™ Scale 4.1. Topics such as what is new in Spectrum Scale, Spectrum Scale licensing updates (Express/Standard/Advanced), Spectrum Scale infrastructure support and updates, storage support (IBM and OEM), operating system and platform support, Spectrum Scale global sharing with Active File Management (AFM), and considerations for integrating Spectrum Scale into IBM Tivoli® Storage Manager (Spectrum Protect) backup solutions are discussed in this new IBM Redbooks publication. This publication also covers topics such as planning, usability, best practices, monitoring, and problem determination. The main goal of this publication is to bring you up to date with the latest features and capabilities of IBM Spectrum Scale, as the solution has become a key component of the reference architecture for clouds, analytics, mobile, social media, and much more. This publication targets technical professionals (consultants, technical support staff, IT architects, and IT specialists) responsible for delivering cost-effective cloud services and big data solutions on IBM Power Systems™, helping to uncover insights in clients' data so they can take actions to optimize business results, product development, and scientific discoveries.

Navigating the Health Data Ecosystem

Data-driven technologies are now being adopted, developed, funded, and deployed throughout the health care market at an unprecedented scale. But, as this O'Reilly report reveals, health care innovation contains more hurdles and requires more finesse than many tech startups expect. By paying attention to the lessons from the report's findings, innovation teams can better anticipate what they'll face, and plan accordingly. Simply put, teams looking to apply collective intelligence and "big data" platforms to health and health care problems often don't appreciate the messy details of using and making sense of data in the heavily regulated hospital IT environment. Download this report today and learn how it helps prepare startups in six areas:
• Complexity: An enormous domain with noisy data not designed for machine consumption
• Computing: Lack of standard, interoperable schema for documenting human health in a digital format
• Context: Lack of critical contextual metadata for interpreting health data
• Culture: Startup difficulties in hospital ecosystems: why innovation can be a two-edged sword
• Contracts: Navigating the IRB, HIPAA, and EULA frameworks
• Commerce: The problem of how digital health startups get paid
This report represents the initial findings of a study funded by a grant from the Robert Wood Johnson Foundation. Subsequent reports will explore the results of three deep-dive projects the team pursued during the study.

Statistical Learning with Sparsity

Discover New Methods for Dealing with High-Dimensional Data A sparse statistical model has only a small number of nonzero parameters or weights; therefore, it is much easier to estimate and interpret than a dense model. Statistical Learning with Sparsity: The Lasso and Generalizations presents methods that exploit sparsity to help recover the underlying signal in a set of data. Top experts in this rapidly evolving field, the authors describe the lasso for linear regression and a simple coordinate descent algorithm for its computation. They discuss the application of ℓ1 penalties to generalized linear models and support vector machines, cover generalized penalties such as the elastic net and group lasso, and review numerical methods for optimization. They also present statistical inference methods for fitted (lasso) models, including the bootstrap, Bayesian methods, and recently developed approaches. In addition, the book examines matrix decomposition, sparse multivariate analysis, graphical models, and compressed sensing. It concludes with a survey of theoretical results for the lasso. In this age of big data, the number of features measured on a person or object can be large and might be larger than the number of observations. This book shows how the sparsity assumption allows us to tackle these problems and extract useful and reproducible patterns from big datasets. Data analysts, computer scientists, and theorists will appreciate this thorough and up-to-date treatment of sparse statistical modeling.
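To see the sparsity effect the book is about, here is a hedged scikit-learn illustration on synthetic data: the lasso's L1 penalty drives most coefficients exactly to zero (scikit-learn's Lasso solver, like the algorithm the authors describe, uses coordinate descent). The data and penalty strength are invented for the demo.

```python
# Fit a lasso on data where only 3 of 20 features carry signal,
# then inspect which coefficients survive the L1 penalty.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
true_coef = np.zeros(20)
true_coef[:3] = [2.0, -1.5, 1.0]               # only 3 features matter
y = X @ true_coef + 0.1 * rng.normal(size=100)  # noisy linear response

model = Lasso(alpha=0.1).fit(X, y)
print(np.nonzero(model.coef_)[0])  # indices of the nonzero (selected) features
```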

Current State of Big Data Use in Retail Supply Chains

Innovation, consisting of the invention, adoption, and deployment of new technology and associated process improvements, is a key source of competitive advantage. Big Data is an innovation that has been gaining prominence in retailing and other industries. In fact, managers working in retail supply chain member firms (that is, retailers, manufacturers, distributors, wholesalers, logistics providers, and other service providers) have increasingly been trying to understand what Big Data entails, what it may be used for, and how to make it an integral part of their businesses. This report covers Big Data use, with a focus on applications for retail supply chains. The authors' findings suggest that Big Data use in retail supply chains is still generally elusive. Although most managers have reported initial, and in some cases significant, efforts in analyzing large sets of data for decision making, various challenges confine these data to a range of use centered on traditional, transactional data.

Big Data

Big Data teaches you to build big data systems using an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data. It describes a scalable, easy-to-understand approach to big data systems that can be built and run by a small team. Following a realistic example, this book guides readers through the theory of big data systems, how to implement them in practice, and how to deploy and operate them once they're built.

About the Technology
Web-scale applications like social networks, real-time analytics, or e-commerce sites deal with a lot of data, whose volume and velocity exceed the limits of traditional database systems. These applications require architectures built around clusters of machines to store and process data of any size, or speed. Fortunately, scale and simplicity are not mutually exclusive.

About the Book
Big Data teaches you to build big data systems using an architecture designed specifically to capture and analyze web-scale data. This book presents the Lambda Architecture, a scalable, easy-to-understand approach that can be built and run by a small team. You'll explore the theory of big data systems and how to implement them in practice. In addition to discovering a general framework for processing big data, you'll learn specific technologies like Hadoop, Storm, and NoSQL databases.

What's Inside
• Introduction to big data systems
• Real-time processing of web-scale data
• Tools like Hadoop, Cassandra, and Storm
• Extensions to traditional database skills

About the Reader
This book requires no previous exposure to large-scale data analysis or NoSQL tools. Familiarity with traditional databases is helpful.

About the Authors
Nathan Marz is the creator of Apache Storm and the originator of the Lambda Architecture for big data systems. James Warren is an analytics architect with a background in machine learning and scientific computing.

Quotes
"Transcends individual tools or platforms. Required reading for anyone working with big data systems." - Jonathan Esterhazy, Groupon
"A comprehensive, example-driven tour of the Lambda Architecture with its originator as your guide." - Mark Fisher, Pivotal
"Contains wisdom that can only be gathered after tackling many big data projects. A must-read." - Pere Ferrera Bertran, Datasalt
"The de facto guide to streamlining your data pipeline in batch and near-real time." - Alex Holmes, Author of "Hadoop in Practice"
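As a minimal sketch of the Lambda Architecture's core idea, assuming toy in-memory data, the following Python snippet merges a precomputed batch view with an incremental speed view at query time; the names are illustrative, not from the book.

```python
# Batch layer + speed layer + merged query, in miniature.

from collections import Counter

# Batch layer: periodically recompute a view from the immutable master dataset.
master_dataset = ["page_a", "page_b", "page_a", "page_c"]  # all events ever seen
batch_view = Counter(master_dataset)                        # precomputed counts

# Speed layer: index only events that arrived after the last batch run.
recent_events = ["page_a", "page_c"]
speed_view = Counter(recent_events)

def query(page: str) -> int:
    """Serving layer: answer queries by merging batch and real-time views."""
    return batch_view[page] + speed_view[page]

print(query("page_a"))  # 3 = 2 from the batch view + 1 from the speed view
```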

Hadoop Essentials

In 'Hadoop Essentials,' you'll embark on an engaging journey to master the Hadoop ecosystem. This book covers fundamental to advanced topics, from HDFS and MapReduce to real-time analytics with Spark, empowering you to handle modern data challenges efficiently.

What this Book will help me do
• Understand the core components of Hadoop, including HDFS, YARN, and MapReduce, for foundational knowledge.
• Learn to optimize Big Data architectures and improve application performance.
• Utilize tools like Hive and Pig for efficient data querying and processing.
• Master data ingestion technologies like Sqoop and Flume for seamless data management.
• Achieve fluency in real-time data analytics using modern tools like Apache Spark and Apache Storm.

Author(s)
Achari is a seasoned expert in Big Data and distributed systems with in-depth knowledge of the Hadoop ecosystem. With years of experience in both development and teaching, they craft content that bridges practical know-how with theoretical insights in a highly accessible style.

Who is it for?
This book is perfect for system and application developers aiming to learn practical applications of Hadoop. It suits professionals seeking solutions to real-world Big Data challenges, as well as those familiar with distributed systems basics who are looking to deepen their expertise in advanced data analysis.