talk-data.com

Topic: data-wrangling-preparation-cleaning (9 tagged)

Activity Trend: 2020-Q1 to 2026-Q1, 1 peak/qtr

Activities

Filtering by: O'Reilly Data Science Books
Data Wrangling

Written and edited by some of the world’s top experts in the field, this exciting new volume provides state-of-the-art research and the latest technological breakthroughs in data wrangling, its theoretical concepts, practical applications, and tools for solving everyday problems. Data wrangling is the process of cleaning and unifying messy and complex data sets for easy access and analysis. This process typically includes manually converting and mapping data from one raw form into another format to allow for more convenient consumption and organization of the data. Data wrangling is increasingly ubiquitous at today’s top firms. Data cleaning focuses on removing inaccurate data from your data set, whereas data wrangling focuses on transforming the data’s format, typically by converting “raw” data into another format more suitable for use. Data wrangling is a necessary component of any business. Data wrangling solutions are specifically designed and architected to handle diverse, complex data at any scale, with many applications available, such as Datameer, Infogix, Paxata, Talend, Tamr, TMMData, and Trifacta. This book synthesizes the processes of data wrangling into a comprehensive overview, with a strong focus on recent and rapidly evolving agile analytic processes in data-driven enterprises, so that businesses and other enterprises can find solutions for their everyday problems and practical applications. Whether for the veteran engineer, scientist, or other industry professional, this book is a must-have for any library.
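The distinction the blurb draws between cleaning (removing inaccurate records) and wrangling (reshaping raw data into a more convenient format) can be illustrated with a minimal pandas sketch; the column names and the wide-to-long reshape below are illustrative assumptions, not examples from the book.

```python
import pandas as pd

# Hypothetical "raw" export: one row per store, one column per month (wide format).
raw = pd.DataFrame({
    "store": ["A", "B", "C"],
    "2023-01": [120, 95, -4],      # -4 is an obviously inaccurate reading
    "2023-02": [130, None, 88],
})

# Wrangling: reshape the wide "raw" layout into a long format that is easier to analyze.
long = raw.melt(id_vars="store", var_name="month", value_name="sales")

# Cleaning: drop values that are clearly inaccurate (negative sales), keeping genuine gaps.
clean = long[long["sales"].isna() | (long["sales"] >= 0)]
print(clean)
```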

The Data Detective's Toolkit

Reduce the cost and time of cleaning, managing, and preparing research data while also improving data quality! Have you ever wished there was an easy way to reduce your workload and improve the quality of your data? The Data Detective’s Toolkit: Cutting-Edge Techniques and SAS Macros to Clean, Prepare, and Manage Data will help you automate many of the labor-intensive tasks needed to turn raw data into high-quality, analysis-ready data. You will find the right tools and techniques in this book to reduce the amount of time needed to clean, edit, validate, and document your data. These tools include SAS macros as well as ingenious ways of using SAS procedures and functions. The innovative logic built into the book’s macro programs enables you to monitor the quality of your data using information from the formats and labels created for the variables in your data set. The book explains how to harmonize data sets that need to be combined and how to automate data cleaning tasks to detect errors in data, including out-of-range values, inconsistent flow through skip paths, missing data, no variation in values for a variable, and duplicates. By the end of this book, you will be able to automatically produce codebooks, crosswalks, and data catalogs.
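The book’s tooling is SAS, but the kinds of checks it automates (out-of-range values, missing data, variables with no variation, and duplicates) can be sketched in Python for illustration; the column names and allowed ranges here are assumptions, not the book’s macros.

```python
import pandas as pd

def basic_data_checks(df, ranges):
    """Report common data problems: out-of-range values, missing data,
    constant (no-variation) columns, and duplicate rows."""
    report = {}
    for col, (lo, hi) in ranges.items():
        report[f"{col}_out_of_range"] = int(((df[col] < lo) | (df[col] > hi)).sum())
    report["missing_per_column"] = df.isna().sum().to_dict()
    report["constant_columns"] = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]
    report["duplicate_rows"] = int(df.duplicated().sum())
    return report

# Hypothetical survey extract with a plausible-range check on age.
survey = pd.DataFrame({"id": [1, 2, 2, 4], "age": [34, 210, 210, None], "wave": [1, 1, 1, 1]})
print(basic_data_checks(survey, ranges={"age": (0, 120)}))
```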

Practical Synthetic Data Generation

Building and testing machine learning models requires access to large and diverse data. But where can you find usable datasets without running into privacy issues? This practical book introduces techniques for generating synthetic data—fake data generated from real data—so you can perform secondary analysis to do research, understand customer behaviors, develop new products, or generate new revenue. Data scientists will learn how synthetic data generation provides a way to make such data broadly available for secondary purposes while addressing many privacy concerns. Analysts will learn the principles and steps for generating synthetic data from real datasets. And business leaders will see how synthetic data can help accelerate time to a product or solution.

This book describes:
Steps for generating synthetic data using multivariate normal distributions
Methods for distribution fitting covering different goodness-of-fit metrics
How to replicate the simple structure of original data
An approach for modeling data structure to consider complex relationships
Multiple approaches and metrics you can use to assess data utility
How analysis performed on real data can be replicated with synthetic data
Privacy implications of synthetic data and methods to assess identity disclosure
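One of the listed steps, generating synthetic data from a fitted multivariate normal distribution, can be sketched with NumPy; the toy dataset, its dimensions, and the utility check below are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "real" data: 500 records with two correlated numeric attributes.
real = rng.multivariate_normal(mean=[50, 30], cov=[[25, 12], [12, 16]], size=500)

# Fit a multivariate normal: estimate the mean vector and covariance matrix.
mu = real.mean(axis=0)
sigma = np.cov(real, rowvar=False)

# Draw synthetic records from the fitted distribution.
synthetic = rng.multivariate_normal(mean=mu, cov=sigma, size=500)

# A quick utility check: compare the correlation structure of real vs. synthetic data.
print(np.corrcoef(real, rowvar=False)[0, 1], np.corrcoef(synthetic, rowvar=False)[0, 1])
```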

Nonlinear Digital Filtering with Python

This book discusses important structural filter classes, including the median filter and a number of its extensions (e.g., weighted and recursive median filters), and Volterra filters based on polynomial nonlinearities. Drawing on results from algebra and the theory of functional equations to construct and characterize behaviorally defined nonlinear filter classes, the text first introduces Python programming and then proposes practical, bottom-up strategies for designing more complex and capable filters from simpler components in a way that preserves the key properties of those components.
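As a small illustration of the structural filters discussed, here is a standard moving-median filter applied to a noisy signal using SciPy; this is a generic sketch under assumed test data, not code from the book.

```python
import numpy as np
from scipy.signal import medfilt

rng = np.random.default_rng(0)

# A slowly varying signal corrupted by impulsive ("salt and pepper") noise.
t = np.linspace(0, 1, 200)
clean = np.sin(2 * np.pi * 3 * t)
noisy = clean.copy()
spikes = rng.choice(len(t), size=10, replace=False)
noisy[spikes] += rng.choice([-3, 3], size=10)

# Window-5 median filter: unlike a linear smoother, it removes isolated
# impulses while largely preserving the shape of the underlying signal.
filtered = medfilt(noisy, kernel_size=5)
print(np.abs(noisy - clean).max(), np.abs(filtered - clean).max())
```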

Data Wrangling with Python

How do you take your data analysis skills beyond Excel to the next level? By learning just enough Python to get stuff done. This hands-on guide shows non-programmers like you how to process information that’s initially too messy or difficult to access. You don’t need to know a thing about the Python programming language to get started. Through various step-by-step exercises, you’ll learn how to acquire, clean, analyze, and present data efficiently. You’ll also discover how to automate your data process, schedule file-editing and clean-up tasks, process larger datasets, and create compelling stories with data you obtain.

Quickly learn basic Python syntax, data types, and language concepts
Work with both machine-readable and human-consumable data
Scrape websites and APIs to find a bounty of useful information
Clean and format data to eliminate duplicates and errors in your datasets
Learn when to standardize data and when to test and script data cleanup
Explore and analyze your datasets with new Python libraries and techniques
Use Python solutions to automate your entire data-wrangling process
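A minimal pandas sketch of the kind of cleanup the book covers, such as standardizing values and eliminating duplicates; the column names and sample values are illustrative assumptions rather than the book’s own exercises.

```python
import pandas as pd

# Hypothetical messy extract: inconsistent casing, stray whitespace, duplicate rows.
df = pd.DataFrame({
    "city": [" New York", "new york", "Chicago ", "Chicago "],
    "visits": [12, 12, 7, 7],
})

# Standardize string values so logically identical rows compare equal.
df["city"] = df["city"].str.strip().str.title()

# Eliminate the exact duplicates left over after standardization.
df = df.drop_duplicates().reset_index(drop=True)
print(df)
```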

Data Preparation in the Big Data Era

Preparing and cleaning data is notoriously expensive, prone to error, and time consuming: the process accounts for roughly 80% of the total time spent on analysis. As this O’Reilly report points out, enterprises have already invested billions of dollars in big data analytics, so there’s great incentive to modernize methods for cleaning, combining, and transforming data. Author Federico Castanedo, Chief Data Scientist at WiseAthena.com, details best practices for reducing the time it takes to convert raw data into actionable insights. With these tools and techniques in mind, your organization will be well positioned to translate big data into big decisions.

Explore the problems organizations face today with traditional prep and integration
Define the business questions you want to address before selecting, prepping, and analyzing data
Learn new methods for preparing raw data, including date-time and string data
Understand how some cleaning actions (like replacing missing values) affect your analysis
Examine data curation products: modern approaches that scale
Consider your business audience when choosing ways to deliver your analysis
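The report’s points about preparing date-time data and about how replacing missing values affects an analysis can be illustrated with a short pandas sketch; the sample records and the choice of mean imputation are assumptions for illustration, not the report’s recommendations.

```python
import pandas as pd

# Hypothetical raw records with string dates and a gap in the measurements.
raw = pd.DataFrame({
    "timestamp": ["2024-01-03 09:00", "2024-01-03 10:00", "2024-01-03 11:00"],
    "reading": [10.0, None, 30.0],
})

# Date-time preparation: parse strings into proper datetime values.
raw["timestamp"] = pd.to_datetime(raw["timestamp"])

# Replacing missing values changes downstream results: mean imputation keeps
# the average at 20, whereas filling with 0 pulls it down to about 13.3.
mean_imputed = raw["reading"].fillna(raw["reading"].mean())
zero_filled = raw["reading"].fillna(0)
print(raw["reading"].mean(), mean_imputed.mean(), zero_filled.mean())
```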

Using OpenRefine

Using OpenRefine provides a comprehensive guide to managing and cleaning large datasets efficiently. By following a practical, recipe-based approach, this book ensures readers can quickly master OpenRefine's features to enhance their data handling skills. Whether dealing with transformations, entity recognition, or dataset linking, you'll gain the tools to make your data work for you.

What this book will help me do
Import and structure various formats of data for seamless processing.
Apply both basic and advanced transformations to optimize data quality.
Utilize regular expressions for sophisticated filtering and partitioning.
Perform named-entity extraction and advanced reconciliation tasks.
Master the General Refine Expression Language for powerful data operations.

Author(s)
The author is an experienced data analyst and educator, specializing in data preparation and transformation for real-world applications. Their approach combines a thorough technical understanding with an accessible teaching style, ensuring that complex concepts are easy to grasp.

Who is it for?
This book is crafted for anyone working with large datasets, from novices learning to handle and clean data to experienced practitioners seeking advanced techniques. If you aim to improve your data management skills or deliver quality insights from messy data, this book is for you.
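OpenRefine itself expresses transformations in GREL, but the regular-expression filtering and partitioning the blurb mentions can be sketched in Python to give a rough sense of the idea; the pattern and sample catalogue values are assumptions, not recipes from the book.

```python
import re

# Hypothetical messy "publication year" column pulled from a catalogue export.
values = ["1998", "c. 2004", "[2011]", "unknown", "2015-2016"]

# Partition cells by whether they contain a plausible four-digit year,
# roughly what a regex-based facet or filter in OpenRefine would do.
year_pattern = re.compile(r"\b(1[89]|20)\d{2}\b")
with_year = [v for v in values if year_pattern.search(v)]
without_year = [v for v in values if not year_pattern.search(v)]
print(with_year, without_year)
```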

Data Clean-Up and Management

Data use in the library has specific characteristics and common problems. Data Clean-up and Management addresses these, and provides methods to clean up frequently-occurring data problems using readily-available applications. The authors highlight the importance and methods of data analysis and presentation, and offer guidelines and recommendations for a data quality policy. The book gives step-by-step how-to directions for common dirty data issues.

Focused towards libraries and practicing librarians
Deals with practical, real-life issues and addresses common problems that all libraries face
Offers cradle-to-grave treatment for preparing and using data, including download, clean-up, management, analysis and presentation
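A tiny Python sketch of one common dirty-data fix of the kind the book walks through for library data, normalizing inconsistent entries after download so they aggregate correctly; the fields and values are hypothetical, and the book itself uses readily-available desktop applications rather than code.

```python
import pandas as pd

# Hypothetical circulation export with inconsistent branch names and stray whitespace.
records = pd.DataFrame({
    "branch": ["Main ", "main", "MAIN", "Westside"],
    "checkouts": [120, 45, 30, 80],
})

# Clean-up: trim whitespace and normalize case so the same branch groups together.
records["branch"] = records["branch"].str.strip().str.title()

# Analysis and presentation: totals per branch after clean-up.
print(records.groupby("branch")["checkouts"].sum())
```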