talk-data.com

Topic

Data Quality

data_management data_cleansing data_validation

537 tagged

Activity Trend

82 peak/qtr, 2020-Q1 to 2026-Q1

Activities

537 activities · Newest first

Practical Data Science with Hadoop® and Spark: Designing and Building Effective Analytics at Scale

The Complete Guide to Data Science with Hadoop—For Technical Professionals, Businesspeople, and Students

Demand is soaring for professionals who can solve real data science problems with Hadoop and Spark. Practical Data Science with Hadoop® and Spark is your complete guide to doing just that. Drawing on immense experience with Hadoop and big data, three leading experts bring together everything you need: high-level concepts, deep-dive techniques, real-world use cases, practical applications, and hands-on tutorials. The authors introduce the essentials of data science and the modern Hadoop ecosystem, explaining how Hadoop and Spark have evolved into an effective platform for solving data science problems at scale. In addition to comprehensive application coverage, the authors also provide useful guidance on the important steps of data ingestion, data munging, and visualization. Once the groundwork is in place, the authors focus on specific applications, including machine learning, predictive modeling for sentiment analysis, clustering for document analysis, anomaly detection, and natural language processing (NLP). This guide provides a strong technical foundation for those who want to do practical data science, and also presents business-driven guidance on how to apply Hadoop and Spark to optimize the ROI of data science initiatives.

Learn:
What data science is, how it has evolved, and how to plan a data science career
How data volume, variety, and velocity shape data science use cases
Hadoop and its ecosystem, including HDFS, MapReduce, YARN, and Spark
Data importation with Hive and Spark
Data quality, preprocessing, preparation, and modeling
Visualization: surfacing insights from huge data sets
Machine learning: classification, regression, clustering, and anomaly detection
Algorithms and Hadoop tools for predictive modeling
Cluster analysis and similarity functions
Large-scale anomaly detection
NLP: applying data science to human language

SAS Data Analytic Development

Design quality SAS software and evaluate SAS software quality SAS Data Analytic Development is the developer’s compendium for writing better-performing software and the manager’s guide to building comprehensive software performance requirements. The text introduces and parallels the International Organization for Standardization (ISO) software product quality model, demonstrating 15 performance requirements that represent dimensions of software quality, including: reliability, recoverability, robustness, execution efficiency (i.e., speed), efficiency, scalability, portability, security, automation, maintainability, modularity, readability, testability, stability, and reusability. The text is intended to be read cover-to-cover or used as a reference tool to instruct, inspire, deliver, and evaluate software quality. A common fault in many software development environments is a focus on functional requirements—the what and how—to the detriment of performance requirements, which specify instead how well software should function (assessed through software execution) or how easily software should be maintained (assessed through code inspection). Without the definition and communication of performance requirements, developers risk either building software that lacks intended quality or wasting time delivering software that exceeds performance objectives—thus, either underperforming or gold-plating, both of which are undesirable. Managers, customers, and other decision makers should also understand the dimensions of software quality both to define performance requirements at project outset as well as to evaluate whether those objectives were met at software completion. As data analytic software, SAS transforms data into information and ultimately knowledge and data-driven decisions. 
Not surprisingly, data quality is a central focus and theme of SAS literature; however, code quality is far less commonly described and too often references only the speed or efficiency with which software should execute, omitting other critical dimensions of software quality. SAS® software project definitions and technical requirements often fall victim to this paradox, in which rigorous quality requirements exist for data and data products yet not for the software that undergirds them. By demonstrating the cost and benefits of software quality inclusion and the risk of software quality exclusion, stakeholders learn to value, prioritize, implement, and evaluate dimensions of software quality within risk management and project management frameworks of the software development life cycle (SDLC). Thus, SAS Data Analytic Development recalibrates business value, placing code quality on par with data quality, and performance requirements on par with functional requirements.

Making Sense of Stream Processing

How can event streams help make your application more scalable, reliable, and maintainable? In this report, O’Reilly author Martin Kleppmann shows you how stream processing can make your data storage and processing systems more flexible and less complex. Structuring data as a stream of events isn’t new, but with the advent of open source projects such as Apache Kafka and Apache Samza, stream processing is finally coming of age. Using several case studies, Kleppmann explains how these projects can help you reorient your database architecture around streams and materialized views. The benefits of this approach include better data quality, faster queries through precomputed caches, and real-time user interfaces. Learn how to open up your data for richer analysis and make your applications more scalable and robust in the face of failures. Understand stream processing fundamentals and their similarities to event sourcing, CQRS, and complex event processing Learn how logs can make search indexes and caches easier to maintain Explore the integration of databases with event streams, using the new Bottled Water open source tool Turn your database architecture inside out by orienting it around streams and materialized views
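The report’s central pattern, deriving a materialized view by replaying an append-only log of events, can be sketched in a few lines of plain Python. The event shapes and field names below are invented for illustration and are not taken from the report:

```python
# Maintain a materialized view (running totals per user) by replaying
# an append-only log of events, oldest first. The log is never mutated;
# the view can always be rebuilt by replaying from the start.
events = [
    {"type": "item_added",   "user": "alice", "price": 30},
    {"type": "item_added",   "user": "bob",   "price": 15},
    {"type": "item_added",   "user": "alice", "price": 10},
    {"type": "item_removed", "user": "alice", "price": 30},
]

def apply_event(view, event):
    """Fold one event into the view, returning a new view."""
    totals = dict(view)
    delta = event["price"] if event["type"] == "item_added" else -event["price"]
    totals[event["user"]] = totals.get(event["user"], 0) + delta
    return totals

view = {}
for event in events:
    view = apply_event(view, event)
```

Because the view is a pure function of the log, a precomputed cache like this can be dropped and rebuilt at any time, which is the property Kleppmann exploits for faster queries and simpler recovery.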

Data quality is often taken for granted. Many organizations fall into complacency with tools like Google Analytics, where tracking is installed but rarely optimized, configured, or scrutinized. As it turns out, this type of plug-and-play analytics can be detrimental to your measurement strategy. In this talk, Simo will share his experiences of working with vastly different organizations and methodologies for tag management, highlighting the format with which he's had the most success. He will also showcase how a basic setup of Google Analytics (or any other popular web analytics platform) is simply not enough, together with a case study or two of how to turn the limitations of these platforms to your advantage.

The loss of credibility and influence tied to delivering the wrong numbers to management is a pain and embarrassment most senior analysts have experienced. And as data moves into an ever more central position in the company, the demand for quality data grows. This session provides a practical roadmap to getting your data cleaned up once and helps you define a standard for your data quality.

SAP Data Services 4.x Cookbook

Dive into "SAP Data Services 4.x Cookbook" to master the SAP Data Services platform and learn how to efficiently prepare, implement, and optimize ETL processes. This comprehensive guide makes it easy for you to understand both fundamental and advanced techniques of this powerful tool.

What this Book will help me do
Develop a thorough understanding of SAP Data Services concepts and architecture.
Effectively set up and configure an ETL environment using SAP Data Services.
Master advanced ETL design techniques to process and manipulate data effectively.
Gain expertise in data cleansing, validation, and applying data quality methods.
Build real-time ETL workflows and integrate various data systems seamlessly.

Author(s)
Shomnikov is an experienced IT professional specializing in SAP Data Services and ETL processes. With years of practical experience, they bring a wealth of knowledge to help readers grasp concepts quickly and apply them effectively. They enjoy sharing practical solutions to complex problems in a clear and approachable manner.

Who is it for?
This book is ideal for IT professionals and engineers who are seeking to deepen their understanding of SAP Data Services. Readers should have a basic background in programming concepts and SQL to fully benefit from this book. It is particularly suited for professionals involved in ETL development and data quality management. By the end of the book, you will have a strong grasp of building reliable ETL workflows and managing data services efficiently.

Data Munging with Hadoop

The Example-Rich, Hands-On Guide to Data Munging with Apache Hadoop™

Data scientists spend much of their time “munging” data: handling day-to-day tasks such as data cleansing, normalization, aggregation, sampling, and transformation. These tasks are both critical and surprisingly interesting. Most important, they deepen your understanding of your data’s structure and limitations: crucial insight for improving accuracy and mitigating risk in any analytical project. Now, two leading Hortonworks data scientists, Ofer Mendelevitch and Casey Stella, bring together powerful, practical insights for effective Hadoop-based data munging of large datasets. Drawing on extensive experience with advanced analytics, the authors offer realistic examples that address the common issues you’re most likely to face. They describe each task in detail, presenting example code based on widely used tools such as Pig, Hive, and Spark. This concise, hands-on eBook is valuable for every data scientist, data engineer, and architect who wants to master data munging: not just in theory, but in practice with the field’s #1 platform, Hadoop.

Coverage includes:
A framework for understanding the various types of data quality checks, including cell-based rules, distribution validation, and outlier analysis
Assessing tradeoffs in common approaches to imputing missing values
Implementing quality checks with Pig or Hive UDFs
Transforming raw data into “feature matrix” format for machine learning algorithms
Choosing features and instances
Implementing text features via “bag-of-words” and NLP techniques
Handling time-series data via frequency- or time-domain methods
Manipulating feature values to prepare for modeling

Data Munging with Hadoop is part of a larger, forthcoming work entitled Data Science Using Hadoop. To be notified when the larger work is available, register your purchase of Data Munging with Hadoop at informit.com/register and check the box “I would like to hear from InformIT and its family of brands about products and special offers.”
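The book’s own examples use Pig, Hive, and Spark, but the flavor of the two kinds of checks named above, a cell-based rule and a distribution-based outlier check, can be sketched in plain Python. The field names, sample values, and two-sigma cutoff below are illustrative assumptions, not taken from the book:

```python
import statistics

rows = [
    {"id": 1, "age": 34}, {"id": 2, "age": -5},
    {"id": 3, "age": 29}, {"id": 4, "age": 31},
    {"id": 5, "age": 33}, {"id": 6, "age": 28},
    {"id": 7, "age": 210},
]

# Cell-based rule: each value is checked in isolation against a predicate
# (here: age must be non-negative).
cell_violations = [r["id"] for r in rows if r["age"] < 0]

# Distribution check: flag values more than two standard deviations from
# the mean of the rule-passing values (a crude outlier analysis).
ages = [r["age"] for r in rows if r["id"] not in cell_violations]
mu, sigma = statistics.mean(ages), statistics.pstdev(ages)
outliers = [r["id"] for r in rows
            if r["id"] not in cell_violations
            and abs(r["age"] - mu) > 2 * sigma]
```

The same two predicates translate directly into the Pig or Hive UDFs the book describes; only the execution engine changes.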

Building a Scalable Data Warehouse with Data Vault 2.0

The Data Vault was invented by Dan Linstedt at the U.S. Department of Defense, and the standard has been successfully applied to data warehousing projects at organizations of different sizes, from small businesses to large corporations. Due to its simplified design, which is adapted from nature, the Data Vault 2.0 standard helps prevent typical data warehousing failures. "Building a Scalable Data Warehouse" covers everything one needs to know to create a scalable data warehouse end to end, including a presentation of the Data Vault modeling technique, which provides the foundations to create a technical data warehouse layer. The book discusses how to build the data warehouse incrementally using the agile Data Vault 2.0 methodology. In addition, readers will learn how to create the input layer (the stage layer) and the presentation layer (data mart) of the Data Vault 2.0 architecture, including implementation best practices. Drawing upon years of practical experience and using numerous examples and an easy-to-understand framework, Dan Linstedt and Michael Olschimke discuss:
How to load each layer using SQL Server Integration Services (SSIS), including automation of the Data Vault loading processes
Important data warehouse technologies and practices
Data Quality Services (DQS) and Master Data Services (MDS) in the context of the Data Vault architecture

Provides a complete introduction to data warehousing, applications, and the business context so readers can get up and running fast
Explains theoretical concepts and provides hands-on instruction on how to build and implement a data warehouse
Demystifies Data Vault modeling with beginning, intermediate, and advanced techniques
Discusses the advantages of the Data Vault approach over other techniques, also including the latest updates to Data Vault 2.0 and multiple improvements to Data Vault 1.0

Microsoft SQL Server 2014 Unleashed

The industry’s most complete, useful, and up-to-date guide to SQL Server 2014. You’ll find start-to-finish coverage of SQL Server’s core database server and management capabilities: all the real-world information, tips, guidelines, and examples you’ll need to install, monitor, maintain, and optimize the most complex database environments. The provided examples and sample code provide plenty of hands-on opportunities to learn more about SQL Server and create your own viable solutions. Four leading SQL Server experts present deep practical insights for administering SQL Server, analyzing and optimizing queries, implementing data warehouses, ensuring high availability, tuning performance, and much more. You will benefit from their behind-the-scenes look into SQL Server, showing what goes on behind the various wizards and GUI-based tools. You’ll learn how to use the underlying SQL commands to fully unlock the power and capabilities of SQL Server. Writing for all intermediate-to-advanced-level SQL Server professionals, the authors draw on immense production experience with SQL Server. Throughout, they focus on successfully applying SQL Server 2014’s most powerful capabilities and its newest tools and features. 
Detailed information on how to…
Understand SQL Server 2014’s new features and each edition’s capabilities and licensing
Install, upgrade to, and configure SQL Server 2014 for better performance and easier management
Streamline and automate key administration tasks with Smart Admin
Leverage powerful new backup/restore options: flexible backup to URL, Managed Backup to Windows Azure, and encrypted backups
Strengthen security with new features for enforcing “least privilege”
Improve performance with updateable columnstore indexes, Delayed Durability, and other enhancements
Execute queries and business logic more efficiently with memory-optimized tables, buffer pool extension, and natively compiled stored procedures
Control workloads and disk I/O with the Resource Governor
Deploy AlwaysOn Availability Groups and Failover Cluster Instances to achieve enterprise-class availability and disaster recovery
Apply new Business Intelligence improvements in Master Data Services, data quality, and Parallel Data Warehouse

Sharing Data and Models in Software Engineering

Data Science for Software Engineering: Sharing Data and Models presents guidance and procedures for reusing data and models between projects to produce results that are useful and relevant. Starting with a background section of practical lessons and warnings for beginner data scientists in software engineering, this edited volume proceeds to identify critical questions of contemporary software engineering related to data and models. Learn how to adapt data from other organizations to local problems, mine privatized data, prune spurious information, simplify complex results, update models for new platforms, and more. Chapters share broadly applicable experimental results, discussed with a blend of practitioner-focused domain expertise and commentary that highlights the methods that are most useful and applicable to the widest range of projects. Each chapter is written by a prominent expert and offers a state-of-the-art solution to an identified problem facing data scientists in software engineering. Throughout, the editors share best practices collected from their experience training software engineering students and practitioners to master data science.
Shares the specific experience of leading researchers and techniques developed to handle data problems in the realm of software engineering
Explains how to start a data science for software engineering project, and how to identify and avoid likely pitfalls
Provides a wide range of useful qualitative and quantitative principles, ranging from very simple to cutting-edge research
Addresses current challenges with software engineering data, such as lack of local data, access issues due to data privacy, and increasing data quality via cleaning of spurious chunks in data

Designing and Conducting Survey Research: A Comprehensive Guide, 4th Edition

The industry standard guide, updated with new ideas and SPSS analysis techniques

Designing and Conducting Survey Research: A Comprehensive Guide, Fourth Edition is the industry standard resource that covers all major components of the survey process, updated to include new data analysis techniques and SPSS procedures with sample data sets online. The book offers practical, actionable guidance on constructing the instrument, administering the process, and analyzing and reporting the results, providing extensive examples and worksheets that demonstrate the appropriate use of survey and data techniques. By clarifying complex statistical concepts and modern analysis methods, this guide enables readers to conduct a survey research project from initial focus concept to the final report. Public and nonprofit managers with survey research responsibilities need to stay up-to-date on the latest methods, techniques, and best practices for optimal data collection, analysis, and reporting. Designing and Conducting Survey Research is a complete resource, answering the "what", "why", and "how" every step of the way, and providing the latest information about technological advancements in data analysis. The updated fourth edition contains step-by-step SPSS data entry and analysis procedures, as well as SPSS examples throughout the text, using real data sets from real-world studies. Other new information includes topics like:
Nonresponse error/bias
Ethical concerns and special populations
Cell phone samples in telephone surveys
Subsample screening and complex skip patterns

The fourth edition also contains new information on the growing importance of focus groups, and places a special emphasis on data quality, including size and variability. Those who employ survey research methods will find that Designing and Conducting Survey Research contains all the information needed to better design, conduct, and analyze a more effective survey.

Risk-Based Monitoring and Fraud Detection in Clinical Trials Using JMP and SAS

Improve efficiency while reducing costs in clinical trials with centralized monitoring techniques using JMP and SAS.

International guidelines recommend that clinical trial data should be actively reviewed or monitored; the well-being of trial participants and the validity and integrity of the final analysis results are at stake. Traditional interpretation of this guidance for pharmaceutical trials has led to extensive on-site monitoring, including 100% source data verification. On-site review is time consuming, expensive (estimated at up to a third of the cost of a clinical trial), prone to error, and limited in its ability to provide insight for data trends across time, patients, and clinical sites. In contrast, risk-based monitoring (RBM) makes use of central computerized review of clinical trial data and site metrics to determine if and when clinical sites should receive more extensive quality review or intervention.

Risk-Based Monitoring and Fraud Detection in Clinical Trials Using JMP and SAS presents a practical implementation of methodologies within JMP Clinical for the centralized monitoring of clinical trials. Focused on intermediate users, this book describes analyses for RBM that incorporate and extend the recommendations of TransCelerate BioPharma Inc., methods to detect potential patient- or investigator misconduct, snapshot comparisons to more easily identify new or modified data, and other novel visual and analytical techniques to enhance safety and quality reviews. Further discussion highlights recent regulatory guidance documents on risk-based approaches, addresses the requirements for CDISC data, and describes methods to supplement analyses with data captured external to the study database.

Given the interactive, dynamic, and graphical nature of JMP Clinical, any individual from the clinical trial team - including clinicians, statisticians, data managers, programmers, regulatory associates, and monitors - can make use of this book and the numerous examples contained within to streamline, accelerate, and enrich their reviews of clinical trial data.

The analytical methods described in Risk-Based Monitoring and Fraud Detection in Clinical Trials Using JMP and SAS enable the clinical trial team to take a proactive approach to data quality and safety to streamline clinical development activities and address shortcomings while the study is ongoing.

This book is part of the SAS Press program.
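The centralized-monitoring idea described above, computing simple site-level metrics and escalating review for sites that deviate from the study-wide norm, can be sketched independently of JMP and SAS. The metric (query rate per patient) and the three-times-median threshold below are invented for illustration; real RBM plans use richer, validated risk indicators:

```python
# Toy risk-based-monitoring triage: flag sites whose data-query rate is
# far above the study-wide median, marking them for closer review.
sites = {
    "site_a": {"queries": 4,  "patients": 20},
    "site_b": {"queries": 30, "patients": 22},
    "site_c": {"queries": 5,  "patients": 18},
    "site_d": {"queries": 6,  "patients": 21},
}

# Queries per enrolled patient at each site.
rates = {s: m["queries"] / m["patients"] for s, m in sites.items()}
median = sorted(rates.values())[len(rates) // 2]

# Sites exceeding three times the median rate get escalated review.
flagged = sorted(s for s, r in rates.items() if r > 3 * median)
```

In this toy study only site_b stands out, so on-site monitoring effort can be concentrated there instead of being spread evenly across all sites.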

Microsoft® SQL Server 2012 Unleashed

Buy the print version of Microsoft SQL Server 2012 Unleashed and get the eBook version for free! The eBook version includes chapters 44-60, which are not included in the print edition. See inside the book for the access code and details. With up-to-the-minute content, this is the industry’s most complete, useful guide to SQL Server 2012. You’ll find start-to-finish coverage of SQL Server’s core database server and management capabilities: all the real-world information, tips, guidelines, and samples you’ll need to create and manage complex database solutions. The additional online chapters add extensive coverage of SQL Server Integration Services, Reporting Services, Analysis Services, T-SQL programming, .NET Framework integration, and much more. Authored by four expert SQL Server administrators, designers, developers, architects, and consultants, this book reflects immense experience with SQL Server in production environments. Intended for intermediate-to-advanced-level SQL Server professionals, it focuses on the product’s most complex and powerful capabilities, and its newest tools and features.
Understand SQL Server 2012’s newest features, licensing changes, and capabilities of each edition
Manage SQL Server 2012 more effectively with SQL Server Management Studio, the SQLCMD command-line query tool, and PowerShell
Use Policy-Based Management to centrally configure and operate SQL Server
Utilize the new Extended Events trace capabilities within SSMS
Maximize performance by optimizing design, queries, analysis, and workload management
Implement new best practices for SQL Server high availability
Deploy AlwaysOn Availability Groups and Failover Cluster Instances to achieve enterprise-class availability and disaster recovery
Leverage new business intelligence improvements, including Master Data Services, Data Quality Services, and Parallel Data Warehouse
Deliver better full-text search with SQL Server 2012’s new Semantic Search
Improve reporting with new SQL Server 2012 Reporting Services features

Download sample databases and code examples from informit.com/title/9780672336928.

Using OpenRefine

Using OpenRefine provides a comprehensive guide to managing and cleaning large datasets efficiently. By following a practical, recipe-based approach, this book ensures readers can quickly master OpenRefine's features to enhance their data handling skills. Whether dealing with transformations, entity recognition, or dataset linking, you'll gain the tools to make your data work for you.

What this Book will help me do
Import and structure various formats of data for seamless processing.
Apply both basic and advanced transformations to optimize data quality.
Utilize regular expressions for sophisticated filtering and partitioning.
Perform named-entity extraction and advanced reconciliation tasks.
Master the General Refine Expression Language for powerful data operations.

Author(s)
The author is an experienced data analyst and educator, specializing in data preparation and transformation for real-world applications. Their approach combines a thorough technical understanding with an accessible teaching style, ensuring that complex concepts are easy to grasp.

Who is it for?
This book is crafted for anyone working with large datasets, from novices learning to handle and clean data to experienced practitioners seeking advanced techniques. If you aim to improve your data management skills or deliver quality insights from messy data, this book is for you.
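For a flavor of the clean-up OpenRefine automates, here is a Python sketch of its well-known “fingerprint” key-collision method for clustering near-duplicate values. This is a simplified version; OpenRefine’s real implementation also normalizes accents and controls for other edge cases:

```python
import string

def fingerprint(value):
    """Simplified fingerprint keying: lowercase, strip punctuation,
    then sort the unique tokens and rejoin them."""
    cleaned = value.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))

names = ["Acme, Inc.", "acme inc", "INC ACME", "Acme Incorporated"]

# Values that collide on the same fingerprint key are candidate
# duplicates that can be merged to one canonical spelling.
clusters = {}
for name in names:
    clusters.setdefault(fingerprint(name), []).append(name)
```

Here the first three variants collide on the key "acme inc" and can be merged, while "Acme Incorporated" keeps its own cluster and needs human judgment.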

Designing and Conducting Business Surveys

Designing and Conducting Business Surveys provides a coherent overview of the business survey process, from start to finish. It uniquely integrates an understanding of how businesses operate, a total survey error approach to data quality that focuses specifically on business surveys, and sound project management principles. The book brings together what is currently known about planning, designing, and conducting business surveys, with producing and disseminating statistics or other research results from the collected data. This knowledge draws upon a variety of disciplines such as survey methodology, organizational sciences, sociology, psychology, and statistical methods. The contents of the book formulate a comprehensive guide to scholarly material previously dispersed among books, journal articles, and conference papers. This book provides guidelines that will help the reader make educated trade-off decisions that minimize survey errors, costs, and response burden, while being attentive to survey data quality. Major topics include:
Determining the survey content, considering user needs, the business context, and total survey quality
Planning the survey as a project
Sampling frames, procedures, and methods
Questionnaire design and testing for self-administered paper, web, and mixed-mode surveys
Survey communication design to obtain responses and facilitate the business response process
Conducting and managing the survey using paradata and project management tools
Data processing, including capture, editing, and imputation, and dissemination of statistical outputs

Designing and Conducting Business Surveys is an indispensable resource for anyone involved in designing and/or conducting business or organizational surveys at statistical institutes, central banks, survey organizations, etc.; producing statistics or other research results from business surveys at universities, research organizations, etc.; or using data produced from business surveys.
The book also lays a foundation for new areas of research in business surveys.

IBM Information Server: Integration and Governance for Emerging Data Warehouse Demands

This IBM® Redbooks® publication is intended for business leaders and IT architects who are responsible for building and extending their data warehouse and Business Intelligence infrastructure. It provides an overview of powerful new capabilities of Information Server in the areas of big data, statistical models, data governance and data quality. The book also provides key technical details that IT professionals can use in solution planning, design, and implementation.

Oracle GoldenGate 11g Handbook

Master Oracle GoldenGate 11g

Enable highly available, real-time access to enterprise data in heterogeneous environments. Featuring hands-on workshops, Oracle GoldenGate 11g Handbook shows you how to install, configure, and implement this high-performance application. You’ll learn how to replicate data across Oracle databases and other platforms, including MySQL and Microsoft SQL Server, and perform near-zero-downtime migrations and upgrades. Monitoring, performance tuning, and troubleshooting are also discussed in this Oracle Press guide.
Install and configure Oracle GoldenGate
Implement Oracle GoldenGate one-way replication
Configure multitarget and cascading replication
Use bidirectional replication to build a heterogeneous database infrastructure
Secure your environment, control and manipulate data, and prevent errors
Configure Oracle GoldenGate for Oracle Clusterware and Oracle Real Application Clusters
Use Oracle GoldenGate with MySQL and Microsoft SQL Server
Perform near-zero-downtime upgrades and migrations
Use Oracle GoldenGate Monitor and Oracle GoldenGate Director
Ensure data quality with Oracle GoldenGate Veridata
Implement nondatabase integration options

Adobe Analytics with SiteCatalyst Classroom in a Book

In digital marketing, your goal is to funnel your potential customers from the point of making them aware of your website, through engagement and conversion, and ultimately retaining them as loyal customers. Your strategies must be based on careful analysis so you know what is working for you at each stage. Adobe Analytics with SiteCatalyst Classroom in a Book teaches effective techniques for using Adobe SiteCatalyst to establish and measure key performance indicators (KPIs) tailored to your business and website. For each phase of marketing funnel analytics, author Vidya Subramanian walks you through multiple reports, showing you how to interpret the data and highlighting implementation details that affect data quality. With this essential guide, you’ll learn to optimize your web analytics results with SiteCatalyst. Adobe Analytics with SiteCatalyst Classroom in a Book contains 10 lessons. The book covers the basics of learning Adobe SiteCatalyst and provides countless tips and techniques to help you become more productive with the program. You can follow the book from start to finish or choose only those lessons that interest you. Classroom in a Book®, the best-selling series of hands-on software training workbooks, helps you learn the features of Adobe software quickly and easily. Classroom in a Book offers what no other book or training program does—an official training series from Adobe Systems Incorporated, developed with the support of Adobe product experts.

Training Kit (Exam 70-463): Implementing a Data Warehouse with Microsoft SQL Server 2012

Ace your preparation for Microsoft® Certification Exam 70-463 with this 2-in-1 Training Kit from Microsoft Press®. Work at your own pace through a series of lessons and practical exercises, and then assess your skills with online practice tests—featuring multiple, customizable testing options. Maximize your performance on the exam by learning how to:
Design and implement a data warehouse
Develop and enhance SQL Server Integration Services packages
Manage and maintain SQL Server Integration Services packages
Build data quality solutions
Implement custom code in SQL Server Integration Services packages

Data Clean-Up and Management

Data use in the library has specific characteristics and common problems. Data Clean-up and Management addresses these, and provides methods to clean up frequently occurring data problems using readily available applications. The authors highlight the importance and methods of data analysis and presentation, and offer guidelines and recommendations for a data quality policy. The book gives step-by-step how-to directions for common dirty data issues.
Focused on libraries and practicing librarians
Deals with practical, real-life issues and addresses common problems that all libraries face
Offers cradle-to-grave treatment for preparing and using data, including download, clean-up, management, analysis, and presentation
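A typical dirty-data fix of the kind the book walks through (stray whitespace, inconsistent casing, and duplicate records in an exported list) can be sketched in a few lines of Python. The column names and records are invented for illustration:

```python
# Clean a small exported record list: trim whitespace, normalize title
# casing, and drop duplicate records sharing the same ISBN.
raw = [
    {"title": "  The Hobbit ", "isbn": "9780547928227"},
    {"title": "the hobbit",    "isbn": "9780547928227"},  # duplicate
    {"title": "Dune",          "isbn": "9780441013593"},
]

seen, cleaned = set(), []
for rec in raw:
    title = " ".join(rec["title"].split()).title()  # trim + normalize case
    if rec["isbn"] not in seen:                     # dedupe on ISBN
        seen.add(rec["isbn"])
        cleaned.append({"title": title, "isbn": rec["isbn"]})
```

The same trim, normalize, and deduplicate steps map directly onto spreadsheet or OpenRefine operations for librarians who prefer not to script.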