Data Quality and Record Linkage Techniques (eBook)
XIV, 234 pages
Springer New York (publisher)
978-0-387-69505-1 (ISBN)
This book offers a practical understanding of the issues involved in improving data quality through editing, imputation, and record linkage. The first part deals with methods and models, focusing on the Fellegi–Holt edit-imputation model, the Little–Rubin multiple-imputation scheme, and the Fellegi–Sunter record linkage model. The second part presents case studies in which these techniques are applied in a variety of areas, including mortgage guarantee insurance, medicine, biomedicine, highway safety, and social insurance, as well as the construction of list frames and administrative lists. Throughout, the book offers a mixture of practical advice, mathematical rigor, management insight, and philosophy.
Preface 5
Contents 7
About the Authors 12
1 Introduction 13
1.1. Audience and Objective 13
1.2. Scope 13
1.3. Structure 14
Part 1 Data Quality: What It Is, Why It Is Important, and How to Achieve It 17
2 What Is Data Quality and Why Should We Care? 19
2.1. When Are Data of High Quality? 19
2.2. Why Care About Data Quality? 22
2.3. How Do You Obtain High-Quality Data? 23
2.4. Practical Tips 25
2.5. Where Are We Now? 25
3 Examples of Entities Using Data to Their Advantage/Disadvantage 29
3.1. Data Quality as a Competitive Advantage 29
3.2. Data Quality Problems and Their Consequences 32
3.3. How Many People Really Live to 100 and Beyond? Views from the United States, Canada, and the United Kingdom 37
3.4. Disabled Airplane Pilots – A Successful Application of Record Linkage 38
3.5. Completeness and Accuracy of a Billing Database: Why It Is Important to the Bottom Line 38
3.6. Where Are We Now? 39
4 Properties of Data Quality and Metrics for Measuring It 41
4.1. Desirable Properties of Databases/Lists 41
4.2. Examples of Merging Two or More Lists and the Issues that May Arise 43
4.3. Metrics Used When Merging Lists 45
4.4. Where Are We Now? 47
5 Basic Data Quality Tools 49
5.1. Data Elements 49
5.2. Requirements Document 50
5.3. A Dictionary of Tests 51
5.4. Deterministic Tests 52
5.5. Probabilistic Tests 56
5.6. Exploratory Data Analysis Techniques 56
5.7. Minimizing Processing Errors 58
5.8. Practical Tips 58
5.9. Where Are We Now? 60
Part 2 Specialized Tools for Database Improvement 62
6 Mathematical Preliminaries for Specialized Data Quality Techniques 63
6.1. Conditional Independence 63
6.2. Statistical Paradigms 65
6.3. Capture–Recapture Procedures and Applications 66
7 Automatic Editing and Imputation of Sample Survey Data 73
7.1. Introduction 73
7.2. Early Editing Efforts 75
7.3. Fellegi–Holt Model for Editing 76
7.4. Practical Tips 77
7.5. Imputation 78
7.6. Constructing a Unified Edit/Imputation Model 83
7.7. Implicit Edits – A Key Construct of Editing Software 85
7.8. Editing Software 87
7.9. Is Automatic Editing Taking Up Too Much Time and Money? 90
7.10. Selective Editing 91
7.11. Tips on Automatic Editing and Imputation 91
7.12. Where Are We Now? 92
8 Record Linkage – Methodology 93
8.1. Introduction 93
8.2. Why Did Analysts Begin Linking Records? 94
8.3. Deterministic Record Linkage 94
8.4. Probabilistic Record Linkage – A Frequentist Perspective 95
8.5. Probabilistic Record Linkage – A Bayesian Perspective 103
8.6. Where Are We Now? 104
9 Estimating the Parameters of the Fellegi–Sunter Record Linkage Model 105
9.1. Basic Estimation of Parameters Under Simple Agreement/Disagreement Patterns 105
9.2. Parameter Estimates Obtained via Frequency-Based Matching 106
9.3. Parameter Estimates Obtained Using Data from Current Files 108
9.4. Parameter Estimates Obtained via the EM Algorithm 109
9.5. Advantages and Disadvantages of Using the EM Algorithm to Estimate m- and u-probabilities 113
9.6. General Parameter Estimation Using the EM Algorithm 115
9.7. Where Are We Now? 118
10 Standardization and Parsing 119
10.1. Obtaining and Understanding Computer Files 121
10.2. Standardization of Terms 122
10.3. Parsing of Fields 123
10.4. Where Are We Now? 126
11 Phonetic Coding Systems for Names 127
11.1. Soundex System of Names 127
11.2. New York State Identification and Intelligence System (NYSIIS) Phonetic Decoder 131
11.3. Where Are We Now? 133
12 Blocking 135
12.1. Independence of Blocking Strategies 136
12.2. Blocking Variables 137
12.3. Using Blocking Strategies to Identify Duplicate List Entries 138
12.4. Using Blocking Strategies to Match Records Between Two Sample Surveys 140
12.5. Estimating the Number of Matches Missed 142
12.6. Where Are We Now? 142
13 String Comparator Metrics for Typographical Error 143
13.1. Jaro String Comparator Metric for Typographical Error 143
13.2. Adjusting the Matching Weight for the Jaro String Comparator 145
13.3. Winkler String Comparator Metric for Typographical Error 145
13.4. Adjusting the Weights for the Winkler Comparator Metric 146
13.5. Where Are We Now? 147
Part 3 Record Linkage Case Studies 149
Introduction to Part Three 149
14 Duplicate FHA Single-Family Mortgage Records 151
14.1. Introduction 151
14.2. FHA Case Numbers on Single-Family Mortgages 153
14.3. Duplicate Mortgage Records 153
14.4. Mortgage Records with an Incorrect Termination Status 157
14.5. Estimating the Number of Duplicate Mortgage Records 160
15 Record Linkage Case Studies in the Medical, Biomedical, and Highway Safety Areas 163
15.1. Biomedical and Genetic Research Studies 163
15.2. Who Goes to a Chiropractor? 165
15.3. National Master Patient Index 166
15.4. Provider Access to Immunization Register Securely (PAiRS) System 167
15.5. Studies Required by the Intermodal Surface Transportation Efficiency Act of 1991 168
15.6. Crash Outcome Data Evaluation System 169
16 Constructing List Frames and Administrative Lists 171
16.1. National Address Register of Residences in Canada 172
16.2. USDA List Frame of Farms in the United States 174
16.3. List Frame Development for the US Census of Agriculture 177
16.4. Post-enumeration Studies of US Decennial Census 178
17 Social Security and Related Topics 181
17.1. Hidden Multiple Issuance of Social Security Numbers 181
17.2. How Social Security Stops Benefit Payments after Death 185
17.3. CPS–IRS–SSA Exact Match File 187
17.4. Record Linkage and Terrorism 189
Part 4 Other Topics 191
18 Confidentiality: Maximizing Access to Microdata while Protecting Privacy 193
18.1. Importance of High Quality of Data in the Original File 194
18.2. Documenting Public-use Files 195
18.3. Checking Re-identifiability 195
18.4. Elementary Masking Methods and Statistical Agencies 198
18.5. Protecting Confidentiality of Medical Data 205
18.6. More-Advanced Masking Methods – Synthetic Datasets 207
18.7. Where Are We Now? 210
19 Review of Record Linkage Software 213
19.1. Government 213
19.2. Commercial 214
19.3. Checklist for Evaluating Record Linkage Software 215
20 Summary Chapter 221
Bibliography 223
Index 233
7 Automatic Editing and Imputation of Sample Survey Data (p. 61)
7.1. Introduction
As discussed in Chapter 3, missing and contradictory data are endemic in computer databases. In Chapter 5, we described a number of basic data editing techniques that can be used to improve the quality of statistical data systems. By an edit we mean a set of values for a specified combination of data elements within a database that are jointly unacceptable (or, equivalently, jointly acceptable). Certainly, we can use edits of the types described in Chapter 5.
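To make this definition concrete, the following sketch (our illustration, not code from the book; the field names and rules are hypothetical) expresses each edit as a predicate that flags a jointly unacceptable combination of values:

```python
# A minimal sketch of edits as predicates over one record; the fields
# and rules here are hypothetical illustrations, not from the book.

def edit_age_marital(record):
    """Jointly unacceptable: age under 15 together with marital status
    'married', even though each value is acceptable on its own."""
    return record["age"] < 15 and record["marital_status"] == "married"

def edit_hours_status(record):
    """Jointly unacceptable: positive hours worked together with
    employment status 'unemployed'."""
    return record["hours_worked"] > 0 and record["status"] == "unemployed"

EDITS = [edit_age_marital, edit_hours_status]

def failed_edits(record):
    """Names of the edits that the record fails."""
    return [e.__name__ for e in EDITS if e(record)]

print(failed_edits({"age": 12, "marital_status": "married",
                    "hours_worked": 0, "status": "unemployed"}))
# -> ['edit_age_marital']
```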
In this chapter, we discuss automated procedures for editing (i.e., cleaning up inconsistent entries) and imputing (i.e., filling in missing values) in databases constructed from data obtained from respondents in sample surveys or censuses. To accomplish this task, we need efficient ways of developing statistical edit/imputation systems that minimize development time, eliminate most errors in code development, and greatly reduce the need for human intervention.
In particular, we would like to drastically reduce, or eliminate entirely, the need for humans to change or correct data. The goal is to improve survey data so that they can be used for their intended analytic purposes; one such important purpose is the publication of estimates of totals and subtotals that are free of self-contradictory information.
We begin by discussing editing procedures, focusing on the model proposed by Fellegi and Holt [1976]. Their model was the first to provide fast, reproducible, table-driven methods that could be applied to general data, and the first to assure that a record could be corrected in a single pass through the data.
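To make the single-pass idea concrete, the following brute-force sketch performs error localization in the spirit of Fellegi–Holt on the hypothetical edits introduced above: it finds the smallest set of fields whose values can be replaced so that the record satisfies every edit. (This exhaustive search is an illustration only; production systems derive implied edits and solve a set-covering problem instead.)

```python
# Brute-force error localization in the spirit of Fellegi-Holt
# (illustration only; real systems use implied edits and set covering).
from itertools import combinations, product

EDITS = [  # same hypothetical edits as in the sketch above
    lambda r: r["age"] < 15 and r["marital_status"] == "married",
    lambda r: r["hours_worked"] > 0 and r["status"] == "unemployed",
]

DOMAINS = {  # hypothetical value domains for each field
    "age": range(0, 100),
    "marital_status": ["single", "married"],
    "hours_worked": range(0, 80),
    "status": ["employed", "unemployed"],
}

def passes_all(record):
    return not any(edit(record) for edit in EDITS)

def minimal_fields_to_change(record):
    """Smallest set of fields that can be given new values so that the
    record satisfies every edit; subsets are tried in increasing size."""
    fields = list(DOMAINS)
    for size in range(len(fields) + 1):
        for subset in combinations(fields, size):
            for values in product(*(DOMAINS[f] for f in subset)):
                if passes_all({**record, **dict(zip(subset, values))}):
                    return subset
    return None

print(minimal_fields_to_change({"age": 12, "marital_status": "married",
                                "hours_worked": 20, "status": "unemployed"}))
# -> ('age', 'hours_worked'): changing two fields suffices here
```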
Prior to Fellegi and Holt, records were changed iteratively and slowly, with no guarantee that the final set of changes would yield a record satisfying all edits. We then describe a number of schemes for imputing missing data elements, emphasizing the work of Rubin [1987] and Little and Rubin [1987, 2002].
Two important advantages of the Little–Rubin approach are that (1) probability distributions are preserved by the use of defensible statistical models and (2) estimated variances include a component due to the imputation. In some situations, the Little–Rubin methods may need extra information about the non-response mechanism.
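Advantage (2) can be made precise with the standard multiple-imputation combining rules from Rubin [1987]. If $\hat{Q}_i$ is the point estimate and $U_i$ its estimated variance computed from the $i$-th of $m$ completed datasets, then

$$
\bar{Q}_m = \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i, \qquad
\bar{U}_m = \frac{1}{m}\sum_{i=1}^{m}U_i, \qquad
B_m = \frac{1}{m-1}\sum_{i=1}^{m}\bigl(\hat{Q}_i - \bar{Q}_m\bigr)^2,
$$

and the total variance of $\bar{Q}_m$ is estimated by

$$
T_m = \bar{U}_m + \Bigl(1 + \frac{1}{m}\Bigr) B_m,
$$

where the second term is precisely the between-imputation component referred to in advantage (2).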
For instance, if certain high-income individuals tend not to report, or to misreport, their income, then a specific model for the income reporting of these individuals may be needed. In other situations, missing data can be imputed via methods that are straightforward extensions of hot-deck. We provide details of hot-deck and its extensions later in this chapter.
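As a preview, here is a minimal sketch of sequential hot-deck imputation; the grouping variables and values below are hypothetical illustrations. A missing item is filled with the value most recently observed on a complete "donor" record in the same imputation class:

```python
# A minimal sketch of sequential hot-deck imputation; the fields and
# imputation classes below are hypothetical illustrations.

def hot_deck(records, field, class_fields):
    """Fill missing values of `field` (None) with the value last seen
    on a donor record in the same imputation class."""
    donors = {}  # imputation class -> most recently observed value
    for record in records:
        key = tuple(record[c] for c in class_fields)
        if record[field] is None:
            if key in donors:
                record[field] = donors[key]  # impute from latest donor
        else:
            donors[key] = record[field]      # record becomes the donor
    return records

data = [
    {"region": "N", "sex": "F", "income": 41000},
    {"region": "N", "sex": "F", "income": None},  # imputed as 41000
    {"region": "S", "sex": "M", "income": None},  # no donor seen yet
    {"region": "S", "sex": "M", "income": 36000},
]
print(hot_deck(data, "income", ["region", "sex"]))
```

Extensions vary how the donor is chosen (for example, random selection within the class or nearest-neighbour matching), while the Little–Rubin approach instead draws imputations from an explicit statistical model.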
Ideally, we would like to have an all-purpose, unified edit/imputation model that incorporates the features of the Fellegi–Holt edit model and the Little–Rubin multiple imputation model. Unfortunately, we are not aware of such a model. However, Winkler [2003] provides a unified approach to edit and imputation when all of the data elements of interest can be considered to be discrete.
Publication date (per publisher) | May 23, 2007
Additional information | XIV, 234 p.
Place of publication | New York
Language | English
Subject areas | Mathematics / Computer Science ► Computer Science ► Databases
 | Computer Science ► Theory / Study ► Algorithms
 | Mathematics / Computer Science ► Computer Science ► Web / Internet
 | Mathematics / Computer Science ► Mathematics ► Statistics
 | Technology
Keywords | Coding • Database • data quality • Editing • Imputation • missing data • record linkage
ISBN-10 | 0-387-69505-2 / 0387695052
ISBN-13 | 978-0-387-69505-1 / 9780387695051
Size: 1.9 MB
DRM: digital watermark
This eBook contains a digital watermark and is therefore personalized for you. If the eBook is improperly passed on to third parties, it can be traced back to its source.
File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly well suited to technical books with columns, tables, and figures. A PDF can be displayed on almost all devices, but it is only of limited use on small displays (smartphone, eReader).