Methodological Developments in Data Linkage

Katie Harron, Harvey Goldstein, Chris Dibben (Authors)

Book | Hardcover

288 pages

2015
John Wiley & Sons Inc (Publisher)
978-1-118-74587-8 (ISBN)

A comprehensive compilation of new developments in data linkage methodology

The increasing availability of large administrative databases has led to a dramatic rise in the use of data linkage, yet the standard texts on linkage are still those describing the seminal work of the 1950s and 1960s, with some updates. Linkage and analysis of data across sources remain problematic because of a lack of discriminatory and accurate identifiers, missing data and regulatory issues. Recent developments in data linkage methodology have concentrated on bias and analysis of linked data, novel approaches to organising relationships between databases, and privacy-preserving linkage.

Methodological Developments in Data Linkage brings together a collection of contributions from members of the international data linkage community, covering cutting-edge methodology in this field. It presents the opportunities and challenges provided by linkage of large and often complex datasets, including analysis problems, legal and security aspects, models for data access and the development of novel research areas. New methods for handling uncertainty in analysis of linked data, solutions for anonymised linkage and alternative models for data collection are also discussed.

Key Features:

Presents cutting-edge methods for a topic of increasing importance to a wide range of research areas, with applications to data linkage systems internationally
Covers the essential issues associated with data linkage today
Includes examples based on real data linkage systems, highlighting the opportunities, successes and challenges that the increasing availability of linkage data provides
Takes a novel approach that incorporates technical aspects of linkage, management and analysis of linked data

This book will be of core interest to academics, government employees, data holders, data managers, analysts and statisticians who use administrative data. It will also appeal to researchers in a variety of areas, including epidemiology, biostatistics, social statistics, informatics, policy and public health.

Editors: Katie Harron, London School of Hygiene and Tropical Medicine, UK; Harvey Goldstein, University of Bristol and University College London, UK; Chris Dibben, University of Edinburgh, UK

Foreword xi

Contributors xiii

1 Introduction 1
Katie Harron, Harvey Goldstein and Chris Dibben

1.1 Introduction: data linkage as it exists 1

1.2 Background and issues 2

1.3 Data linkage methods 3

1.3.1 Deterministic linkage 3

1.3.2 Probabilistic linkage 3

1.3.3 Data preparation 4

1.4 Linkage error 5

1.5 Impact of linkage error on analysis of linked data 6

1.6 Data linkage: the future 7

2 Probabilistic linkage 8
William E. Winkler

2.1 Introduction 8

2.2 Overview of methods 10

2.2.1 The Fellegi–Sunter model of record linkage 10

2.2.2 Learning parameters 13

2.2.3 Additional methods for matching 20

2.2.4 An empirical example 22

2.3 Data preparation 23

2.3.1 Description of a matching project 24

2.3.2 Initial file preparation 25

2.3.3 Name standardisation and parsing 26

2.3.4 Address standardisation and parsing 27

2.3.5 Summarising comments on preprocessing 27

2.4 Advanced methods 28

2.4.1 Estimating false-match rates without training data 28

2.4.2 Adjusting analyses for linkage error 32

2.5 Concluding comments 35

3 The data linkage environment 36
Chris Dibben, Mark Elliot, Heather Gowans, Darren Lightfoot and Data Linkage Centres

3.1 Introduction 36

3.2 The data linkage context 37

3.2.1 Administrative or routine data 37

3.2.2 The law and the use of administrative (personal) data for research 38

3.2.3 The identifiability problem in data linkage 42

3.3 The tools used in the production of functional anonymity through a data linkage environment 42

3.3.1 Governance, rules and the researcher 43

3.3.2 Application process, ethics scrutiny and peer review 43

3.3.3 Shaping ‘safe’ behaviour: training, sanctions, contracts and licences 43

3.3.4 ‘Safe’ data analysis environments 44

3.3.5 Fragmentation: separation of linkage process and temporary linked data 47

3.4 Models for data access and data linkage 50

3.4.1 Single centre 50

3.4.2 Separation of functions: firewalls within single centre 51

3.4.3 Separation of functions: TTP linkage 53

3.4.4 Secure multiparty computation 53

3.5 Four case study data linkage centres 54

3.5.1 Population Data BC 54

3.5.2 The Secure Anonymised Information Linkage Databank, United Kingdom 58

3.5.3 Centre for Data Linkage (Population Health Research Network), Australia 59

3.5.4 The Centre for Health Record Linkage, Australia 61

3.6 Conclusion 62

4 Bias in data linkage studies 63
Megan Bohensky

4.1 Background 63

4.2 Description of types of linkage error 65

4.2.1 Missed matches from missing linkage variables 65

4.2.2 Missed matches from inconsistent case ascertainment 66

4.2.3 False matches: Description of cases incorrectly matched 66

4.3 How linkage error impacts research findings 68

4.3.1 Results 68

4.3.2 Assessment of linkage bias 75

4.4 Discussion 78

4.4.1 Potential biases in the review process 79

4.4.2 Recommendations and implications for practice 79

5 Secondary analysis of linked data 83
Raymond Chambers and Gunky Kim

5.1 Introduction 83

5.2 Measurement error issues arising from linkage 84

5.2.1 Correct links, incorrect links and non-links 84

5.2.2 Characterising linkage errors 85

5.2.3 Characterising errors from non-linkage 86

5.3 Models for different types of linking errors 86

5.3.1 Linkage errors under binary linking 86

5.3.2 Linkage errors under multi-linking 88

5.3.3 Incomplete linking 88

5.3.4 Modelling the linkage error 89

5.4 Regression analysis using complete binary-linked data 90

5.4.1 Linear regression 91

5.4.2 Logistic regression 95

5.5 Regression analysis using incomplete binary-linked data 95

5.5.1 Linear regression using incomplete sample to register linked data 97

5.6 Regression analysis with multi-linked data 99

5.6.1 Uncorrelated multi-linking: Complete linkage 100

5.6.2 Uncorrelated multi-linking: Sample to register linkage 101

5.6.3 Correlated multi-linkage 105

5.6.4 Incorporating auxiliary population information 105

5.7 Conclusion and discussion 107

6 Record linkage: A missing data problem 109
Harvey Goldstein and Katie Harron

6.1 Introduction 109

6.2 Probabilistic Record Linkage (PRL) 111

6.3 Multiple Imputation (MI) 112

6.4 Prior-Informed Imputation (PII) 113

6.4.1 Estimating matching probabilities 115

6.5 Example 1: Linking electronic healthcare data to estimate trends in bloodstream infection 115

6.5.1 Methods 115

6.5.2 Results 117

6.5.3 Conclusions 118

6.6 Example 2: Simulated data including non-random linkage error 118

6.6.1 Methods 118

6.6.2 Results 119

6.7 Discussion 122

6.7.1 Non-random linkage error 122

6.7.2 Strengths and limitations: Handling linkage error 122

6.7.3 Implications for data linkers and data users 123

7 Using graph databases to manage linked data 125
James M. Farrow

7.1 Summary 125

7.2 Introduction 126

7.2.1 Flat approach 127

7.2.2 Oops, your legacy is showing 128

7.2.3 Shortcomings 128

7.3 Graph approach 131

7.3.1 Overview of graph concepts 131

7.3.2 Graph queries versus relational queries 133

7.3.3 Comparison of data in flat database versus graph database 136

7.3.4 Relaxing the notion of ‘truth’ 137

7.3.5 Not a linkage approach per se but a management approach which enables novel linkage approaches 138

7.3.6 Linkage engine independent 139

7.3.7 Separates out linkage from cluster identification phase (and clerical review) 139

7.4 Methodologies 139

7.4.1 Overview of storage and extraction approach 140

7.4.2 Overall management of data as collections 141

7.4.3 Data loading 142

7.4.4 Identification of equivalence sets and deterministic linkage 143

7.4.5 Probabilistic linkage 144

7.4.6 Clerical review 144

7.4.7 Determining cut-off thresholds 145

7.4.8 Final cluster extraction 147

7.4.9 Graph partitioning 147

7.4.10 Data management/curation 150

7.4.11 User interface challenges 150

7.4.12 Final cluster extraction 154

7.4.13 A typical end-to-end workflow 155

7.5 Algorithm implementation 156

7.5.1 Graph traversal 156

7.5.2 Cluster identification 157

7.5.3 Partitioning visitor 158

7.5.4 Encapsulating edge following policies 158

7.5.5 Graph partitioning 158

7.5.6 Insertion of review links 158

7.5.7 How to migrate while preserving current clusters 158

7.6 New approaches facilitated by graph storage approach 158

7.6.1 Multiple threshold extraction 160

7.6.2 Possibility of returning graph to end users 165

7.6.3 Optimised cluster analysis 166

7.6.4 Other link types 167

7.7 Conclusion 167

8 Large-scale linkage for total populations in official statistics 170
Owen Abbott, Peter Jones and Martin Ralphs

8.1 Introduction 170

8.2 Current practice in record linkage for population censuses 171

8.2.1 Introduction 171

8.2.2 Case study: the 2011 England and Wales Census assessment of coverage 172

8.3 Population-level linkage in countries that operate a population register: register-based censuses 178

8.3.1 Introduction 178

8.3.2 Case study 1: Finland 179

8.3.3 Case study 2: The Netherlands Virtual Census 180

8.3.4 Case study 3: Poland 180

8.3.5 Case study 4: Germany 181

8.3.6 Summary 181

8.4 New challenges in record linkage: the Beyond 2011 Programme 182

8.4.1 Introduction 182

8.4.2 Beyond 2011 linking methodology 183

8.4.3 The anonymisation process in Beyond 2011 184

8.4.4 Beyond 2011 linkage strategy using pseudonymised data 185

8.4.5 Linkage quality 195

8.4.6 Next steps 197

8.4.7 Conclusion 198

8.5 Summary 199

9 Privacy-preserving record linkage 201
Rainer Schnell

9.1 Introduction 201

9.2 Chapter outline 202

9.3 Linking with and without personal identification numbers 202

9.3.1 Linking using a trusted third party 203

9.3.2 Linking with encrypted PIDs 204

9.3.3 Linking with encrypted quasi-identifiers 204

9.3.4 PPRL in decentralised organisations 204

9.4 PPRL approaches 206

9.4.1 Phonetic codes 206

9.4.2 High-dimensional embeddings 206

9.4.3 Reference tables 207

9.4.4 Secure multiparty computations for PPRL 207

9.4.5 Bloom filter-based PPRL 207

9.5 PPRL for very large databases: blocking 209

9.5.1 Blocking for PPRL with Bloom filters 210

9.5.2 Blocking Bloom filters with MBT 211

9.5.3 Empirical comparison of blocking techniques for Bloom filters 211

9.5.4 Current recommendations for linking very large datasets with Bloom filters 213

9.6 Privacy considerations 213

9.6.1 Probability of attacks 214

9.6.2 Kind of attacks 215

9.6.3 Attacks on Bloom filters 215

9.7 Hardening Bloom filters 217

9.7.1 Randomly selected hash values 218

9.7.2 Random bits 218

9.7.3 Avoiding padding 220

9.7.4 Standardising the length of identifiers 220

9.7.5 Sampling bits for composite Bloom filters 221

9.7.6 Rehashing 221

9.7.7 Salting keys with record-specific data 223

9.7.8 Fake injections 223

9.7.9 Evaluation of Bloom filter hardening procedures 223

9.8 Future research 224

9.9 PPRL research and implementation with national databases 225

10 Summary 226
Katie Harron, Chris Dibben and Harvey Goldstein

10.1 Introduction 226

10.2 Part 1: Data linkage as it exists today 226

10.3 Part 2: Analysis of linked data 227

10.3.1 Quality of identifiers 227

10.3.2 Quality of linkage methods 228

10.3.3 Quality of evaluation 228

10.4 Part 3: Data linkage in practice: new developments 229

10.5 Concluding remarks 231

References 233

Index 253

Publication date (per publisher)	14 December 2015
Series	Wiley Series in Probability and Statistics
Place of publication	New York
Language	English
Dimensions	178 x 252 mm
Weight	590 g
Subject areas	Computer Science ► Databases ► Data Warehouse / Data Mining
	Mathematics / Computer Science ► Mathematics ► Probability / Combinatorics
	Academic Studies ► Cross-Disciplinary Areas ► Epidemiology / Medical Biometry
ISBN-10	1-118-74587-6 / 1118745876
ISBN-13	978-1-118-74587-8 / 9781118745878
Condition	New