Responsible Data Science
John Wiley & Sons Inc (publisher)
978-1-119-74175-6 (ISBN)
The increasing popularity of data science has resulted in numerous well-publicized cases of bias, injustice, and discrimination. The widespread deployment of “black box” algorithms that are difficult or impossible to understand and explain, even for their developers, is a primary source of these unanticipated harms, making modern techniques and methods for manipulating large data sets seem sinister, even dangerous. When put in the hands of authoritarian governments, these algorithms have enabled suppression of political dissent and persecution of minorities. To prevent these harms, data scientists everywhere must come to understand how the algorithms that they build and deploy may harm certain groups or be unfair.
Responsible Data Science delivers a comprehensive, practical treatment of how to implement data science solutions in an even-handed and ethical manner that minimizes the risk of undue harm to vulnerable members of society. Both data science practitioners and managers of analytics teams will learn how to:
Improve model transparency, even for black box models
Diagnose bias and unfairness within models using multiple metrics
Audit projects to ensure fairness and minimize the possibility of unintended harm
Perfect for data science practitioners, Responsible Data Science will also earn a spot on the bookshelves of technically inclined managers, software developers, and statisticians.
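To give a flavor of what "diagnosing bias and unfairness using multiple metrics" can look like in code, here is a minimal Python sketch. It is not taken from the book; the column names, the 0.5 threshold, and the toy data are assumptions made for this illustration. It computes the kind of per-group quantities the book's auditing chapters examine (overall selection rates and false-positive rates) and reports the gap in selection rates between groups.

```python
# Illustrative sketch only -- not code from the book. The DataFrame
# columns ("group", "label", "pred"), the 0.5 threshold, and the toy
# data are assumptions made for this example.
import pandas as pd

def per_group_rates(scored: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Compute selection rate and false-positive rate for each group."""
    rows = []
    for group, g in scored.groupby("group"):
        selected = g["pred"] >= threshold   # positive predictions
        negatives = g["label"] == 0         # actual negatives
        rows.append({
            "group": group,
            "selection_rate": selected.mean(),
            "false_positive_rate": (selected & negatives).sum() / max(int(negatives.sum()), 1),
        })
    return pd.DataFrame(rows)

if __name__ == "__main__":
    toy = pd.DataFrame({
        "group": ["a", "a", "a", "b", "b", "b"],
        "label": [0, 1, 0, 0, 0, 1],
        "pred":  [0.2, 0.8, 0.4, 0.6, 0.7, 0.3],
    })
    rates = per_group_rates(toy)
    print(rates)
    # Gap in selection rates between groups (a demographic-parity-style check).
    print("selection-rate gap:", rates["selection_rate"].max() - rates["selection_rate"].min())
```

The gap printed at the end corresponds to a demographic-parity style of fairness check, one of several competing definitions; Chapter 7 ("The Many Different Conceptions of Fairness") discusses why such definitions trade off against one another.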
GRANT FLEMING is a Data Scientist at Elder Research Inc. His professional focus is on machine learning for social science applications, model interpretability, civic technology, and building software tools for reproducible data science. PETER BRUCE is the Senior Learning Officer at Elder Research, Inc., author of several best-selling texts on data science, and Founder of the Institute for Statistics Education at Statistics.com, an Elder Research Company.
Introduction xix
Part I Motivation for Ethical Data Science and Background Knowledge 1
Chapter 1 Responsible Data Science 3
The Optum Disaster 4
Jekyll and Hyde 5
Eugenics 7
Galton, Pearson, and Fisher 7
Ties between Eugenics and Statistics 7
Ethical Problems in Data Science Today 9
Predictive Models 10
From Explaining to Predicting 10
Predictive Modeling 11
Setting the Stage for Ethical Issues to Arise 12
Classic Statistical Models 12
Black-Box Methods 14
Important Concepts in Predictive Modeling 19
Feature Selection 19
Model-Centric vs. Data-Centric Models 20
Holdout Sample and Cross-Validation 20
Overfitting 21
Unsupervised Learning 22
The Ethical Challenge of Black Boxes 23
Two Opposing Forces 24
Pressure for More Powerful AI 24
Public Resistance and Anxiety 24
Summary 25
Chapter 2 Background: Modeling and the Black-Box Algorithm 27
Assessing Model Performance 27
Predicting Class Membership 28
The Rare Class Problem 28
Lift and Gains 28
Area Under the Curve 29
AUC vs. Lift (Gains) 31
Predicting Numeric Values 32
Goodness-of-Fit 32
Holdout Sets and Cross-Validation 33
Optimization and Loss Functions 34
Intrinsically Interpretable Models vs. Black-Box Models 35
Ethical Challenges with Interpretable Models 38
Black-Box Models 39
Ensembles 39
Nearest Neighbors 41
Clustering 41
Association Rules 42
Collaborative Filters 42
Artificial Neural Nets and Deep Neural Nets 43
Problems with Black-Box Predictive Models 45
Problems with Unsupervised Algorithms 47
Summary 48
Chapter 3 The Ways AI Goes Wrong, and the Legal Implications 49
AI and Intentional Consequences by Design 50
Deepfakes 50
Supporting State Surveillance and Suppression 51
Behavioral Manipulation 52
Automated Testing to Fine-Tune Targeting 53
AI and Unintended Consequences 55
Healthcare 56
Finance 57
Law Enforcement 58
Technology 60
The Legal and Regulatory Landscape around AI 61
Ignorance Is No Defense: AI in the Context of Existing Law and Policy 63
A Finger in the Dam: Data Rights, Data Privacy, and Consumer Protection Regulations 64
Trends in Emerging Law and Policy Related to AI 66
Summary 69
Part II The Ethical Data Science Process 71
Chapter 4 The Responsible Data Science Framework 73
Why We Keep Building Harmful AI 74
Misguided Need for Cutting-Edge Models 74
Excessive Focus on Predictive Performance 74
Ease of Access and the Curse of Simplicity 76
The Common Cause 76
The Face Thieves 78
An Anatomy of Modeling Harms 79
The World: Context Matters for Modeling 80
The Data: Representation Is Everything 83
The Model: Garbage In, Danger Out 85
Model Interpretability: Human Understanding for Superhuman Models 86
Efforts Toward a More Responsible Data Science 89
Principles Are the Focus 90
Nonmaleficence 90
Fairness 90
Transparency 91
Accountability 91
Privacy 92
Bridging the Gap Between Principles and Practice with the Responsible Data Science (RDS) Framework 92
Justification 94
Compilation 94
Preparation 95
Modeling 96
Auditing 96
Summary 97
Chapter 5 Model Interpretability: The What and the Why 99
The Sexist Résumé Screener 99
The Necessity of Model Interpretability 101
Connections Between Predictive Performance and Interpretability 103
Uniting (High) Model Performance and Model Interpretability 105
Categories of Interpretability Methods 107
Global Methods 107
Local Methods 113
Real-World Successes of Interpretability Methods 113
Facilitating Debugging and Audit 114
Leveraging the Improved Performance of Black-Box Models 116
Acquiring New Knowledge 116
Addressing Critiques of Interpretability Methods 117
Explanations Generated by Interpretability Methods Are Not Robust 118
Explanations Generated by Interpretability Methods Are Low Fidelity 120
The Forking Paths of Model Interpretability 121
The Four-Measure Baseline 122
Building Our Own Credit Scoring Model 124
Using Train-Test Splits 125
Feature Selection and Feature Engineering 125
Baseline Models 127
The Importance of Making Your Code Work for Everyone 129
Execution Variability 129
Addressing Execution Variability with Functionalized Code 130
Stochastic Variability 130
Addressing Stochastic Variability via Resampling 130
Summary 133
Part III EDS in Practice 135
Chapter 6 Beginning a Responsible Data Science Project 137
How the Responsible Data Science Framework Addresses the Common Cause 138
Datasets Used 140
Regression Datasets—Communities and Crime 140
Classification Datasets—COMPAS 140
Common Elements Across Our Analyses 141
Project Structure and Documentation 141
Project Structure for the Responsible Data Science Framework: Everything in Its Place 142
Documentation: The Responsible Thing to Do 145
Beginning a Responsible Data Science Project 151
Communities and Crime (Regression) 151
Justification 151
Compilation 154
Identifying Protected Classes 157
Preparation—Data Splitting and Feature Engineering 159
Datasheets 161
COMPAS (Classification) 164
Justification 164
Compilation 166
Identifying Protected Classes 168
Preparation 169
Summary 172
Chapter 7 Auditing a Responsible Data Science Project 173
Fairness and Data Science in Practice 175
The Many Different Conceptions of Fairness 175
Different Forms of Fairness Are Trade-Offs with Each Other 177
Quantifying Predictive Fairness Within a Data Science Project 179
Mitigating Bias to Improve Fairness 185
Preprocessing 185
In-processing 186
Postprocessing 186
Classification Example: COMPAS 187
Prework: Code Practices, Modeling, and Auditing 187
Justification, Compilation, and Preparation Review 189
Modeling 191
Auditing 200
Per-Group Metrics: Overall 200
Per-Group Metrics: Error 202
Fairness Metrics 204
Interpreting Our Models: Why Are They Unfair? 207
Analysis for Different Groups 209
Bias Mitigation 214
Preprocessing: Oversampling 214
Postprocessing: Optimizing Thresholds Automatically 218
Postprocessing: Optimizing Thresholds Manually 219
Summary 223
Chapter 8 Auditing for Neural Networks 225
Why Neural Networks Merit Their Own Chapter 227
Neural Networks Vary Greatly in Structure 227
Neural Networks Treat Features Differently 229
Neural Networks Repeat Themselves 231
A More Impenetrable Black Box 232
Baseline Methods 233
Representation Methods 233
Distillation Methods 234
Intrinsic Methods 235
Beginning a Responsible Neural Network Project 236
Justification 236
Moving Forward 239
Compilation 239
Tracking Experiments 241
Preparation 244
Modeling 245
Auditing 247
Per-Group Metrics: Overall 247
Per-Group Metrics: Unusual Definitions of “False Positive” 248
Fairness Metrics 249
Interpreting Our Models: Why Are They Unfair? 252
Bias Mitigation 253
Wrap-Up 255
Auditing Neural Networks for Natural Language Processing 258
Identifying and Addressing Sources of Bias in NLP 258
The Real World 259
Data 260
Models 261
Model Interpretability 262
Summary 262
Chapter 9 Conclusion 265
How Can We Do Better? 267
The Responsible Data Science Framework 267
Doing Better As Managers 269
Doing Better As Practitioners 270
A Better Future If We Can Keep It 271
Index 273
Publication date | 25.06.2021
---|---
Place of publication | New York
Language | English
Dimensions | 185 x 231 mm
Weight | 522 g
Subject area | Computer Science ► Office Programs ► Outlook
ISBN-10 | 1-119-74175-0 / 1119741750
ISBN-13 | 978-1-119-74175-6 / 9781119741756
Condition | New