Learning from Imbalanced Data Sets (eBook)
XVIII, 377 pages
Springer International Publishing (publisher)
978-3-319-98074-4 (ISBN)
This book provides a general and comprehensible overview of imbalanced learning. It gives a formal description of the problem and focuses on its main features and the most relevant proposed solutions. Additionally, it considers the different Data Science scenarios in which imbalanced classification poses a real challenge.
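To give a flavour of that formal description: in a binary problem, the severity of the skew is commonly summarized by the imbalance ratio, IR = n_majority / n_minority. The snippet below is a minimal illustration of ours (not code from the book, and the label counts are hypothetical) showing why plain accuracy breaks down under such a ratio:

```python
from collections import Counter

# Hypothetical label vector: 95 negatives, 5 positives.
y = [0] * 95 + [1] * 5

counts = Counter(y)
n_maj, n_min = max(counts.values()), min(counts.values())
print(f"Imbalance ratio IR = {n_maj / n_min:.1f}")  # IR = 19.0

# A degenerate model that always predicts the majority class already
# reaches 95% accuracy while never detecting a single minority example,
# which is why imbalanced learning needs its own metrics and methods.
print(f"Accuracy of always predicting 0: {counts[0] / len(y):.2f}")
```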
This book stresses the gap between imbalanced learning and standard classification tasks by reviewing case studies and the ad-hoc performance metrics applied in this area. It also covers the approaches traditionally used to address a binary skewed class distribution. Specifically, it reviews cost-sensitive learning, data-level preprocessing methods, and algorithm-level solutions, also taking into account ensemble-learning solutions that embed any of the former alternatives. Furthermore, it covers the extension of the problem to multi-class settings, where the classical methods can no longer be applied in a straightforward way.

This book also examines the intrinsic data characteristics that, added to the uneven class distribution, are the main causes truly hindering the performance of classification algorithms in this scenario. Some notes on data reduction are then provided to explain the advantages of this type of approach.
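To make the three solution families named above concrete, here is a minimal sketch of ours (not code from the book) contrasting them on a synthetic binary problem. It assumes scikit-learn and the third-party imbalanced-learn package are available; all parameter values are arbitrary:

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE
from imblearn.ensemble import EasyEnsembleClassifier

# Synthetic binary problem with a 9:1 class skew (arbitrary choice).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           random_state=42)

# 1) Cost-sensitive learning: weight errors inversely to class frequency.
cs_tree = DecisionTreeClassifier(class_weight="balanced").fit(X, y)

# 2) Data-level preprocessing: SMOTE synthesizes new minority examples.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))  # classes are balanced afterwards

# 3) Ensemble learning: EasyEnsemble trains AdaBoost members on
#    balanced bootstrap samples obtained by random undersampling.
ens = EasyEnsembleClassifier(n_estimators=10, random_state=42).fit(X, y)
```

Which family works best is problem-dependent; the book treats each in depth in Chaps. 4, 5, and 7, respectively.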
Finally, this book introduces some novel areas of study that are attracting growing attention with respect to the imbalanced data issue. Specifically, it considers the classification of data streams, non-classical classification problems, and the scalability issues related to Big Data. Examples of software libraries and modules to address imbalanced classification are also provided.
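As a small taste of such library support (the book's own software survey appears in Chap. 14), the sketch below, again our illustration under the assumption that imbalanced-learn is installed, chains a sampler and a classifier in an imblearn Pipeline so that SMOTE is re-fit on the training folds only during cross-validation:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           random_state=0)

# The imblearn Pipeline accepts samplers as intermediate steps;
# SMOTE is applied to the training folds only, never to the test fold.
model = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# F1 on the minority class is more informative than accuracy here.
scores = cross_val_score(model, X, y, cv=5, scoring="f1")
print(f"Cross-validated F1 on the minority class: {scores.mean():.3f}")
```

Fitting the sampler inside the pipeline matters: oversampling before the train/test split would leak synthetic copies of test-fold minority examples into the training data and inflate the scores.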
This book is highly suitable for technical professionals and for senior undergraduate and graduate students in data science, computer science, and engineering. It will also help scientists and researchers gain insight into current developments in this area of study, as well as future research directions.
Preface 6
Contents 8
Acronyms 15
1 Introduction to KDD and Data Science 17
1.1 Introduction 17
1.2 A Definition of Data Science 19
1.3 The Data Science Process 20
1.3.1 Selection of the Data 22
1.3.2 Data Preprocessing 23
1.3.2.1 Why Is Preprocessing Required? 23
1.3.3 Stages of the Data Preprocessing Phase 24
1.3.3.1 Selection of Data 25
1.3.3.2 Exploration of Data 26
1.3.3.3 Transformation of Data 27
1.4 Standard Data Science Problems 27
1.4.1 Descriptive Problems 27
1.4.2 Predictive Problems 28
1.5 Classical Data Mining Techniques 29
1.6 Non-standard Data Science Problems 30
1.6.1 Derivative Problems 30
1.6.1.1 Imbalanced Learning 30
1.6.1.2 Multi-instance Learning 31
1.6.1.3 Multi-label Classification 31
1.6.1.4 Data Stream Learning 31
1.6.2 Hybrid Problems 31
1.6.2.1 Semi-supervised Learning 31
1.6.2.2 Subgroup Discovery 31
1.6.2.3 Ordinal Classification/Regression 32
1.6.2.4 Transfer Learning 32
References 32
2 Foundations on Imbalanced Classification 34
2.1 Formal Description 34
2.2 Applications 39
2.2.1 Engineering 42
2.2.2 Information Technology 43
2.2.3 Bioinformatics 45
2.2.4 Medicine 46
2.2.4.1 Quality Control 46
2.2.4.2 Medical Diagnosis 47
2.2.4.3 Medical Prognosis 49
2.2.5 Business Management 50
2.2.6 Security 50
2.2.7 Education 51
2.3 Case Studies on Imbalanced Classification 51
References 56
3 Performance Measures 62
3.1 Introduction 62
3.2 Nominal Class Predictions 63
3.3 Scoring Predictions 68
3.4 Probabilistic Predictions 72
3.5 Summarizing Comments 73
References 74
4 Cost-Sensitive Learning 77
4.1 Introduction 77
4.2 Obtaining the Cost Matrix 80
4.3 MetaCost 82
4.4 Cost-Sensitive Decision Trees 83
4.4.1 Direct Approach with Cost-Sensitive Splitting 84
4.4.2 Meta-learning Approach with Instance Weighting 85
4.5 Other Cost-Sensitive Classifiers 86
4.5.1 Support Vector Machines 86
4.5.2 Artificial Neural Networks 87
4.5.3 Nearest Neighbors 87
4.6 Hybrid Cost-Sensitive Approaches 87
4.7 Summarizing Comments 88
References 89
5 Data Level Preprocessing Methods 93
5.1 Introduction 93
5.2 Undersampling and Oversampling Basics 96
5.3 Advanced Undersampling Techniques 100
5.3.1 Evolutionary Undersampling 101
5.3.1.1 ACOSampling 103
5.3.1.2 IPADE-ID 104
5.3.1.3 CBEUS: Cluster-Based Evolutionary Undersampling 105
5.3.2 Undersampling by Cleaning Data 106
5.3.2.1 Weighted Sampling 106
5.3.2.2 IHT: Instance Hardness Threshold 106
5.3.2.3 Hybrid Undersampling 108
5.3.3 Ensemble Based Undersampling 108
5.3.3.1 IRUS: Inverse Random Undersampling 109
5.3.3.2 OligoIS: Oligarchic Instance Selection 110
5.3.4 Clustering Based Undersampling 110
5.3.4.1 ClusterOSS 111
5.3.4.2 DSUS: Diversified Sensitivity Undersampling 111
5.4 Synthetic Minority Oversampling TEchnique (SMOTE) 112
5.5 Extensions of SMOTE 115
5.5.1 Borderline-SMOTE 115
5.5.2 Adjusting the Direction of the Synthetic Minority ClasS Examples: ADOMS 117
5.5.3 ADASYN: Adaptive Synthetic Sampling Approach 118
Input 119
Procedure 119
5.5.4 ROSE: Random Oversampling Examples 120
5.5.5 Safe-Level-SMOTE 122
5.5.6 DBSMOTE: Density-Based SMOTE 122
5.5.7 MWMOTE: Majority Weighted Minority Oversampling TEchnique 124
Input 126
Procedure 126
5.5.8 MDO: Mahalanobis Distance-Based Oversampling Technique 128
5.6 Hybridizations of Undersampling and Oversampling 128
5.7 Summarizing Comments 131
References 131
6 Algorithm-Level Approaches 136
6.1 Introduction 136
6.2 Support Vector Machines 137
6.2.1 Kernel Modifications 140
6.2.1.1 Kernel Boundary and Margin Shift 140
6.2.1.2 Kernel Target Alignment 141
6.2.1.3 Kernel Scaling 141
6.2.2 Weighted Approaches 142
6.2.2.1 Instance Weighting 142
6.2.2.2 Support Vector Weighting 143
6.2.2.3 Fuzzy Approaches 144
6.2.3 Active Learning 146
6.3 Decision Trees 147
6.4 Nearest Neighbor Classifiers 149
6.5 Bayesian Classifiers 151
6.6 One-Class Classifiers 152
6.7 Summarizing Comments 154
References 154
7 Ensemble Learning 160
7.1 Introduction 160
7.2 Foundations on Ensemble Learning 161
7.2.1 Bagging 165
7.2.2 Boosting 168
7.2.3 Techniques to Increase Diversity in Classifier Ensembles 173
7.3 Ensemble Learning for Addressing the Class Imbalance Problem 174
7.3.1 Cost-Sensitive Boosting 176
7.3.1.1 AdaCost 178
7.3.1.2 CSB 179
7.3.1.3 RareBoost 179
7.3.1.4 AdaC1 180
7.3.1.5 AdaC2 180
7.3.1.6 AdaC3 181
7.3.2 Ensembles with Cost-Sensitive Base Classifiers 181
7.3.2.1 BoostedCS-SVM 181
7.3.2.2 BoostedWeightedELM 182
7.3.2.3 CS-DT-Ensemble 182
7.3.2.4 BayEnsBNN 182
7.3.2.5 AL-BoostedCS-SVM 183
7.3.2.6 IC-BoostedCS-SVM 183
7.3.3 Boosting-Based Ensembles 183
7.3.3.1 SMOTEBoost/MSMOTEBoost 183
7.3.3.2 RUSBoost 184
7.3.3.3 DataBoost-IM 184
7.3.3.4 RAMOBoost 185
7.3.3.5 Adaboost.NC 185
7.3.3.6 EUSBoost 186
7.3.3.7 GESuperPBoost 186
7.3.3.8 BalancedBoost 186
7.3.3.9 RB-Boost 186
7.3.3.10 Balanced-St-GrBoost 187
7.3.4 Bagging-Based Ensembles 187
7.3.4.1 OverBagging 188
7.3.4.2 UnderBagging 188
7.3.4.3 UnderOverBagging 189
7.3.4.4 IIVotes 190
7.3.4.5 RB-Bagging 190
7.3.4.6 EPRENNID 190
7.3.4.7 USwitchingNED 191
7.3.5 Hybrid Ensembles 191
7.3.5.1 EasyEnsemble 192
7.3.5.2 BalanceCascade 192
7.3.5.3 HardEnsemble 192
7.3.5.4 StochasticEnsemble 193
7.3.6 Other 193
7.3.6.1 MOGP-GP 193
7.3.6.2 RandomOracles 193
7.3.6.3 Loss Factors 194
7.3.6.4 GOBoost 194
7.3.6.5 OrderingBasedPruning 194
7.3.6.6 Diversity Enhancing Techniques for Improving Ensembles 195
7.3.6.7 PT-Bagging 195
7.3.6.8 IMCStacking 195
7.3.6.9 DynamicSelection 196
7.4 An Illustrative Experimental Study on Ensembles for the Class Imbalance Problem 196
7.4.1 Experimental Framework 197
7.4.1.1 Datasets and Performance Measures 197
7.4.1.2 Algorithms and Parameters 197
7.4.1.3 Statistical Analysis 199
7.4.2 Experimental Results and Discussion 199
7.5 Summarizing Comments 203
References 204
8 Imbalanced Classification with Multiple Classes 210
8.1 Introduction 210
8.2 Multi-class Imbalanced Learning via Decomposition-Based Approaches 212
8.2.1 Reducing Multi-class Problems by Binarization Techniques 212
8.2.1.1 The One-vs-One Scheme (OVO) 212
8.2.1.2 The One-vs-All Scheme (OVA) 213
8.2.2 Binary Imbalanced Approaches for Multi-class Problems 214
8.2.3 Discussion on the Capabilities of Decomposition Strategies 217
8.3 Ad-hoc Approaches for Multi-class Imbalanced Classification 219
8.3.1 Multi-class Preprocessing Techniques 219
8.3.2 Algorithmic Solutions on Multi-class 220
8.3.3 Multi-class Cost-Sensitive Learning 222
8.3.4 Ensemble Approaches 223
8.3.5 Summary and Future Prospects on Ad-hoc Approaches 225
8.3.5.1 Preprocessing Techniques 225
8.3.5.2 Algorithmic Approaches 226
8.3.5.3 Cost-Sensitive Learning 226
8.3.5.4 Ensemble Systems 226
8.4 Performance Metrics in Multi-class Imbalanced Problems 226
8.5 A Brief Experimental Analysis for Imbalanced Multi-class Problems 230
8.5.1 Experimental Setup 230
8.5.2 Experimental Results and Discussion 232
8.6 Summarizing Comments 234
References 234
9 Dimensionality Reduction for Imbalanced Learning 240
9.1 Introduction 240
9.2 Feature Selection 242
9.2.1 Studies of Classical Feature Selection in Imbalance Learning 243
9.2.2 Ad-hoc Feature Selection Techniques for Tackling Imbalance Classification 245
9.2.2.1 Feature Selection with Biased Sample Distribution 246
9.2.2.2 Combating the Small Sample Class Imbalance Problem Using Feature Selection 248
9.2.2.3 Discriminative Feature Selection by Nonparametric Bayes Error Minimization 249
9.2.2.4 Feature Selection for High-Dimensional Imbalanced Data 249
9.2.2.5 Iterative Feature Selection 251
9.3 Advanced Feature Selection 252
9.3.1 Ensemble and Wrapper-Based Techniques 252
9.3.2 Evolutionary-Based Techniques 253
9.4 Linear Models for Feature Extraction 253
9.4.1 Asymmetric Principal Component Analysis 254
9.4.2 Extraction of Minimum Positive and Maximum Negative Features 256
9.4.2.1 Model 1 257
9.4.2.2 Model 2 258
9.5 Non-linear Models for Feature Extraction: Autoencoders 258
9.6 Discretization in Imbalanced Data: ur-CAIM 261
9.7 Summarizing Comments 262
References 263
10 Data Intrinsic Characteristics 265
10.1 Introduction 265
10.2 Data Complexity for Imbalanced Datasets 266
10.3 Sub-concepts and Small-Disjuncts 267
10.4 Lack of Data 273
10.5 Overlapping and Separability 274
10.6 Noisy Data 276
10.7 Borderline Examples 279
10.8 Dataset Shift 282
10.9 Imperfect Data 284
10.10 Summarizing Comments 285
References 285
11 Learning from Imbalanced Data Streams 290
11.1 Introduction 290
11.2 Characteristics of Imbalanced Data Streams 295
11.3 Data-Level and Algorithm-Level Approaches 298
11.3.1 Undersampling Naïve Bayes 298
11.3.2 Generalized Over-sampling Based Online Imbalanced Learning Framework (GOS-IL) 299
11.3.3 Sequential SMOTE 299
11.3.4 Recursive Least Square Perceptron Model (RLSACP) and Online Neural Network for Non-stationary and Imbalanced Data Streams (ONN) 299
11.3.5 Dynamic Class Imbalance for Linear Proximal SVMs (DCIL-IncLPSVM) 300
11.3.6 Kernelized Online Imbalanced Learning (KOIL) 300
11.3.7 Gaussian Hellinger Very Fast Decision Tree (GH-VFDT) 300
11.3.8 Cost-Sensitive Fast Perceptron Tree (CSPT) 301
11.4 Ensemble Learning Approaches 302
11.4.1 Stream Ensemble Framework (SE) 302
11.4.2 Selectively Recursive Approach (SERA) 303
11.4.3 Recursive Ensemble Approach (REA) 303
11.4.4 Boundary Definition Ensemble (BD) 303
11.4.5 Learn++.CDS (Concept Drift with SMOTE) 304
11.4.6 Ensemble of Online Cost-Sensitive Neural Networks (EONN) 304
11.4.7 Ensemble of Subset Online Sequential Extreme Learning Machines (ESOS-ELM) 304
11.4.8 Oversampling- and Undersampling-Based Online Bagging (OOB and UOB) 304
11.4.9 Dynamic Weighted Majority for Imbalance Learning (DWMIL) 305
11.4.10 Gradual Resampling Ensemble (GRE) 305
11.5 Evolving Number of Classes 305
11.5.1 Learn++.NovelClass (Learn++.NC) 306
11.5.2 Enhanced Classifier for Data Streams with Novel Class Miner (ECSMiner) 306
11.5.3 Multiclass Miner in Data Streams (MCM) 306
11.5.4 AnyNovel 307
11.5.5 Class-Based Ensemble for Class Evolution (CBCE) 307
11.5.6 Class Based Micro Classifier Ensemble (CLAM) and Stream Classifier And Novel and Recurring Class Detector (SCARN) 307
11.6 Access to Ground Truth 308
11.6.1 Online Active Learning with Bayesian Probit 308
11.6.2 Online Mean Score on Unlabeled Set (Online-MSU) 309
11.6.3 Cost-Sensitive Online Active Learning Under a Query Budget (CSOAL) 309
11.6.4 Online Active Learning with the Asymmetric Query Model 309
11.6.5 Genetic Programming Active Learning Framework (Stream-GP) 309
11.7 Summarizing Comments 310
References 311
12 Non-classical Imbalanced Classification Problems 315
12.1 Introduction 315
12.2 Semi-supervised Learning 316
12.2.1 Inductive Semi-supervised Learning 316
12.2.2 Transductive Learning 317
12.2.3 PU-Learning 318
12.2.4 Active Learning 318
12.3 Multilabel Learning 319
12.3.1 Imbalance Quantification 320
12.3.2 Methods for Dealing with Imbalance in MLL 321
12.3.2.1 Resampling 321
12.3.2.2 Algorithm Adaptation 322
12.3.2.3 Ensemble Learning 323
12.4 Multi-instance Learning 324
12.4.1 Methods for Dealing with Imbalance in MIL 325
12.4.1.1 Resampling 325
12.4.1.2 Problem Adaptation 326
12.4.1.3 Ensembles 326
12.5 Ordinal Classification and Regression 327
12.5.1 Imbalanced Regression 328
12.5.1.1 Under-sampling for Regression 329
12.5.1.2 SMOTE for Regression 330
12.5.2 Ordinal Classification of Imbalanced Data 330
12.5.2.1 Graph-Based Over-sampling 331
12.5.2.2 Cluster-Based Weighted Over-sampling 331
12.6 Summarizing Comments 331
References 332
13 Imbalanced Classification for Big Data 336
13.1 Introduction 336
13.2 Big Data: MapReduce Programming Model, Spark Framework and Machine Learning Libraries 338
13.2.1 Introduction to Big Data and MapReduce 338
13.2.2 Spark: A Novel Technological Approach for Iterative Processing in Big Data 340
13.2.3 Machine Learning Libraries for Big Data 342
13.2.3.1 Hadoop: Apache Mahout 342
13.2.3.2 Spark: MLlib and SparkPackages 342
13.3 Addressing Imbalanced Classification in Big Data Problems: Current State 343
13.3.1 Data Pre-processing Studies 344
13.3.1.1 Traditional Data Based Solutions for Big Data 344
13.3.1.2 Random OverSampling with Evolutionary Feature Weighting and Random Forest (ROSEFW-RF) 345
13.3.1.3 Evolutionary Undersampling 346
13.3.1.4 Data Cleaning 346
13.3.1.5 NRSBoundary-SMOTE 346
13.3.1.6 Extreme Learning Machine with Resampling 347
13.3.1.7 Multi-class Imbalance 347
13.3.1.8 Summary 348
13.3.2 Cost-Sensitive Learning Studies 348
13.3.2.1 Cost-Sensitive SVM 348
13.3.2.2 Instance Weighting SVM 348
13.3.2.3 Cost-Sensitive Random Forest 349
13.3.2.4 Cost-Sensitive Fuzzy Rule Based Classification System (FRBCS) 349
13.3.2.5 Summary 350
13.3.3 Applications on Imbalanced Big Data 350
13.3.3.1 Pairwise Ortholog Detection 350
13.3.3.2 Traffic Accidents Prediction 351
13.3.3.3 Biomedical Data 351
13.3.3.4 Human Activity Recognition 352
13.3.3.5 Fraud Detection 352
13.3.3.6 Summary 352
13.4 Challenges for Imbalanced Big Data Classification 353
13.5 Summarizing Comments 354
References 355
14 Software and Libraries for Imbalanced Classification 359
14.1 Introduction 359
14.2 Java Tools 360
14.2.1 KEEL Software Suite 361
14.2.2 Weka 363
14.3 R Packages 366
14.3.1 Package Unbalanced 366
14.3.2 Package Smotefamily 368
14.3.3 Package ROSE 369
14.3.4 Package DMwR 370
14.3.5 Package Imbalance 371
14.3.6 Package mlr: Cost-Sensitive Classification 375
14.4 Python Libraries 377
14.5 Big Data Software: Spark Packages 379
14.6 Summarizing Comments 382
References 383
Publication date (per publisher) | 22.10.2018 |
---|---|
Additional information | XVIII, 377 p. 71 illus., 50 illus. in color. |
Place of publication | Cham |
Language | English |
Subject area | Mathematics / Computer Science ► Computer Science ► Web / Internet |
Keywords | Big Data • classification • cost-sensitive learning • Data Mining • Data preprocessing • Data reduction • Data streams • dimensionality reduction • ensemble learning • Imbalanced Data • machine learning |
ISBN-10 | 3-319-98074-2 / 3319980742 |
ISBN-13 | 978-3-319-98074-4 / 9783319980744 |
Size: 11.3 MB
DRM: digital watermark
This eBook contains a digital watermark and is therefore personalized to you. If the eBook is improperly passed on to third parties, the copy can be traced back to its source.
File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly well suited to reference books with columns, tables, and figures. A PDF can be displayed on almost all devices, but is only of limited use on small displays (smartphones, e-readers).
System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You will need a PDF viewer, e.g. Adobe Reader or Adobe Digital Editions.
eReader: This eBook can be read on (almost) all eBook readers. However, it is not compatible with the Amazon Kindle.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You will need a PDF viewer, e.g. the free Adobe Digital Editions app.
Additional feature: online reading
In addition to downloading it, you can also read this eBook online in your web browser.
Buying eBooks from abroad
For tax law reasons, we can only sell eBooks within Germany and Switzerland. Unfortunately, we cannot fulfill eBook orders from other countries.