Data Mining (eBook)

Special Issue in Annals of Information Systems

Robert Stahlbock, Sven F. Crone, Stefan Lessmann (Editors)

eBook Download: PDF

2009
XIII, 387 pages
Springer US (Publisher)
978-1-4419-1280-0 (ISBN)

This special issue of Annals of Information Systems contains original papers and substantial extensions of selected papers from the 2007 and 2008 International Conference on Data Mining (DMIN'07 and DMIN'08, Las Vegas, NV), all of which have been rigorously peer-reviewed. The issue brings together topics from both information systems and data mining, and aims to give the reader a current snapshot of contemporary research and state-of-the-art practice in data mining.

Over the course of the last twenty years, research in data mining has seen a substantial increase in interest, attracting original contributions from various disciplines including computer science, statistics, operations research, and information systems. Data mining supports a wide range of applications, from medical decision making, bioinformatics, web-usage mining, and text and image recognition to prominent business applications in corporate planning, direct marketing, and credit scoring. Research in information systems equally reflects this inter- and multidisciplinary approach, thereby advocating a series of papers at the intersection of data mining and information systems research.

Preface
Contents
1 Data Mining and Information Systems: Quo Vadis?
Robert Stahlbock, Stefan Lessmann, and Sven F. Crone
1.1 Introduction
1.2 Special Issues in Data Mining
1.2.1 Confirmatory Data Analysis
1.2.2 Knowledge Discovery from Supervised Learning
1.2.3 Classification Analysis
1.2.4 Hybrid Data Mining Procedures
1.2.5 Web Mining
1.2.6 Privacy-Preserving Data Mining
1.3 Conclusion and Outlook
References
Part I Confirmatory Data Analysis
2 Response-Based Segmentation Using Finite Mixture Partial Least Squares
Christian M. Ringle, Marko Sarstedt, and Erik A. Mooi
2.1 Introduction
2.1.1 On the Use of PLS Path Modeling
2.1.2 Problem Statement
2.1.3 Objectives and Organization
2.2 Partial Least Squares Path Modeling
2.3 Finite Mixture Partial Least Squares Segmentation
2.3.1 Foundations
2.3.2 Methodology
2.3.3 Systematic Application of FIMIX-PLS
2.4 Application of FIMIX-PLS
2.4.1 On Measuring Customer Satisfaction
2.4.2 Data and Measures
2.4.3 Data Analysis and Results
2.5 Summary and Conclusion
References
Part II Knowledge Discovery from Supervised Learning
3 Building Acceptable Classification Models
David Martens and Bart Baesens
3.1 Introduction
3.2 Comprehensibility of Classification Models
3.2.1 Measuring Comprehensibility
3.2.2 Obtaining Comprehensible Classification Models
3.2.2.1 Building Rule-Based Models
3.2.2.2 Combining Output Types
3.2.2.3 Visualization
3.3 Justifiability of Classification Models
3.3.1 Taxonomy of Constraints
3.3.2 Monotonicity Constraint
3.3.3 Measuring Justifiability
3.3.4 Obtaining Justifiable Classification Models
3.4 Conclusion
References
4 Mining Interesting Rules Without Support Requirement: A General Universal Existential Upward Closure Property
Yannick Le Bras, Philippe Lenca, and Stéphane Lallich
4.1 Introduction
4.2 State of the Art
4.3 An Algorithmic Property of Confidence
4.3.1 On UEUC Framework
4.3.2 The UEUC Property
4.3.3 An Efficient Pruning Algorithm
4.3.4 Generalizing the UEUC Property
4.4 A Framework for the Study of Measures
4.4.1 Adapted Functions of Measure
4.4.1.1 Association Rules
4.4.1.2 Contingency Tables
4.4.1.3 Minimal Joint Domain
4.4.2 Expression of a Set of Measures of Ddconf
4.5 Conditions for GUEUC
4.5.1 A Sufficient Condition
4.5.2 A Necessary Condition
4.5.3 Classification of the Measures
4.6 Conclusion
References
5 Classification Techniques and Error Control in Logic Mining
Giovanni Felici, Bruno Simeone, and Vincenzo Spinelli
5.1 Introduction
5.2 Brief Introduction to Box Clustering
5.3 BC-Based Classifier
5.4 Best Choice of a Box System
5.5 Bi-criterion Procedure for BC-Based Classifier
5.6 Examples
5.6.1 The Data Sets
5.6.2 Experimental Results with BC
5.6.3 Comparison with Decision Trees
5.7 Conclusions
References
Part III Classification Analysis
6 An Extended Study of the Discriminant Random Forest
Tracy D. Lemmond, Barry Y. Chen, Andrew O. Hatch, and William G. Hanley
6.1 Introduction
6.2 Random Forests
6.3 Discriminant Random Forests
6.3.1 Linear Discriminant Analysis
6.3.2 The Discriminant Random Forest Methodology
6.4 DRF and RF: An Empirical Study
6.4.1 Hidden Signal Detection
6.4.1.1 Training on T1, Testing on J2
6.4.1.2 Prediction Performance for J2 with Cross-validation
6.4.2 Radiation Detection
6.4.3 Significance of Empirical Results
6.4.4 Small Samples and Early Stopping
6.4.5 Expected Cost
6.5 Conclusions
References
7 Prediction with the SVM Using Test Point Margins
Süreyya Özögür-Akyüz, Zakria Hussain, and John Shawe-Taylor
7.1 Introduction
7.2 Methods
7.3 Data Set Description
7.4 Results
7.5 Discussion and Future Work
References
8 Effects of Oversampling Versus Cost-Sensitive Learning for Bayesian and SVM Classifiers
Alexander Liu, Cheryl Martin, Brian La Cour, and Joydeep Ghosh
8.1 Introduction
8.2 Resampling
8.2.1 Random Oversampling
8.2.2 Generative Oversampling
8.3 Cost-Sensitive Learning
8.4 Related Work
8.5 A Theoretical Analysis of Oversampling Versus Cost-Sensitive Learning
8.5.1 Bayesian Classification
8.5.2 Resampling Versus Cost-Sensitive Learning in Bayesian Classifiers
8.5.3 Effect of Oversampling on Gaussian Naive Bayes
8.5.3.1 Random Oversampling
8.5.3.2 Generative Oversampling
8.5.3.3 Comparison to Cost-Sensitive Learning
8.5.4 Effects of Oversampling for Multinomial Naive Bayes
8.6 Empirical Comparison of Resampling and Cost-Sensitive Learning
8.6.1 Explaining Empirical Differences Between Resampling and Cost-Sensitive Learning
8.6.2 Naive Bayes Comparisons on Low-Dimensional Gaussian Data
8.6.2.1 Gaussian Naive Bayes on Artificial, Low-Dimensional Data
8.6.2.2 A Note on ROC and AUC
8.6.2.3 Gaussian Naive Bayes on Real, Low-Dimensional Data
8.6.3 Multinomial Naive Bayes
8.6.4 SVMs
8.6.5 Discussion
8.7 Conclusion
Appendix
References
9 The Impact of Small Disjuncts on Classifier Learning
Gary M. Weiss
9.1 Introduction
9.2 An Example: The Vote Data Set
9.3 Description of Experiments
9.4 The Problem with Small Disjuncts
9.5 The Effect of Pruning on Small Disjuncts
9.6 The Effect of Training Set Size on Small Disjuncts
9.7 The Effect of Noise on Small Disjuncts
9.8 The Effect of Class Imbalance on Small Disjuncts
9.9 Related Work
9.10 Conclusion
References
Part IV Hybrid Data Mining Procedures
10 Predicting Customer Loyalty Labels in a Large Retail Database: A Case Study in Chile
Cristián J. Figueroa
10.1 Introduction
10.2 Related Work
10.3 Objectives of the Study
10.3.1 Supervised and Unsupervised Learning
10.3.2 Unsupervised Algorithms
10.3.2.1 Self-Organizing Map
10.3.2.2 Sammon Mapping
10.3.2.3 Curvilinear Component Analysis
10.3.3 Variables for Segmentation
10.3.4 Exploratory Data Analysis
10.3.5 Results of the Segmentation
10.4 Results of the Classifier
10.5 Business Validation
10.5.1 In-Store Minutes Charges for Prepaid Cell Phones
10.5.2 Distribution of Products in the Store
10.6 Conclusions and Discussion
Appendix
References
11 PCA-Based Time Series Similarity Search
Leonidas Karamitopoulos, Georgios Evangelidis, and Dimitris Dervos
11.1 Introduction
11.2 Background
11.2.1 Review of PCA
11.2.2 Implications of PCA in Similarity Search
11.2.3 Related Work
11.3 Proposed Approach
11.4 Experimental Methodology
11.4.1 Data Sets
11.4.2 Evaluation Methods
11.4.3 Rival Measures
11.5 Results
11.5.1 1-NN Classification
11.5.2 k-NN Similarity Search
11.5.3 Speeding Up the Calculation of APEdist
11.6 Conclusion
References
12 Evolutionary Optimization of Least-Squares Support Vector Machines
Arjan Gijsberts, Giorgio Metta, and Léon Rothkrantz
12.1 Introduction
12.2 Kernel Machines
12.2.1 Least-Squares Support Vector Machines
12.2.2 Kernel Functions
12.2.2.1 Conditions for Kernels
12.3 Evolutionary Computation
12.3.1 Genetic Algorithms
12.3.2 Evolution Strategies
12.3.3 Genetic Programming
12.4 Related Work
12.4.1 Hyperparameter Optimization
12.4.2 Combined Kernel Functions
12.5 Evolutionary Optimization of Kernel Machines
12.5.1 Hyperparameter Optimization
12.5.2 Kernel Construction
12.5.3 Objective Function
12.6 Results
12.6.1 Data Sets
12.6.2 Results for Hyperparameter Optimization
12.6.3 Results for EvoKMGP
12.7 Conclusions and Future Work
References
13 Genetically Evolved kNN Ensembles
Ulf Johansson, Rikard König, and Lars Niklasson
13.1 Introduction
13.2 Background and Related Work
13.3 Method
13.3.1 Data sets
13.4 Results
13.5 Conclusions
References
Part V Web-Mining
14 Behaviorally Founded Recommendation Algorithm for Browsing Assistance Systems
Peter Géczy, Noriaki Izumi, Shotaro Akaho, and Kôiti Hasida
14.1 Introduction
14.1.1 Related Works
14.1.2 Our Contribution and Approach
14.2 Concept Formalization
14.3 System Design
14.3.1 A Priori Knowledge of Human–System Interactions
14.3.2 Strategic Design Factors
14.3.3 Recommendation Algorithm Derivation
14.4 Practical Evaluation
14.4.1 Intranet Portal
14.4.2 System Evaluation
14.4.3 Practical Implications and Limitations
14.5 Conclusions and Future Work
References
15 Using Web Text Mining to Predict Future Events: A Test of the Wisdom of Crowds Hypothesis
Scott Ryan and Lutz Hamel
15.1 Introduction
15.2 Method
15.2.1 Hypotheses and Goals
15.2.2 General Methodology
15.2.3 The 2006 Congressional and Gubernatorial Elections
15.2.4 Sporting Events and Reality Television Programs
15.2.5 Movie Box Office Receipts and Music Sales
15.2.6 Replication
15.3 Results and Discussion
15.3.1 The 2006 Congressional and Gubernatorial Elections
15.3.2 Sporting Events and Reality Television Programs
15.3.3 Movie and Music Album Results
15.4 Conclusion
References
Part VI Privacy-Preserving Data Mining
16 Avoiding Attribute Disclosure with the (Extended) p-Sensitive k-Anonymity Model
Traian Marius Truta and Alina Campan
16.1 Introduction
16.2 Privacy Models and Algorithms
16.2.1 The p-Sensitive k-Anonymity Model and Its Extension
16.2.2 Algorithms for the p-Sensitive k-Anonymity Model
16.3 Experimental Results
16.3.1 Experiments for p-Sensitive k-Anonymity
16.3.2 Experiments for Extended p-Sensitive k-Anonymity
16.4 New Enhanced Models Based on p-Sensitive k-Anonymity
16.4.1 Constrained p-Sensitive k-Anonymity
16.4.2 p-Sensitive k-Anonymity in Social Networks
16.5 Conclusions and Future Work
References
17 Privacy-Preserving Random Kernel Classification of Checkerboard Partitioned Data
Olvi L. Mangasarian and Edward W. Wild
17.1 Introduction
17.2 Privacy-Preserving Linear Classifier for Checkerboard Partitioned Data
17.3 Privacy-Preserving Nonlinear Classifier for Checkerboard Partitioned Data
17.4 Computational Results
17.5 Conclusion and Outlook
References

Publication date (per publisher)	10 November 2009
Series	Annals of Information Systems
Additional info	XIII, 387 p.
Place of publication	New York
Language	English
Subject areas	Computer Science ► Databases ► Data Warehouse / Data Mining
	Engineering
	Business ► General / Reference
	Business ► Business Administration / Management ► Planning / Organization
	Business ► Business Administration / Management ► Corporate Management
Keywords	Business Intelligence • Classification • Data Analysis • Data Mining • Engineering Economics • Knowledge Discovery • Statistics • Text Mining
ISBN-10	1-4419-1280-0 / 1441912800
ISBN-13	978-1-4419-1280-0 / 9781441912800

PDF (watermark)
Size: 8.0 MB

DRM: Digital watermark
This eBook contains a digital watermark and is therefore personalized for you. If the eBook is improperly passed on to third parties, it can be traced back to the source.

File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly well suited to technical books with columns, tables, and figures. A PDF can be displayed on almost any device, but is only of limited suitability for small displays (smartphone, eReader).

System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You need a PDF viewer, e.g. Adobe Reader or Adobe Digital Editions.
eReader: This eBook can be read on (almost) all eBook readers. It is not, however, compatible with the Amazon Kindle.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You need a PDF viewer, e.g. the free Adobe Digital Editions app.

Additional feature: Read online
In addition to downloading it, you can also read this eBook online in your web browser.

Buying eBooks from abroad
For tax reasons, we can sell eBooks only within Germany and Switzerland. Unfortunately, we cannot fulfill eBook orders from other countries.

Print edition

Book | Softcover

CHF 247.15