Data Analysis, Machine Learning and Applications (eBook)
XVI, 719 pages
Springer Berlin (publisher)
978-3-540-78246-9 (ISBN)
Data analysis and machine learning are research areas at the intersection of computer science, artificial intelligence, mathematics and statistics. They cover general methods and techniques that can be applied to a vast set of applications such as web and text mining, marketing, medical science, bioinformatics and business intelligence. This volume contains the revised versions of selected papers in the field of data analysis, machine learning and applications presented during the 31st Annual Conference of the German Classification Society (Gesellschaft für Klassifikation - GfKl). The conference was held at the Albert-Ludwigs-University in Freiburg, Germany, in March 2007.
Preface 6
Contents 10
Part I Classification 19
Distance-based Kernels for Real-valued Data 20
1 Introduction 20
2 Kernels and similarities defined on real numbers 21
3 Semantics and applicability 22
4 Truncated Euclidean similarity 23
5 Canberra distance-based similarity 24
6 Kernels defined on real vectors 25
7 Conclusions 27
References 27
Fast Support Vector Machine Classification of Very Large Datasets 28
1 Introduction 28
2 Linear SVM trees 30
3 Non-linear extension 33
4 Experiments 33
5 Conclusion 34
References 35
Fusion of Multiple Statistical Classifiers 36
1 Introduction 36
2 Classifier fusion 37
3 Diversity of ensemble members 37
4 Combination rules 39
5 Open problems 41
6 Results of experiments 41
7 Conclusions 43
References 43
Calibrating Margin-based Classifier Scores into Polychotomous Probabilities 46
1 Introduction 46
2 Reduction to binary problems 47
3 Coupling probability estimates 47
4 Dirichlet calibration 48
5 Comparison 50
6 Conclusion 53
References 53
Classification with Invariant Distance Substitution Kernels 54
1 Introduction 54
2 Background 55
3 Adjustable invariance 57
4 Positive definiteness 58
5 Classification experiments 60
6 Conclusion 61
References 61
Applying the Kohonen Self-organizing Map Networks to Select Variables 62
1 Introduction 62
2 A proposition to reduce the number of variables 63
3 Applications and results 67
4 Conclusions 69
References 71
Computer Assisted Classification of Brain Tumors 72
1 Introduction 72
2 Algorithms 73
3 Results 75
4 Conclusions 76
References 76
Model Selection in Mixture Regression Analysis – A Monte Carlo Simulation Study 78
1 Introduction 78
2 Model selection in mixture models 79
3 Simulation design 80
4 Results summary 81
5 Key contributions and future research directions 83
References 84
Comparison of Local Classification Methods 86
1 Introduction 86
2 Local classification methods 87
3 Simulation study 89
4 Summary 93
References 93
Incorporating Domain Specific Information into Gaia Source Classification 94
1 Introduction 94
2 Classification and parametrization 95
3 Classification results 96
4 Summary 99
References 100
Identification of Noisy Variables for Nonmetric and Symbolic Data in Cluster Analysis 102
1 Introduction 102
2 Characteristics of the HINoV method and its modifications 103
3 Simulation models 104
4 Discussion on the simulation results 105
5 Conclusions 107
References 109
Part II Clustering 110
Families of Dendrograms 112
1 Introduction 112
2 A brief introduction to p-adic geometry 113
3 p-adic dendrograms 115
4 The space of dendrograms 116
5 Distributions on dendrograms 117
6 Hidden vertices 118
7 Conclusions 118
Acknowledgements 119
References 119
Mixture Models in Forward Search Methods for Outlier Detection 120
1 Introduction 120
2 The Forward Search 121
3 Forward Search and Normal Mixture Models: the graphical approach 122
4 Forward Search and Normal Mixture Models: the inferential approach 123
5 Concluding remarks and open issues 126
References 127
On Multiple Imputation Through Finite Gaussian Mixture Models 128
1 Introduction 128
2 Multiple imputation 129
3 Label switching 132
4 Simulation study and results 133
References 134
Mixture Model Based Group Inference in Fused Genotype and Phenotype Data 136
1 Introduction 136
2 Methods 137
3 Results 140
4 Discussion 142
5 Acknowledgements 142
References 142
The Noise Component in Model-based Cluster Analysis 144
1 Introduction 144
2 Two variations on the noise component 149
3 Some theory 150
4 The EM-algorithm 152
5 Simulations 152
6 Conclusion 154
References 154
An Artificial Life Approach for Semi-supervised Learning 156
1 Introduction 156
2 Artificial life 157
3 Semi-supervised artificial life 159
4 Semi-Supervised artificial life for cluster analysis 160
5 Experimental settings and results 160
6 Discussion 161
7 Summary 162
References 163
Hard and Soft Euclidean Consensus Partitions 164
1 Introduction 164
2 Theory 166
3 Applications 168
References 170
Rationale Models for Conceptual Modeling 172
1 Subjectivism in the modeling process 172
2 The design rationale approach 173
3 Classification of rationale fragments 175
4 Conclusion 178
References 179
Measures of Dispersion and Cluster-Trees for Categorical Data 180
1 Motivation 180
2 Measures of dispersion 181
3 Segmentation 185
References 186
Information Integration of Partially Labeled Data 188
1 Introduction 188
2 Related work 189
3 Four problem classes 189
4 Method 191
5 Evaluation 194
6 Conclusion 195
References 196
Part III Multidimensional Data Analysis 198
Data Mining of an On-line Survey - A Market Research Application 200
1 Introduction 200
2 Data and objectives 200
3 Methodology and results 201
4 Conclusions 207
References 208
Nonlinear Constrained Principal Component Analysis in the Quality Control Framework 210
1 Introduction 210
2 Constrained principal component analysis 211
3 Nonlinear Constrained Principal Component Analysis 212
4 Stability analysis 214
5 Results and interpretation 214
6 Concluding remarks 216
References 217
Non Parametric Control Chart by Multivariate Additive Partial Least Squares via Spline 218
1 Introduction 218
2 Multivariate control charts based on projection methods 219
3 Application: monitoring the painting process of hot-rolled aluminium foils 222
4 Conclusion 224
References 224
Simple Non Symmetrical Correspondence Analysis 226
1 Introduction 226
2 Non symmetrical correspondence analysis 227
3 Simple non symmetrical correspondence analysis 228
4 Father’s and son’s occupations data 230
5 Conclusions 232
References 234
Factorial Analysis of a Set of Contingency Tables 236
1 Introduction 236
2 Methodology 237
3 Application 240
4 Discussion 242
5 Software notes 243
References 243
Part IV Analysis of Complex Data 244
Graph Mining: Repository vs. Canonical Form 246
1 Introduction 246
2 Canonical form pruning 247
3 Repository of processed subgraphs 248
4 Comparison 250
5 Experiments 251
6 Summary 252
References 253
Classification and Retrieval of Ancient Watermarks 254
1 Introduction 254
2 Feature extraction 255
3 Results 257
4 Conclusion 261
References 261
Segmentation and Classification of Hyper-Spectral Skin Data 262
1 Introduction 262
2 Labelling 263
3 Classification 265
4 Results 266
5 Conclusion 268
References 269
FSMTree: An Efficient Algorithm for Mining Frequent Temporal Patterns 270
1 Introduction 270
2 Foundations and related work 271
3 Algorithms FSMSet and FSMTree 273
4 Performance evaluation and conclusions 276
References 277
A Matlab Toolbox for Music Information Retrieval 278
1 Motivation and approach 278
2 Feature extraction 279
3 Data analysis 282
4 Application to the study of music and emotion 283
References 284
A Probabilistic Relational Model for Characterizing Situations in Dynamic Multi-Agent Systems 286
1 Introduction 286
2 Framework for modeling and recognizing situations 287
3 Modeling situations 288
4 Recognizing situations 289
5 Evaluation 291
6 Conclusions and further work 292
References 293
Applying the Qn Estimator Online 294
1 Introduction 294
2 An update algorithm for the Qn and the HL estimator 295
3 Comparative study 299
4 Conclusions 300
References 301
A Comparative Study on Polyphonic Musical Time Series Using MCMC Methods 302
1 Introduction 302
2 Polyphonic model 302
3 Extended polyphonic model 304
4 Results 305
5 Conclusion 308
References 308
Collective Classification for Labeling of Places and Objects in 2D and 3D Range Data 310
1 Introduction 310
2 Related work 311
3 Collective classification 311
4 Feature extraction in 2D maps 313
5 Feature selection 313
6 Experiments 314
7 Conclusions 315
8 Acknowledgment 317
References 317
Lag or Error? - Detecting the Nature of Spatial Correlation 318
1 Introduction 318
2 Model and test statistics 319
3 Monte Carlo study 322
4 Results 322
References 324
Part V Exploratory Data Analysis and Tools for Data Analysis 326
Urban Data Mining Using Emergent SOM 328
1 Introduction 328
2 Inspection and transformation of data 329
3 Method 330
4 Results 332
5 Conclusion 333
References 335
The Konstanz Information Miner 336
1 Overview 336
2 Architecture 337
3 Repository 341
4 Extending 342
5 Conclusion 342
References 343
A Pattern Based Data Mining Approach 344
1 Current situation in data mining 344
2 Introduction to patterns 345
3 Some data mining patterns 347
4 Summary and outlook 349
References 351
A Framework for Statistical Entity Identification in 352
1 Introduction 352
2 Methodological framework 353
3 Implementation 354
4 Conclusion and future work 358
References 359
Combining Several SOM Approaches in Data Mining: Application to ADSL Customer Behaviours Analysis 360
1 Introduction 360
2 Network measurements and data description 361
3 Customer segmentation 363
4 Conclusion 370
References 371
On the Analysis of Irregular Stock Market Trading Behavior 372
1 Introduction 372
2 Irregular trading behavior in a market 373
3 Analysis of trading behavior with complex valued Eigensystem analysis 374
4 Analysis of the dataset 375
5 Conclusion 378
References 379
A Procedure to Estimate Relations in a Balanced Scorecard 380
1 Related work 380
2 Balanced scorecards 381
3 Model 383
4 Case study 384
5 Results 385
6 Conclusion and outlook 387
References 387
The Application of Taxonomies in the Context of Configurative Reference Modelling 390
1 Introduction 390
2 Configurative Reference Modelling and the application of taxonomies 391
3 Conclusion 395
4 Outlook 396
References 396
Two-Dimensional Centrality of a Social Network 398
1 Introduction 398
2 The procedure 399
3 The analysis and the result 399
4 Discussion 401
References 405
Benchmarking Open-Source Tree Learners in R/RWeka 406
1 Introduction 406
2 Design of the benchmark experiment 407
3 Results of the benchmark experiment 409
4 Discussion and further work 412
References 413
From Spelling Correction to Text Cleaning – Using Context Information 414
1 Introduction 414
2 Linguistics and context sensitivity 415
3 Framework for text preparation 416
4 Experimental results 419
5 Conclusion and future work 420
References 421
Root Cause Analysis for Quality Management 422
1 Introduction 422
2 Root Cause Analysis 424
3 Computational results 427
4 Conclusion 428
References 429
Finding New Technological Ideas and Inventions with Text Mining and Technique Philosophy 430
1 Introduction 430
2 A common structure for raw and context information 431
3 Relevant aspects for the text mining approach from technique philosophy 433
4 A text mining approach for new ideas and inventions 435
5 Evaluation and outlook 436
6 Acknowledge 436
References 437
Investigating Classifier Learning Behavior with Experiment Databases 438
1 Introduction 438
2 A database for classification experiments 439
3 The experiments 440
4 Using the database 441
5 Conclusions 445
References 445
Part VI Marketing and Management Science 446
Conjoint Analysis for Complex Services Using Clusterwise Hierarchical Bayes Procedures 448
1 Introduction 448
2 Preference measurement for services 449
3 Hierarchical Bayes procedures for conjoint analysis 449
4 Empirical investigation 450
5 Conclusion and outlook 453
References 454
Building an Association Rules Framework for Target Marketing 456
1 Introduction 456
2 A segment-specific view of cross-category associations 457
3 Methodology 458
4 Empirical application 460
5 Conclusion and future work 463
References 463
AHP versus ACA – An Empirical Comparison 464
1 Preference measurement for complex products 464
2 The Analytic Hierarchy Process – AHP 465
3 Design of the empirical study 467
4 Results 468
5 Conclusions and outlook 470
References 471
On the Properties of the Rank Based Multivariate Exponentially Weighted Moving Average Control Charts 472
1 Introduction 472
2 Data depth 472
3 The proposed control chart 473
4 Effect of the reference sample size on control charts performance 475
5 Conclusion 478
Acknowledgements 479
References 479
Are Critical Incidents Really Critical for a Customer Relationship? A MIMIC Approach 480
1 Introduction 480
2 Hypotheses 481
3 Method 483
4 Results 483
5 Discussion 485
References 486
Heterogeneity in the Satisfaction-Retention Relationship – A Finite-mixture Approach 488
1 Introduction 488
2 The Model 490
3 Discussion 494
References 494
An Early-Warning System to Support Activities in the Management of Customer Equity and How to Obtain the Most from Spatial Customer Equity Potentials 496
1 Introduction 496
2 Strategic customer control dimensions 497
3 Early-warning system 500
4 Empirical example 502
5 Conclusion 503
References 503
Classifying Contemporary Marketing Practices 506
1 Introduction 506
2 Knowledge on interactive marketing 507
3 A Finite Mixture approach for classifying marketing practices 508
4 Empirical application 510
5 Conclusions 513
References 513
Part VII Banking and Finance 514
Predicting Stock Returns with Bayesian Vector Autoregressive Models 516
1 Introduction 516
2 Literature review 517
3 Model 518
4 Empirical study 519
5 Conclusion and outlook 522
References 523
The Evaluation of Venture-Backed IPOs – Certification Model versus Adverse Selection Model, Which Does Fit Better? 524
1 Introduction 524
2 The theoretical background: the certification model and the adverse selection model 525
3 Data set and non-parametric hypothesis tests 526
4 Multivariate investigation tools: Partial Least squares regression model 527
5 Conclusion 530
Acknowledgments 530
References 530
Using Multiple SVM Models for Unbalanced Credit Scoring Data Sets 532
1 Introduction 532
2 SVM models for unbalanced data sets 533
3 Multiple SVM for unbalanced data sets in practice 534
4 Combination of SVM on random input subsets 536
5 Conclusions and outlook 538
References 539
Part VIII Business Intelligence 540
Comparison of Recommender System Algorithms Focusing on the New-item and User-bias Problem 542
1 Introduction 542
2 Related works 543
3 Observed approaches 544
4 Evaluation protocols 546
5 Evaluation and experimental results 546
6 Conclusion 548
References 549
Collaborative Tag Recommendations 550
1 Introduction 550
2 Related work 551
3 Recommender Systems 552
4 Tag Recommender Systems 553
5 Experimental setup and results 554
6 Conclusions 556
7 Acknowledgments 557
References 557
Applying Small Sample Test Statistics for Behavior-based Recommendations 558
1 Introduction 558
2 The ideal decision maker: The decision maker without preferences 559
3 Library meta catalogs: An exemplary application area 560
4 Mathematical notation 561
5 POSICI: Probability Of Single Item Co-Inspections 561
6 POMICI: Probability Of Multiple Items Co-Inspections 562
7 POSICI vs. POMICI 564
8 Conclusions and further research 564
References 565
Part IX Text Mining, Web Mining, and the Semantic Web 568
Classifying Number Expressions in German Corpora 570
1 Introduction 570
2 Classification of number expressions 571
3 Experimental evaluation 574
4 Conclusions and future work 576
References 577
Non-Profit Web Portals - Usage Based Benchmarking for Success Evaluation 578
1 Introduction 578
2 Related work 579
3 Method 580
4 Case study 583
5 Conclusions 584
References 585
Text Mining of Supreme Administrative Court Jurisdictions 586
1 Introduction 586
2 Administrative Supreme Court jurisdictions 587
3 Investigations 587
4 Conclusion 592
References 593
Supporting Web-based Address Extraction with Unsupervised Tagging 594
1 Introduction 594
2 Data preparation 596
3 Unsupervised tagging 596
4 Experiments and evaluation 597
5 Conclusion and further work 600
References 600
A Two-Stage Approach for Context-Dependent Hypernym Extraction 602
1 Introduction 602
2 Document clustering 603
3 Hypernym extraction 604
4 Evaluation 606
5 Conclusion and future work 609
References 609
Analysis of Dwell Times in Web Usage Mining 610
1 Introduction 610
2 Model specification and estimation 611
3 Real life example 614
4 Conclusion 615
References 617
New Issues in Near-duplicate Detection 618
1 Introduction 618
2 Fingerprint construction 620
3 Wikipedia as evaluation corpus 623
4 Summary 625
References 625
Comparing the University of South Florida Homograph Norms with Empirical Corpus Data 628
1 Introduction 628
2 Resources 629
3 Approach 630
4 Results and discussion 632
5 Conclusions and future work 634
Acknowledgments 635
References 635
Content-based Dimensionality Reduction for Recommender Systems 636
1 Introduction 636
2 Related work 637
3 The proposed approach 637
4 Performance study 641
5 Conclusions 643
References 643
Part X Linguistics 644
The Distribution of Data in Word Lists and its Impact on the Subgrouping of Languages 646
1 General situation 646
2 Special situation 647
3 The bias 649
4 Solution and operationalization 651
5 Discussion 651
6 Conclusions 652
References 652
Quantitative Text Analysis Using L-, F- and T-Segments 654
1 Introduction 654
2 Data 655
3 Distribution of segment types 656
4 Length distribution of L-segments 657
5 TTR studies 659
6 Conclusion 661
References 661
Projecting Dialect Distances to Geography: Bootstrap Clustering vs. Noisy Clustering 664
1 Introduction 664
2 Background and motivation 665
3 Bootstrapping clustering 667
4 Clustering with noise 667
5 Projecting to geography 668
6 Results 669
7 Discussion 669
Acknowledgments 670
References 670
Structural Differentiae of Text Types – A Quantitative Model 672
1 Introduction 672
2 Category selection 673
3 The evaluation procedure 674
4 Exploring the structural homogeneity of text types by means of the Iterative Categorisation Procedure (ICP) 675
5 Results 676
6 Discussion 676
7 Conclusion 678
References 679
Part XI Data Analysis in Humanities 680
Scenario Evaluation Using Two-mode Clustering Approaches in Higher Education 682
1 Introduction: Scenario analysis 682
2 Two-Mode clustering (for scenario evaluation) 683
3 Example: Scenario evaluation in higher education 685
4 Conclusions 688
References 688
Visualization and Clustering of Tagged Music Data 690
1 Introduction 690
2 Related work 691
3 Emergent Self Organizing Maps 691
4 Data 692
5 Experimental results 694
6 Conclusion and future work 696
References 696
Effects of Data Transformation on Cluster Analysis of Archaeometric Data 698
1 Introduction 698
2 Data transformation in archaeometry 699
3 Transformation into ranks 700
4 Distances and cluster analysis 701
5 Romano-British vessel glass classified 702
6 Roman bricks and tiles classified 703
7 Summary 704
References 704
Fuzzy PLS Path Modeling: A New Tool For Handling Sensory Data 706
1 Introduction 706
2 Fuzzy PLS path modeling 707
3 Application 710
4 Conclusion 712
References 713
Automatic Analysis of Dewey Decimal Classification Notations 714
1 Introduction 714
2 DDC notations 715
3 Automatic analysis of DDC notations 716
4 Results 719
5 Conclusion 720
References 721
A New Interval Data Distance Based on the Wasserstein Metric 722
1 Introduction 722
2 A brief survey of the existing distances 723
3 Our proposal: Wasserstein distance 724
4 Dynamic clustering algorithm using different criterion functions 726
5 Conclusion and perspectives 727
References 728
Keywords 730
Author Index 734
Publication date (per publisher) | 13.4.2008 |
---|---|
Series | Studies in Classification, Data Analysis, and Knowledge Organization |
Additional info | XVI, 719 p. |
Place of publication | Berlin |
Language | English |
Subject area | Mathematics / Computer Science ► Computer Science ► Databases |
 | Mathematics / Computer Science ► Mathematics ► Statistics |
 | Technology |
 | Economics ► Business Administration / Management ► Business Informatics |
Keywords | Artificial Intelligence • Bioinformatics • Business Intelligence • classification • Clustering • Computer Science • Data Analysis • Intelligence • learning • Linguistics • machine learning • semantic web • service-oriented computing • Web mining |
ISBN-10 | 3-540-78246-X / 354078246X |
ISBN-13 | 978-3-540-78246-9 / 9783540782469 |
Size: 10.7 MB
DRM: Digital watermark
This eBook contains a digital watermark and is thereby personalized for you. If the eBook is improperly passed on to third parties, it can be traced back to the source.
File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly suited to technical books with columns, tables and figures. A PDF can be displayed on almost all devices, but is only of limited suitability for small displays (smartphone, eReader).
System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You need a PDF viewer, e.g. Adobe Reader or Adobe Digital Editions.
eReader: This eBook can be read on (almost) all eBook readers. It is, however, not compatible with the Amazon Kindle.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You need a PDF viewer, e.g. the free Adobe Digital Editions app.
Additional feature: Online reading
In addition to downloading it, you can also read this eBook online in your web browser.
Buying eBooks from abroad
For tax reasons, we can sell eBooks only within Germany and Switzerland. Unfortunately, we cannot fulfill eBook orders from other countries.