Classification - the Ubiquitous Challenge (eBook)
XX, 704 pages
Springer Berlin (publisher)
978-3-540-28084-2 (ISBN)
Preface 6
Contents 14
Part I (Semi-) Plenary Presentations 22
Classification and Data Mining in Musicology 24
1 Introduction 24
2 Music, 1/f-noise, fractal and chaos 24
3 Music and entropy 25
4 Score information and performance 27
Acknowledgements 28
References 31
Bayesian Mixed Membership Models for Soft Clustering and Classification 32
1 Introduction 32
2 Mixed membership models 35
3 Disability types among older adults 37
3.1 National Long Term Care Survey 37
3.2 Applying the mixed membership model 38
4 Classifying publications by topic 39
4.1 Proceedings of the National Academy of Sciences 39
4.2 Applying the mixed membership model 40
4.3 An alternative approach with related data 43
4.4 Choosing K to describe PNAS topics 43
5 Summary and concluding remarks 44
Acknowledgments 45
References 45
Predicting Protein Secondary Structure with Markov Models 48
1 Introduction 48
2 The method 49
3 Improvements 50
4 Ongoing research 53
5 Summary 54
References 54
Milestones in the History of Data Visualization: A Case Study in Statistical Historiography 55
1 Introduction 55
1.1 The Milestones Project 56
2 Milestones tour 57
2.1 1600-1699: Measurement and theory 57
2.2 1700-1799: New graphic forms 58
2.3 1800-1850: Beginnings of modern graphics 58
2.4 1850-1900: The Golden Age of statistical graphics 60
2.5 1900-1950: The modern dark ages 60
2.6 1950-1975: Re-birth of data visualization 61
3 Problems and methods in statistical historiography 62
3.1 What counts as a Milestone? 62
3.2 Who gets credit? 63
3.3 Dating milestones 63
3.4 What is milestones “data” 64
3.5 Analyzing milestones “data” 64
3.6 What was he thinking?: Understanding through reproduction 64
3.7 What kinds of tools are needed? 65
4 How to visualize a history? 66
4.1 Lessons from the past 67
4.2 Lessons from the present 68
4.3 Lessons from the web 69
4.4 Lessons from the data visualization 70
Acknowledgments 70
References 71
Quantitative Text Typology: The Impact of Word Length 74
1 Introduction: Structuring the universe of texts 74
1.1 Classification and quantification 74
1.2 Quantitative text analysis: From a definition of the basics towards data homogeneity 75
1.3 Word length in a synergetic context 76
1.4 Qualitative and quantitative classifications: A priori and a posteriori 77
2 A case study: Classifying 398 Slovenian texts 78
2.1 Post hoc analysis of mean word length 80
2.2 Discriminant analyses: The whole corpus 80
2.3 From four to two letter types 81
2.4 Towards a new typology 82
2.5 Conclusion 85
References 85
Cluster Ensembles 86
1 Introduction 86
2 Consensus partitions 88
3 Extensions 91
References 92
Bootstrap Confidence Intervals for Three-way Component Methods 94
1 Introduction 94
2 The bootstrap for fully determined solutions 95
3 Smaller bootstrap intervals using transformations 98
4 Performance of bootstrap confidence intervals 99
5 An application: Bootstrap confidence intervals for results from a Tucker3 Analysis 100
6 Discussion 102
References 104
Organising the Knowledge Space for Software Components 106
1 Introduction 106
2 The software development process 107
3 A knowledge space for software development 109
4 Organising the knowledge space 110
4.1 Ontologies 110
4.2 A discovery and composition ontology 111
4.3 Description of components 112
4.4 Discovery and composition of components 113
5 Conclusions 116
References 116
Multimedia Pattern Recognition in Soccer Video Using Time Intervals 118
1 Introduction 118
2 Multimedia event classification framework 119
2.1 Pattern representation 120
2.2 Pattern classification 122
3 Highlight event classification in soccer broadcasts 124
4 Evaluation 126
4.1 Evaluation criteria 126
4.2 Classification results 127
5 Conclusion 128
References 129
Quantitative Assessment of the Responsibility for the Disease Load in a Population 130
1 Introduction 130
2 Basic definitions of attributable risk 131
3 Crude and adjusted attributable risk 132
4 Sequential attributable risk 133
5 Partial attributable risk 134
6 Illustrative example: The G.R.I.P.S. Study 135
7 Conclusion 136
Acknowledgment 137
References 137
Part II Classification and Data Analysis 140
Bootstrapping Latent Class Models 142
1 Introduction 142
2 Bootstrap analysis 143
3 Bootstrap analysis in finite mixture models 144
4 An application to the latent class model 145
5 Conclusion 149
References 149
Dimensionality of Random Subspaces 150
1 Introduction 150
2 Model aggregation 151
3 Random Subspace Method 152
4 Feature selection for ensembles 153
5 Proposed method 154
6 Related work 154
7 Experiments 155
8 Summary 156
References 156
Two-stage Classification with Automatic Feature Selection for an Industrial Application 158
1 Introduction 158
2 Two-stage classification 159
2.1 Motivation 159
2.2 First stage – object classification 160
2.3 Second stage – image sequence classification 160
2.4 Polynomial classifier 160
3 System optimization 161
3.1 Wrapper approach 161
3.2 Search strategies in feature subsets 162
3.3 Efficiency 162
4 Experimental results 163
5 Conclusion and outlook 165
References 165
Bagging, Boosting and Ordinal Classification 166
1 Introduction 166
2 Aggregating classifiers 166
3 Ordinal prediction 168
4 Empirical studies 171
5 Concluding remarks 172
References 173
A Method for Visual Cluster Validation 174
1 Introduction 174
2 Optimal projection for separation 176
3 Optimal projection for heterogeneity 177
4 Example 178
5 Conclusion 181
References 181
Empirical Comparison of Boosting Algorithms 182
1 Introduction 182
2 Arcing algorithms 183
2.1 Adaboost 183
2.2 Arcing family 185
3 Empirical study 185
3.1 Base classifier and performance measure 186
3.2 Results 186
4 Conclusion 187
References 188
Iterative Majorization Approach to the Distance-based Discriminant Analysis 189
1 Introduction 189
2 Problem formulation 190
3 Iterative majorization 191
4 Dimensionality reduction and multiple-class setting 193
5 Experimental results 194
References 196
An Extension of the CHAID Tree-based Segmentation Algorithm to Multiple Dependent Variables 197
1 Background and summary of approach 197
2 The CHAID algorithm 198
3 Latent class modeling 200
4 The hybrid CHAID algorithm 201
5 Empirical example 202
6 Final comments 203
References 204
Expectation of Random Sets and the ‘Mean Values’ of Interval Data 205
1 Introduction 205
2 Reduction to characteristic points 206
3.1 The Aumann expectation 207
3.2 The Frechet expectation 207
3.3 The Doss expectation 208
3.4 The Vorob’ev expectation 208
4 Expectations of Random Closed Rectangles 209
4.1 The Aumann expectation 209
4.2 The Frechet expectation 211
4.3 The Doss expectation 211
4.4 The Vorob’ev expectation 211
5 Discussion 212
References 212
Experimental Design for Variable Selection in Data Bases 213
1 Introduction 213
2 Data 214
3 Plackett-Burman designs 214
4 Results 216
4.1 Stepwise regression by forward selection 216
4.2 Classification methods 216
4.3 Variable assessment 217
5 Conclusion 220
References 220
KMC/EDAM: A New Approach for the Visualization of K-Means Clustering Results 221
1 Introduction 221
2 Methods 222
2.1 Preliminaries 222
2.2 Basic idea 223
2.3 KMC/EDAM 223
3 Examples 225
4 Conclusion 228
References 228
Clustering of Variables with Missing Data: Application to Preference Studies 229
1 Introduction 229
2 Clustering of variables around latent components 230
3 Imputation methods 230
3.1 Direct imputation methods 230
3.2 Imputation within each cluster 230
3.3 Method based on a cross-partition 231
4 Illustration: data set ’jam’ 232
5 Simulation study 233
5.1 Jam data set 233
5.2 Simulated data 233
5.3 Criterion for comparison 233
5.4 Results 234
6 Conclusion 236
Acknowledgment 236
References 236
Binary On-line Classification Based on Temporally Integrated Information 237
1 General framework 237
1.1 Data format 238
1.2 On-line classification 238
2 Integration of information across time 239
3 Application 240
3.1 Neurophysiology 241
3.2 Model 241
3.3 Results 243
References 244
Different Subspace Classification 245
1 Introduction 245
2 Notation and method 246
2.1 Characteristic regions 246
2.2 Classification rule 247
3 Visualization 248
4 Parameter choice for DiSCo 249
4.1 Building the regions 249
4.2 Optimizing the thresholds 250
5 Simulation study 250
5.1 Data generation 250
5.2 Results 251
6 Summary 252
References 252
Density Estimation and Visualization for Data Containing Clusters of Unknown Structure 253
1 Introduction 253
2 Information optimal sets, Pareto Radius, PDE 254
3 PDE in one dimension: PDEplot 256
4 Measuring and visualization of density of high dimensional data 257
5 Summary 259
References 260
Hierarchical Mixture Models for Nested Data Structures 261
1 Introduction 261
2 Model formulation 262
2.1 Standard finite mixture model 262
2.2 Hierarchical finite mixture model 262
3 Maximum likelihood estimation by an adapted EM algorithm 264
4 An empirical example 265
5 Variants and extensions 267
References 268
Iterative Proportional Scaling Based on a Robust Start Estimator 269
1 Introduction 269
2 Covariance selection models 270
3 Iterative proportional scaling (IPS) 271
4 IPS robustified 272
5 Model selection with RIPS 273
6 Open questions 276
References 276
Exploring Multivariate Data Structures with Local Principal Curves 277
1 Introduction 277
2 Local principal curves 278
3 Simulated data examples 281
5 Conclusion 283
References 283
A Three-way Multidimensional Scaling Approach to the Analysis of Judgments About Persons 285
1 Introduction 285
2 The structure of judgments about persons 285
3 ‘SUMM–ID’ model 286
4 Application 290
5 Concluding remarks 291
References 292
Discovering Temporal Knowledge in Multivariate Time Series 293
1 Introduction 293
2 Data 294
3 Unification-based Temporal Grammar 294
4 Time Series Knowledge Mining 296
5 Discussion 298
6 Summary 299
Acknowledgements 299
References 300
A New Framework for Multidimensional Data Analysis 301
1 Information in data 301
2 Illustrative example 302
3 Geometric model for categorical data 304
4 Squared item-component correlation 304
5 Correlation between multidimensional variables 305
6 Decomposition of information in data and total information 306
7 Conclusion 307
References 308
External Analysis of Two-mode Three-way Asymmetric Multidimensional Scaling 309
1 Introduction 309
2 The method 310
3 An application 311
4 Discussion 314
References 316
The Relevance Vector Machine Under Covariate Measurement Error 317
1 Introduction 317
2 Nonparametric regression using the RVM 318
2.1 The RVM model setup 318
2.2 Inference 319
3 Covariate measurement error and its correction 320
3.1 The classical error model 320
3.2 Error correction using regression calibration 320
3.3 Error correction using SIMEX 321
3.4 Simulation results for the SIMEX 322
4 Discussion 323
Acknowledgements 324
References 324
Part III Applications 326
A Contribution to the History of Seriation in Archaeology 328
1 Introduction 328
2 The early years 328
3 Mathematical models 329
4 The method of Brainerd and Robinson 330
5 Permutation search 331
6 Towards correspondence analysis 332
References 335
Model-based Cluster Analysis of Roman Bricks and Tiles from Worms and Rheinzabern 338
1 Introduction and task 338
2 Model-based Gaussian clustering 340
3 Results and archaeological discussion 342
4 Conclusion 345
References 345
Astronomical Object Classification and Parameter Estimation with the Gaia Galactic Survey Satellite 346
1 The Gaia Galactic survey mission 346
2 Astrophysical data 346
3 Classification challenges 347
4 Outlook 348
References 349
Design of Astronomical Filter Systems for Stellar Classification Using Evolutionary Algorithms 351
1 Astrophysical context 351
2 The optimization model 352
2.1 Parametrization 352
2.2 Figure-of-merit (fitness) 353
2.3 Evolutionary algorithm 354
3 Application, results and interpretation 355
4 Conclusions and future work 358
References 358
Analyzing Microarray Data with the Generative Topographic Mapping Approach 359
1 Introduction 359
2 Data structure 360
3 The GTM approach 361
4 Application to a data set 363
5 Summary and outlook 365
References 366
Test for a Change Point in Bernoulli Trials with Dependence 367
1 Introduction 367
2 Test problem 368
3 Intercalary independence of Markov processes 370
4 Strategies for performing a test 371
5 Example 372
References 373
Data Mining in Protein Binding Cavities 375
1 Introduction 375
2 Other approaches 376
3 Theory and algorithm 377
4 First results 379
5 Conclusions 380
References 381
Classification of In Vivo Magnetic Resonance Spectra 383
1 Introduction 383
2 Data 384
2.1 General features 384
2.2 Details 384
3 Methods 385
3.1 Evaluated algorithms 386
3.2 Benchmark settings 387
4 Results 388
5 Conclusions 390
References 390
Modifying Microarray Analysis Methods for Categorical Data – SAM and PAM for SNPs 391
1 Introduction 391
2 Multiple testing and the false discovery rate 392
3 Significance analysis of microarrays 393
4 SAM applied to single nucleotide polymorphisms 394
5 Prediction analysis of microarrays 395
6 Prediction analysis of SNPs 396
7 Discussion 397
References 398
Improving the Identification of Differentially Expressed Genes in cDNA Microarray Experiments 399
1 Introduction 399
2 Data sets, LogRatio, RelDiff 400
3 Comparison of LogRatio and RelDiff 401
4 Stabilization of variance 405
5 Summary 406
References 406
PhyNav: A Novel Approach to Reconstruct Large Phylogenies 407
1 Introduction 407
2 Minimal k-distance subsets 408
3 The PhyNav algorithm 409
4 The efficiency of PhyNav 409
4.1 Simulated datasets 410
4.2 Biological datasets 411
5 Discussion and conclusion 412
Acknowledgments 413
References 413
NewsRec, a Personal Recommendation System for News Websites 415
1 Introduction 415
2 Requirements, system design, and implementation details 417
3 Website classification and evaluation measures 418
4 Empirical results 419
5 Conclusions and outlook 420
References 422
Clustering of Large Document Sets with Restricted Random Walks on Usage Histories 423
1 Motivation 423
2 Clustering with purchase histories 424
3 Time complexity 428
4 Results 428
5 Outlook 430
References 430
Fuzzy Two-mode Clustering vs. Collaborative Filtering 431
1 Introduction 431
2 Two-mode data analysis 432
2.1 Memory-based Collaborative Filtering (CF) 432
2.2 (Fuzzy) Two-Mode Clustering (FTMC) 433
3 The Delta-Method for fuzzy two-mode clustering 434
4 Examples and comparisons 435
5 Conclusions 437
References 437
Web Mining and Online Visibility 439
1 Introduction – “Why measurement of online visibility?” 439
2 (Human) Online search in a changing webgraph 439
2.1 The web as a graph 440
2.2 (Human) Online searching and surfing behavior 441
3 Measurement of Online Visibility 441
3.1 Main drivers of Online Visibility 442
3.2 Web data used for our sample 442
3.3 The measure GOVis 443
3.4 Results 444
4 Conclusion and managerial implications 445
References 446
Analysis of Recommender System Usage by Multidimensional Scaling 447
1 Introduction 447
2 Methodology 448
3 Empirical results 449
3.1 The data sets 449
3.2 Representation of products and search profiles 450
3.3 Analysis of system usage 451
4 Summary 453
References 454
On a Combination of Convex Risk Minimization Methods 455
1 Introduction 455
2 Strategy 455
3 Kernel logistic regression and ε-support vector regression 458
4 Application 460
5 Discussion 462
Acknowledgments 462
References 462
Credit Scoring Using Global and Local Statistical Models 463
1 Introduction 463
2 Description of the data set 464
3 Global scoring model 464
3.1 Global scoring using logistic discriminant analysis 464
3.2 Classification rule under constraints 465
4 Local scoring by two-stage classification 466
4.1 Clustering using self-organizing maps 467
4.2 K-means cluster analysis 468
4.3 Evaluation of two-stage classification 468
5 Application to the test sample 469
6 Conclusions 470
References 470
Informative Patterns for Credit Scoring: Support Vector Machines Preselect Data Subsets for Linear Discriminant Analysis 471
1 Introduction 471
2 Linear SVM and LDA 472
3 Subset preselection for LDA: Empirical results 475
3.1 About typical and critical subsets 475
3.2 LDA with subset preselection 476
3.3 Comparing SVM, LDA and LDA-SP 476
3.4 Advantages of LDA with subset preselection 477
4 Conclusions 477
References 478
Application of Support Vector Machines in a Life Assurance Environment 479
1 Introduction 479
2 Support vector machines 480
3 Problem context and the data 481
4 A measure of variable importance 482
5 Results 484
References 486
Continuous Market Risk Budgeting in Financial Institutions 487
1 Introduction 487
2 Analysis framework 488
3 Time dimension of risk limits 489
4 Continuous risk budgeting 490
5 Simulation analysis 492
Acknowledgement 493
References 494
Smooth Correlation Estimation with Application to Portfolio Credit Risk 495
1 Introduction 495
2 The sector variable 496
3 Testing for independence 497
4 Model generation 498
5 A one-factor model 499
6 Algebraic approximation 500
7 Impact on the practical performance 501
References 501
A Appendix 502
How Many Lexical-semantic Relations are Necessary? 503
1 Introduction 503
2 Concept calculus 504
3 Diagrammatic representation 506
4 Concept and linguistic sign 509
5 Summary 510
References 510
Automated Detection of Morphemes Using Distributional Measurements 511
1 Overview and introduction 511
2 Why bother with the segmentation of words at all? 512
3 The historical background of research: Distributional analysis 512
4 Basic method 513
5 Refinements of the evaluation 515
6 Transferring graphemic to phonemic representation 516
7 Concluding remarks 517
References 518
Classification of Author and/or Genre? The Impact of Word Length 519
1 Word length and the quantitative description of text(s) and author(s) 519
2 A case study: text basis and analytical options 520
3 Methods of text discrimination 521
3.1 Quantitative measures for text analysis 522
3.2 Discriminant analysis 523
3.3 Statistical distance as a measure for data discrimination 523
4 Summary 526
References 526
Some Historical Remarks on Library Classification – a Short Introduction to the Science of Library Classification 527
1 Introduction 527
2 Classified arrangement in monastery libraries of the Middle Ages 528
3 Classified arrangement in private libraries of the Middle Ages 528
4 Classified arrangement in the late Middle Ages and at the beginning of modern times 529
6 Systematic cataloguing in the 18th century 530
7 Subject cataloguing in the 19th century 530
8 Subject cataloguing in the 20th century 531
References 532
Automatic Validation of Hierarchical Cluster Analysis with Application in Dialectometry 534
1 Introduction 534
2 Pair-wise data clustering 535
3 Resampling techniques based on weights of observations 536
4 Rand’s measure for comparing partitions 536
5 A simulation study 538
6 Application in quantitative linguistics 539
7 Conclusions 540
References 541
Discovering the Senses of an Ambiguous Word by Clustering its Local Contexts 542
1 Introduction 542
2 Approach 543
3 Algorithm 544
4 Results 546
5 Conclusions and prospects 548
Acknowledgements 549
References 549
Document Management and the Development of Information Spaces 550
1 Starting point and task 550
2 Implementation 550
3 Representation of the information space 551
4 Processing flow text 551
5 Processing partially structured documents 554
6 Summary and outlook 556
References 557
Stochastic Ranking and the Volatility “Croissant”: A Sensitivity Analysis of Economic Rankings 558
1 Introduction 558
2 Index definition and ranking 559
3 Data 561
4 Sensitivity analysis by randomised weights 562
5 Ranking results 563
6 Conclusions 565
References 565
Importance Assessment of Correlated Predictors in Business Cycles Classification 566
1 Problem 566
1.1 Introduction 566
1.2 Measures of importance 567
2 Correlated predictors in regression models 567
2.1 Overview 567
2.2 Orthogonalization 568
3 Correlated predictors in classification models 569
3.1 Orthogonalization 569
3.2 Using a large number of variables 569
3.3 Results for the business cycle model 570
4 Discussion and outlook 571
References 573
Economic Freedom in the 25-Member European Union: Insights Using Classification Tools 574
1 Introduction 574
2 Data description and distance measures 575
2.1 Description of the economic freedom index data 575
2.2 Distance measures 576
3 Cluster analysis methods and cluster patterns 578
3.1 Cluster analysis methods 578
3.2 Empirical cluster patterns 579
4 Conclusion and outlook 581
References 581
Intercultural Consumer Classifications in E-Commerce 582
1 Introduction 582
2 The concept of construction consumer typologies 582
3 Characteristics for constructing typologies relevant for E-Commerce 583
3.1 Requirements regarding criteria used for constructing typologies 583
3.2 Selected constructs for a classification 583
4 Empirical survey of the typology theory 584
4.1 Survey design and data collection 584
4.2 A typology of online customers 585
5 Conclusion 588
References 588
Reservation Price Estimation by Adaptive Conjoint Analysis 590
1 Introduction 590
2 Conjoint analysis for reservation price estimation 591
3 Reservation price estimation based on economic theory 592
4 Application of the method 595
5 Conclusion and further research 596
References 597
Estimating Reservation Prices for Product Bundles Based on Paired Comparison Data 598
1 Introduction 598
2 Gathering data for conjoint measurement 599
2.1 Direct vs. indirect elicitation of reservation prices 599
2.2 Relative direct elicitation of reservation prices 600
3 Study design and application situation 601
4 Results 602
5 Discussion 604
References 604
Classification of Perceived Musical Intervals 606
1 Background 606
2 Experimental setting 608
3 Results 610
4 Conclusion 612
References 613
In Search of Variables Distinguishing Low and High Achievers in Music Sight Reading Task 614
1 Background 614
2 Method 615
3 Results 617
4 Discussion 619
References 620
Automatic Feature Extraction from Large Time Series 621
1 Introduction 621
2 Systematization of statistical methods 622
2.1 Windowing extends the method space 622
2.2 Method trees for feature extraction 623
2.3 Dynamic windowing in method trees 624
3 Automatic feature extraction 625
4 Experiments 626
4.1 Results 627
5 Conclusion 627
References 628
Identification of Musical Instruments by Means of the Hough-Transformation 629
1 The Hough-transform 629
2 Application to sound data 630
2.1 Digital sounds 630
2.2 Motivation: signal edges 630
2.3 Parametrization 631
2.4 Resulting data format 631
3 Classification 632
3.1 Approaches 632
3.2 Data set 633
3.3 Methods 633
3.4 Variable selection 634
3.5 Results 634
3.6 Comparing the results 635
4 Conclusions 636
References 636
Support Vector Machines for Bass and Snare Drum Recognition 637
1 Introduction 637
2 Previous work 638
3 Data gathering 639
4 Descriptors for audio 640
5 Support Vector Machines 641
6 Experiments and results 642
7 Conclusions and future work 643
Acknowledgements 644
References 644
Register Classification by Timbre 645
1 Introduction 645
2 Data 646
3 Classification methods 647
4 Results 648
4.1 Individual tones, voices only 648
4.2 Individual tones, voices and instruments 649
4.3 Averaged tones, voices only 649
4.4 Averaged tones, voices and instruments 649
5 Acoustics 650
6 Conclusion 651
References 652
Classification of Processes by the Lyapunov Exponent 653
1 Introduction 653
2 Lyapunov exponent 654
3 Well-predictable and not-well-predictable processes 656
4 Experimental results 658
5 Conclusion 659
References 659
Desirability to Characterize Process Capability 661
1 Introduction 661
2 Combining capability and desirability - the indices EDU and EDM 663
3 Discussion 665
4 Estimation 666
5 Simulation 667
6 Conclusion 668
References 668
Application and Use of Multivariate Control Charts in a BTA Deep Hole Drilling Process 669
1 Introduction 669
2 Monitoring the process using multiple Residual Shewhart control charts 670
3 Monitoring the process using multivariate control charts 671
3.1 Data depth 671
3.2 A control chart based on sequential rank of data depth measures 672
4 Application 673
4.1 Choice of the control charts parameters 673
4.2 Results 674
4.3 Discussion 675
5 Conclusion 675
Acknowledgements 676
References 676
Determination of Relevant Frequencies and Modeling Varying Amplitudes of Harmonic Processes 677
1 Introduction 677
2 Determination of the distribution of periodogram ordinates 678
3 Regression models on periodogram ordinates 679
3.1 Modelling varying amplitudes 679
3.2 Estimating the variance of ε (σ_ε²) 680
4 Simulation study on time-varying amplitudes 680
4.1 Design considerations 680
4.2 Results 681
5 Conclusions 684
References 684
Part IV Contest: Social Milieus in Dortmund 686
Introduction to the Contest “Social Milieus in Dortmund” 688
1 Contest goal and data 688
Application of a Genetic Algorithm to Variable Selection in Fuzzy Clustering 695
1 The problem 695
2 Tackling the problem 695
3 Methods 696
3.1 Fuzzy clustering 696
3.2 Measuring the clustering quality 697
3.3 Defining subgroups of variables 697
3.4 Genetic optimization algorithms 698
3.5 Implementation 699
4 Applying the procedure 699
4.1 The Dortmund data 699
4.2 Results 700
4.3 Comparing the results 701
5 Summary 702
References 702
Annealed k-Means Clustering and Decision Trees 703
1 Introduction 703
2 Preprocessing 704
3 Clustering 704
3.1 Annealed k-means 704
3.2 Learning about k 705
3.3 Solution 706
4 Classification 706
5 Interpretation 708
6 Outlook 709
References 710
Correspondence Clustering of Dortmund City Districts 711
1 Introduction 711
2 Material and methods 712
3 Results 715
4 Conclusion 718
References 718
Keywords 719
Authors 724
Quantitative Assessment of the Responsibility for the Disease Load in a Population (p. 109-110)
Wolfgang Uter and Olaf Gefeller
Department of Medical Informatics, Biometry and Epidemiology,
University of Erlangen Nuremberg, Germany
Abstract. The concept of attributable risk (AR), introduced more than 50 years ago, quantifies the proportion of cases diseased due to a certain exposure (risk) factor. While valid approaches to the estimation of crude or adjusted AR exist, a problem remains concerning the attribution of AR to each of a set of several exposure factors. Inspired by mathematical game theory, namely, the axioms of fairness and the Shapley value, introduced by Shapley in 1953, the concept of partial AR has been developed. The partial AR offers a unique solution for allocating shares of AR to a number of exposure factors of interest, as illustrated by data from the German Göttingen Risk, Incidence, and Prevalence Study (G.R.I.P.S.).
1 Introduction
Analytical epidemiological studies aim at providing quantitative information on the association between a certain exposure, or several exposures, and some disease outcome of interest. Usually, the disease etiology under study is multifactorial, so that several exposure factors have to be considered simultaneously. The effect of a particular exposure factor on the dichotomous disease variable is quantified by some measure of association, including the relative risk (RR) or the odds ratio (OR), which will be explained in the next section.
While these measures indicate by which factor the disease risk increases if a certain exposure factor is present in an individual, the concept of attributable risk (AR) addresses the impact of an exposure on the overall disease load in the population. This paper focusses on the AR, which can be informally introduced as the answer to the question, "what proportion of the observed cases of disease in the study population suffers from the disease due to the exposure of interest?". In providing this information the AR places the concept of RR commonly used in epidemiology in a public health perspective, namely by providing an answer also to the reciprocal question, "what proportion of cases of disease could - theoretically - be prevented if the exposure factor could be entirely removed by some adequate preventive action?". Since its introduction in 1953 (Levin (1953)), the concept of AR is increasingly being used by epidemiological researchers.
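As a rough illustration of the AR described above, the following Python sketch computes it for a single dichotomous exposure in two equivalent ways: from cohort counts, as the proportion of the overall disease risk that exceeds the risk among the unexposed, and from the exposure prevalence and the RR using the formula commonly attributed to Levin. The function names and the counts are hypothetical and are not taken from the chapter.

```python
def attributable_risk(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """AR = (P(D) - P(D | unexposed)) / P(D), from hypothetical cohort counts."""
    n_total = n_exposed + n_unexposed
    p_disease = (cases_exposed + cases_unexposed) / n_total
    p_disease_unexposed = cases_unexposed / n_unexposed
    return (p_disease - p_disease_unexposed) / p_disease


def levin_attributable_risk(prevalence_exposure, relative_risk):
    """AR = P(E)(RR - 1) / (1 + P(E)(RR - 1))."""
    excess = prevalence_exposure * (relative_risk - 1.0)
    return excess / (1.0 + excess)


# Hypothetical cohort: 100 exposed (30 cases), 100 unexposed (10 cases).
ar_counts = attributable_risk(30, 100, 10, 100)

# Same cohort expressed as exposure prevalence 0.5 and RR = 0.30 / 0.10 = 3.
ar_levin = levin_attributable_risk(0.5, 3.0)
```

With these made-up numbers both routes give an AR of 0.5, that is, half of the cases would theoretically be avoided if the exposure were removed.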
However, while the methodology of this invaluable epidemiological measure has constantly been extended to cover a variety of epidemiological situations, its practical use has not followed these advances satisfactorily (reviewed by Uter and Pfahlberg (1999)).
One of the difficulties in applying the concept of AR is the question of how to adequately estimate the AR associated with several exposure factors of interest, and not just one single exposure factor. The present paper briefly introduces the concept of sequential attributable risk (SAR) and then focusses on the partial attributable risk (PAR), following an axiomatic approach founded on game theory. For illustrative purposes, data from a German cohort study on risk factors for myocardial infarction are used.
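The relation between sequential and partial AR can be sketched in a few lines of Python. The sketch assumes the usual Shapley-value construction: for a given removal order, the sequential AR of a factor is the additional share of cases prevented when that factor is removed after the factors preceding it, and the partial AR is the average of these sequential ARs over all possible orders. The factor names "A" and "B" and the joint AR values are hypothetical and are not G.R.I.P.S. results.

```python
from itertools import permutations


def partial_attributable_risks(factors, joint_ar):
    """Average each factor's sequential AR over all removal orders.

    joint_ar maps a frozenset of removed factors to the AR obtained when
    exactly those factors are eliminated together (hypothetical input).
    """
    orderings = list(permutations(factors))
    totals = {factor: 0.0 for factor in factors}
    for order in orderings:
        removed = frozenset()
        for factor in order:
            extended = removed | {factor}
            # Sequential AR: extra AR gained by removing this factor now.
            totals[factor] += joint_ar[extended] - joint_ar[removed]
            removed = extended
    return {factor: totals[factor] / len(orderings) for factor in factors}


# Hypothetical joint ARs for two exposure factors.
joint_ar = {
    frozenset(): 0.0,
    frozenset({"A"}): 0.30,
    frozenset({"B"}): 0.20,
    frozenset({"A", "B"}): 0.40,
}

par = partial_attributable_risks(["A", "B"], joint_ar)
```

In this toy example the partial ARs are 0.25 for A and 0.15 for B, and they add up to the joint AR of 0.40. The enumeration of all orders grows factorially, so this direct form is only practical for a small number of factors.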
Publication date (per publisher) | 30 March 2006 |
---|---|
Series | Studies in Classification, Data Analysis, and Knowledge Organization |
Additional info | XX, 704 p. 181 illus. |
Place of publication | Berlin |
Language | English |
Subject areas | Mathematics / Computer Science ► Computer Science |
Mathematics / Computer Science ► Mathematics ► Statistics | |
Engineering | |
Business ► Business Administration / Management ► Business Informatics | |
Keywords | Calculus • classification • Clustering • Data Analysis • service-oriented computing |
ISBN-10 | 3-540-28084-7 / 3540280847 |
ISBN-13 | 978-3-540-28084-2 / 9783540280842 |
Size: 11.4 MB
DRM: Digital watermark
This eBook contains a digital watermark and is thereby personalized for you. If the eBook is improperly passed on to third parties, it can be traced back to the source.
File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly well suited to technical books with columns, tables, and figures. A PDF can be displayed on almost all devices but is of only limited suitability for small displays (smartphone, eReader).
System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You need a PDF viewer, e.g. Adobe Reader or Adobe Digital Editions.
eReader: This eBook can be read on (almost) all eBook readers. It is not compatible with the Amazon Kindle, however.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You need a PDF viewer, e.g. the free Adobe Digital Editions app.
Additional feature: Online reading
In addition to downloading it, you can also read this eBook online in your web browser.
Buying eBooks from abroad
For tax reasons, we can sell eBooks only within Germany and Switzerland. Regrettably, we cannot fulfill eBook orders from other countries.