Classification Analysis of DNA Microarrays

Leif E. Peterson (Author)

Media combination

736 pages

2013
John Wiley & Sons Inc
978-0-470-17081-6 (ISBN)


The rapid, uncontrolled growth of classification methods in DNA microarray studies has left a body of information scattered across the literature, numerous conference proceedings, and other sources. This book brings together many of the unsupervised and supervised classification methods now dispersed in the literature.

Wiley Series in Bioinformatics: Computational Techniques and Engineering
Yi Pan and Albert Y. Zomaya, Series Editors

Broad coverage of traditional unsupervised and supervised methods, along with newer approaches, to help researchers keep pace with the rapid growth of classification methods in DNA microarray studies.

Proliferating classification methods in DNA microarray studies have resulted in a body of information scattered throughout the literature, conference proceedings, and elsewhere. This book unites many of these classification methods in a single volume. In addition to traditional statistical methods, it covers newer machine-learning approaches such as fuzzy methods, artificial neural networks, evolutionary-based genetic algorithms, support vector machines, swarm intelligence involving particle swarm optimization, and more.

Classification Analysis of DNA Microarrays provides highly detailed pseudo-code and rich, graphical programming features, plus ready-to-run source code. Along with primary methods that include traditional and contemporary classification, it offers supplementary tools and data preparation routines for standardization and fuzzification; dimensional reduction via crisp and fuzzy c-means, PCA, and nonlinear manifold learning; computational linguistics via text analytics and n-gram analysis; recursive feature elimination during ANN training; kernel-based methods; and ensemble classifier fusion.

This powerful new resource:

Provides information on the use of classification analysis for DNA microarrays used for large-scale high-throughput transcriptional studies
Serves as a historical repository of general-use supervised classification methods as well as newer methods
Brings the reader quickly up to speed on the various classification methods through the pseudo-code and source code provided in the book
Describes implementation methods that help shorten discovery times

Classification Analysis of DNA Microarrays is useful for professionals and graduate students in computer science, bioinformatics, biostatistics, systems biology, and many related fields.

LEIF E. PETERSON, PhD, is Associate Professor of Public Health, Weill Cornell Medical College, Cornell University, and is with the Center for Biostatistics, The Methodist Hospital Research Institute (Houston). He is a member of the IEEE Computational Intelligence Society and Editor-in-Chief of the BioMed Central journal Source Code for Biology and Medicine.

Preface

Abbreviations

1 Introduction

1.1 Class Discovery

1.2 Dimensional Reduction

1.3 Class Prediction

1.4 Classification Rules of Thumb

1.5 DNA Microarray Datasets Used

References

Part I Class Discovery

2 Crisp K-Means Cluster Analysis

2.1 Introduction

2.2 Algorithm

2.3 Implementation

2.4 Distance Metrics

2.5 Cluster Validity

2.5.1 Davies–Bouldin Index

2.5.2 Dunn’s Index

2.5.3 Intracluster Distance

2.5.4 Intercluster Distance

2.5.5 Silhouette Index

2.5.6 Hubert’s Statistic

2.5.7 Randomization Tests for Optimal Value of K

2.6 V-Fold Cross-Validation

2.7 Cluster Initialization

2.7.1 K Randomly Selected Microarrays

2.7.2 K Random Partitions

2.7.3 Prototype Splitting

2.8 Cluster Outliers

2.9 Summary

References

3 Fuzzy K-Means Cluster Analysis

3.1 Introduction

3.2 Fuzzy K-Means Algorithm

3.3 Implementation

3.4 Summary

References

4 Self-Organizing Maps

4.1 Introduction

4.2 Algorithm

4.2.1 Feature Transformation and Reference Vector Initialization

4.2.2 Learning

4.2.3 Conscience

4.3 Implementation

4.3.1 Feature Transformation and Reference Vector Initialization

4.3.2 Reference Vector Weight Learning

4.4 Cluster Visualization

4.4.1 Crisp K-Means Cluster Analysis

4.4.2 Adjacency Matrix Method

4.4.3 Cluster Connectivity Method

4.4.4 Hue–Saturation–Value (HSV) Color Normalization

4.5 Unified Distance Matrix (U Matrix)

4.6 Component Map

4.7 Map Quality

4.8 Nonlinear Dimension Reduction

References

5 Unsupervised Neural Gas

5.1 Introduction

5.2 Algorithm

5.3 Implementation

5.3.1 Feature Transformation and Prototype Initialization

5.3.2 Prototype Learning

5.4 Nonlinear Dimension Reduction

5.5 Summary

References

6 Hierarchical Cluster Analysis

6.1 Introduction

6.2 Methods

6.2.1 General Programming Methods

6.2.2 Step 1: Cluster-Analyzing Arrays as Objects with Genes as Attributes

6.2.3 Step 2: Cluster-Analyzing Genes as Objects with Arrays as Attributes

6.3 Algorithm

6.4 Implementation

6.4.1 Heatmap Color Control

6.4.2 User Choices for Clustering Arrays and Genes

6.4.3 Distance Matrices and Agglomeration Sequences

6.4.4 Drawing Dendrograms and Heatmaps

References

7 Model-Based Clustering

7.1 Introduction

7.2 Algorithm

7.3 Implementation

7.4 Summary

References

8 Text Mining: Document Clustering

8.1 Introduction

8.2 Duo-Mining

8.3 Streams and Documents

8.4 Lexical Analysis

8.4.1 Automatic Indexing

8.4.2 Removing Stopwords

8.5 Stemming

8.6 Term Weighting

8.7 Concept Vectors

8.8 Main Terms Representing Concept Vectors

8.9 Algorithm

8.10 Preprocessing

8.11 Summary

References

9 Text Mining: N-Gram Analysis

9.1 Introduction

9.2 Algorithm

9.3 Implementation

9.4 Summary

References

Part II Dimension Reduction

10 Principal Components Analysis

10.1 Introduction

10.2 Multivariate Statistical Theory

10.2.1 Matrix Definitions

10.2.2 Principal Component Solution of R

10.2.3 Extraction of Principal Components

10.2.4 Varimax Orthogonal Rotation of Components

10.2.5 Principal Component Score Coefficients

10.2.6 Principal Component Scores

10.3 Algorithm

10.4 When to Use Loadings and PC Scores

10.5 Implementation

10.5.1 Correlation Matrix R

10.5.2 Eigenanalysis of Correlation Matrix R

10.5.3 Determination of Loadings and Varimax Rotation

10.5.4 Calculating Principal Component (PC) Scores

10.6 Rules of Thumb for PCA

10.7 Summary

References

11 Nonlinear Manifold Learning

11.1 Introduction

11.2 Correlation-Based PCA

11.3 Kernel PCA

11.4 Diffusion Maps

11.5 Laplacian Eigenmaps

11.6 Local Linear Embedding

11.7 Locality Preserving Projections

11.8 Sammon Mapping

11.9 NLML Prior to Classification Analysis

11.10 Classification Results

11.11 Summary

References

Part III Class Prediction

12 Feature Selection

12.1 Introduction

12.2 Filtering versus Wrapping

12.3 Data

12.3.1 Numbers

12.3.2 Responses

12.3.3 Measurement Scales

12.3.4 Variables

12.4 Data Arrangement

12.5 Filtering

12.5.1 Continuous Features

12.5.2 Best Rank Filters

12.5.3 Randomization Tests

12.5.4 Multitesting Problem

12.5.5 Filtering Qualitative Features

12.5.6 Multiclass Gini Diversity Index

12.5.7 Class Comparison Techniques

12.5.8 Generation of Nonredundant Gene List

12.6 Selection Methods

12.6.1 Greedy Plus Takeaway (Greedy PTA)

12.6.2 Best Ranked Genes

12.7 Multicollinearity

12.8 Summary

References

13 Classifier Performance

13.1 Introduction

13.2 Input–Output, Speed, and Efficiency

13.3 Training, Testing, and Validation

13.4 Ensemble Classifier Fusion

13.5 Sensitivity and Specificity

13.6 Bias

13.7 Variance

13.8 Receiver–Operator Characteristic (ROC) Curves

References

14 Linear Regression

14.1 Introduction

14.2 Algorithm

14.3 Implementation

14.4 Cross-Validation Results

14.5 Bootstrap Bias

14.6 Multiclass ROC Curves

14.7 Decision Boundaries

14.8 Summary

References

15 Decision Tree Classification

15.1 Introduction

15.2 Features Used

15.3 Terminal Nodes and Stopping Criteria

15.4 Algorithm

15.5 Implementation

15.6 Cross-Validation Results

15.7 Decision Boundaries

15.8 Summary

References

16 Random Forests

16.1 Introduction

16.2 Algorithm

16.3 Importance Scores

16.4 Strength and Correlation

16.5 Proximity and Supervised Clustering

16.6 Unsupervised Clustering

16.7 Class Outlier Detection

16.8 Implementation

16.9 Parameter Effects

16.10 Summary

References

17 K Nearest Neighbor

17.1 Introduction

17.2 Algorithm

17.3 Implementation

17.4 Cross-Validation Results

17.5 Bootstrap Bias

17.6 Multiclass ROC Curves

17.7 Decision Boundaries

17.8 Summary

References

18 Naïve Bayes Classifier

18.1 Introduction

18.2 Algorithm

18.3 Cross-Validation Results

18.4 Bootstrap Bias

18.5 Multiclass ROC Curves

18.6 Decision Boundaries

18.7 Summary

References

19 Linear Discriminant Analysis

19.1 Introduction

19.2 Multivariate Matrix Definitions

19.3 Linear Discriminant Analysis

19.3.1 Algorithm

19.3.2 Cross-Validation Results

19.3.3 Bootstrap Bias

19.3.4 Multiclass ROC Curves

19.3.5 Decision Boundaries

19.4 Quadratic Discriminant Analysis

19.5 Fisher’s Discriminant Analysis

19.6 Summary

References

20 Learning Vector Quantization

20.1 Introduction

20.2 Cross-Validation Results

20.3 Bootstrap Bias

20.4 Multiclass ROC Curves

20.5 Decision Boundaries

20.6 Summary

References

21 Logistic Regression

21.1 Introduction

21.2 Binary Logistic Regression

21.3 Polytomous Logistic Regression

21.4 Cross-Validation Results

21.5 Decision Boundaries

21.6 Summary

References

22 Support Vector Machines

22.1 Introduction

22.2 Hard-Margin SVM for Linearly Separable Classes

22.3 Kernel Mapping into Nonlinear Feature Space

22.4 Soft-Margin SVM for Nonlinearly Separable Classes

22.5 Gradient Ascent Soft-Margin SVM

22.5.1 Cross-Validation Results

22.5.2 Bootstrap Bias

22.5.3 Multiclass ROC Curves

22.5.4 Decision Boundaries

22.6 Least-Squares Soft-Margin SVM

22.6.1 Cross-Validation Results

22.6.2 Bootstrap Bias

22.6.3 Multiclass ROC Curves

22.6.4 Decision Boundaries

22.7 Summary

References

23 Artificial Neural Networks

23.1 Introduction

23.2 ANN Architecture

23.3 Basics of ANN Training

23.3.1 Backpropagation Learning

23.3.2 Resilient Backpropagation (RPROP) Learning

23.3.3 Cycles and Epochs

23.4 ANN Training Methods

23.4.1 Method 1: Gene Dimensional Reduction and Recursive Feature Elimination for Large Gene Lists

23.4.2 Method 2: Gene Filtering and Selection

23.5 Algorithm

23.6 Batch versus Online Training

23.7 ANN Testing

23.8 Cross-Validation Results

23.9 Bootstrap Bias

23.10 Multiclass ROC Curves

23.11 Decision Boundaries

23.12 RPROP versus Backpropagation

23.13 Summary

References

24 Kernel Regression

24.1 Introduction

24.2 Algorithm

24.3 Cross-Validation Results

24.4 Bootstrap Bias

24.5 Multiclass ROC Curves

24.6 Decision Boundaries

24.7 Summary

References

25 Neural Adaptive Learning with Metaheuristics

25.1 Multilayer Perceptrons

25.2 Genetic Algorithms

25.3 Covariance Matrix Self-Adaptation–Evolution Strategies

25.4 Particle Swarm Optimization

25.5 Ant Colony Optimization

25.5.1 Classification

25.5.2 Continuous-Function Approximation

25.6 Summary

References

26 Supervised Neural Gas

26.1 Introduction

26.2 Algorithm

26.3 Cross-Validation Results

26.4 Bootstrap Bias

26.5 Multiclass ROC Curves

26.6 Class Decision Boundaries

26.7 Summary

References

27 Mixture of Experts

27.1 Introduction

27.2 Algorithm

27.3 Cross-Validation Results

27.4 Decision Boundaries

27.5 Summary

References

28 Covariance Matrix Filtering

28.1 Introduction

28.2 Covariance and Correlation Matrices

28.3 Random Matrices

28.4 Component Subtraction

28.5 Covariance Matrix Shrinkage

28.6 Covariance Matrix Filtering

28.7 Summary

References

Appendixes

A Probability Primer

A.1 Choices

A.2 Permutations

A.3 Combinations

A.4 Probability

A.4.1 Addition Rule

A.4.2 Multiplication Rule and Conditional Probabilities

A.4.3 Multiplication Rule for Independent Events

A.4.4 Elimination Rule (Disease Prevalence)

A.4.5 Bayes’ Rule (Pathway Probabilities)

B Matrix Algebra

B.1 Vectors

B.2 Matrices

B.3 Sample Mean, Covariance, and Correlation

B.4 Diagonal Matrices

B.5 Identity Matrices

B.6 Trace of a Matrix

B.7 Eigenanalysis

B.8 Symmetric Eigenvalue Problem

B.9 Generalized Eigenvalue Problem

B.10 Matrix Properties

C Mathematical Functions

C.1 Inequalities

C.2 Laws of Exponents

C.3 Laws of Radicals

C.4 Absolute Value

C.5 Logarithms

C.6 Product and Summation Operators

C.7 Partial Derivatives

C.8 Likelihood Functions

D Statistical Primitives

D.1 Rules of Thumb

D.2 Primitives

References

E Probability Distributions

E.1 Basics of Hypothesis Testing

E.2 Probability Functions: Source of p Values

E.3 Normal Distribution

E.4 Gamma Function

E.5 Beta Function

E.6 Pseudo-Random-Number Generation

E.6.1 Standard Uniform Distribution

E.6.2 Normal Distribution

E.6.3 Lognormal Distribution

E.6.4 Binomial Distribution

E.6.5 Poisson Distribution

E.6.6 Triangle Distribution

E.6.7 Log-Triangle Distribution

References

F Symbols and Notation

Index

Series	Wiley Series in Bioinformatics
Contributors	Series editors: Yi Pan, Albert Y. Zomaya
Place of publication	New York
Language	English
Dimensions	155 x 234 mm
Weight	1157 g
Subject area	Computer Science ► Other Topics ► Bioinformatics
Subject area	Natural Sciences ► Biology ► Genetics / Molecular Biology
ISBN-10	0-470-17081-6 / 0470170816
ISBN-13	978-0-470-17081-6 / 9780470170816
Condition	New