Methods of Microarray Data Analysis II (eBook)
214 pages
Springer US (publisher)
978-0-306-47598-6 (ISBN)
Written for: Academic and industrial researchers
Microarray technology is a major experimental tool for functional genomic explorations, and will continue to be a major tool throughout this decade and beyond. The recent explosion of this technology threatens to overwhelm the scientific community with massive quantities of data. Because microarray data analysis is an emerging field, very few analytical models currently exist. Methods of Microarray Data Analysis II is the second book in this pioneering series dedicated to this exciting new field. In a single reference, readers can learn about the most up-to-date methods, ranging from data normalization, feature selection, and discriminative analysis to machine learning techniques. Currently, there are no standard procedures for the design and analysis of microarray experiments. Methods of Microarray Data Analysis II focuses on a single data set, using a different method of analysis in each chapter. Real examples expose the strengths and weaknesses of each method in a given situation, helping readers choose appropriate protocols and apply them to their own data sets. In addition, web links are provided to the programs and tools discussed in several chapters. This book is an excellent reference not only for academic and industrial researchers, but also for core bioinformatics/genomics courses in undergraduate and graduate programs.
Contents
Contributors
Acknowledgements
Preface
INTRODUCTION
CAMDA 2001 Data Sets
Feature Selection and Extraction
Clustering Strategies
Modeling Complex Systems
Ontologies, Semantic Understanding, and Functional Genomics
A standard protocol?
Web Companion
1 AN INTRODUCTION TO DNA MICROARRAYS
1. INTRODUCTION TO FUNCTIONAL GENOMICS
2. MICROARRAY TECHNOLOGY
3. MICROARRAY DATA
4. MICROARRAY EXPERIMENT GOALS
5. MICROARRAY EXPERIMENTAL DESIGN
6. MICROARRAY DATA ANALYSIS
7. RESULT VALIDATION
7.1 Sample and Data Triage
7.2 Statistical Validation
7.3 Biological Validation
8. CONCLUSION
REFERENCES
2 EXPERIMENTAL DESIGN FOR GENE MICROARRAY EXPERIMENTS AND DIFFERENTIAL EXPRESSION ANALYSIS
1. INTRODUCTION
2. DESIGN OF MICROARRAY EXPERIMENTS
2.1 Biological variation
2.2 Technological variations
2.3 Microarray quality checklist
3. EXPERIMENTAL DESIGNS THAT INCORPORATE BIOLOGICAL AND TECHNOLOGICAL VARIATION
3.1 Block designs
3.2 Randomization
3.3 Loop designs
3.4 Split plot designs
3.5 Optimal designs
4. DESIGN OF MICROARRAYS
5. NORMALIZATION MODELS
5.1 Data transformation and background removal
5.2 Linear vs. non-linear effects
5.3 Random vs. fixed effects
5.4 Ordinary least squares vs. orthogonal regression
5.5 Means vs. medians
5.6 Self-consistency
5.7 Flagging outliers
6. DIFFERENTIAL EXPRESSION
6.1 Error models
6.2 Bayesian approach
6.3 Adjustment for multiple comparisons and power considerations
7. FINAL REMARKS
ACKNOWLEDGEMENTS
REFERENCES
3 MICROARRAY DATA PROCESSING AND ANALYSIS
1. INTRODUCTION
2. DESIGN OF THE ARRAY
3. DATA ACQUISITION AND IMAGE ANALYSIS
4. NORMALISATION AND FILTERING
5. DATA STORAGE
6. ADDRESSING BIOLOGICAL QUESTIONS
7. DATA ANALYSIS
7.1 Two conditions comparison
7.2 Multiple conditions comparison
7.3 Gene networks
8. CONCLUSIONS AND FUTURE PROSPECTS
REFERENCES
4 BIOLOGY-DRIVEN CLUSTERING OF MICROARRAY DATA
1. INTRODUCTION
2. THE ANNOTATION PROBLEM
2.1 Reannotating the Spots
2.2 Finding Functional Categories
3. PRELIMINARY ANALYSIS
3.1 Data Preprocessing
3.2 Updating Cell Line Classifications
3.3 Choosing a Distance Metric
4. CHROMOSOMAL CLUSTERING
5. FUNCTIONAL CLUSTERING
6. CONCLUSIONS
ACKNOWLEDGEMENTS
REFERENCES
5 EXTRACTING GLOBAL STRUCTURE FROM GENE EXPRESSION PROFILES
1. INTRODUCTION
2. CLUSTERING WITH NORMALIZED CUT
2.1 The NCut Criterion
2.2 K-Way Partitioning
2.3 Clustering Large Datasets
3. RESULTS
4. CONCLUSIONS
REFERENCES
6 SUPERVISED NEURAL NETWORKS FOR CLUSTERING CONDITIONS IN DNA ARRAY DATA AFTER REDUCING NOISE BY CLUSTERING GENE EXPRESSION PROFILES
1. INTRODUCTION
2. COMPARATIVE PERFORMANCES OF CLUSTERING METHODS
2.1 Data set used
2.2 Comparative runtimes
2.3 Comparative accuracy
2.4 Conclusions on comparative performances
3. CLUSTERING OF CONDITIONS
3.1 The problem of noisy patterns
3.2 Clustering of conditions and noise reduction
4. CONCLUSIONS
ACKNOWLEDGEMENTS
REFERENCES
7 BAYESIAN DECOMPOSITION ANALYSIS OF GENE EXPRESSION IN YEAST DELETION MUTANTS
1. INTRODUCTION
1.1 The Development of Cancer
1.2 Microarray Measurements and Analysis
2. METHODS
2.1 Bayesian Decomposition
2.2 Issues in the Application of Bayesian Decomposition
2.3 Application to the Rosetta Compendium
3. RESULTS
3.1 Identification of the Patterns
3.2 Validation of a Pattern
4. CONCLUSIONS
ACKNOWLEDGEMENTS
REFERENCES
8 USING FUNCTIONAL GENOMIC UNITS TO CORROBORATE USER EXPERIMENTS WITH THE ROSETTA COMPENDIUM
1. INTRODUCTION
2. METHODS
2.1 GO Browser
2.2 ICA Model of the DNA Microarray
2.3 Profiling the yeast cells transfected with constitutive active human Rac1 gene
3. RESULTS
3.1 GO mapping of yeast genes
3.2 ICA Results
3.3 Using the Rosetta data set to corroborate the Rac1 Experiment
4. DISCUSSION
REFERENCES
9 FISHING EXPEDITION - A SUPERVISED APPROACH TO EXTRACT PATTERNS FROM A COMPENDIUM OF EXPRESSION PROFILES
1. OBJECTIVES
2. METHODS
2.1 Data Sets
2.2 The Algorithms
2.3 The approach
3. RESULTS
4. CONCLUSIONS
REFERENCES
10 MODELING PHARMACOGENOMICS OF THE NCI-60 ANTICANCER DATA SET: UTILIZING KERNEL PLS TO CORRELATE THE MICROARRAY DATA TO THERAPEUTIC RESPONSES
1. INTRODUCTION
2. MOTIVATION
3. METHODOLOGY
4. PERFORMANCE ANALYSES
5. DISCUSSION
ACKNOWLEDGEMENTS
REFERENCES
11 ANALYSIS OF GENE EXPRESSION PROFILES AND DRUG ACTIVITY PATTERNS BY CLUSTERING AND BAYESIAN NETWORK LEARNING
1. INTRODUCTION
2. CLUSTER ANALYSIS OF THE NCI60 DATASET
2.1 Soft Topographic Vector Quantization
2.2 Clustering of the NCI60 Cell Lines Using STVQ
3. DEPENDENCY ANALYSIS USING BAYESIAN NETWORK LEARNING
3.1 Bayesian Networks
3.2 Applying Bayesian Networks to the Analysis of NCI60 Dataset
3.3 Experimental Results
4. CONCLUSION AND FUTURE WORK
ACKNOWLEDGEMENTS
REFERENCES
12 EVALUATION OF CURRENT METHODS OF TESTING DIFFERENTIAL GENE EXPRESSION AND BEYOND
1. INTRODUCTION
2. MATERIALS AND METHODS
3. RESULTS
4. DISCUSSION
ACKNOWLEDGEMENTS
REFERENCES
13 EXTRACTING KNOWLEDGE FROM GENOMIC EXPERIMENTS BY INCORPORATING THE BIOMEDICAL LITERATURE
1. OBJECTIVE
2. ANALYTICAL METHODS
2.1 Data Sets
2.2 Software
3. RESULTS
4. DISCUSSION
4.1 Title Proximity
4.2 Genes Linked to the Disease
4.3 Genes That Cannot Be Linked to the Disease
4.4 Terms That Cannot Be Linked to Any Other Term
4.5 Types of Errors
4.6 Other Uses for the "Pharma Sentences"
4.7 Comparison To Other Tools
5. CONCLUSIONS
REFERENCES
Glossary
Index
5 EXTRACTING GLOBAL STRUCTURE FROM GENE EXPRESSION PROFILES (p. 81-82)
Charless Fowlkes (1), Qun Shan (2), Serge Belongie (3), and Jitendra Malik (1)
(1) Department of Computer Science and (2) Department of Molecular Cell Biology, University of California at Berkeley; (3) Department of Computer Science and Engineering, University of California at San Diego
Abstract: We have developed a program, GENECUT, for analyzing datasets from gene expression profiling. GENECUT is based on a pairwise clustering method known as Normalized Cut (Shi and Malik, 1997). GENECUT extracts global structures by progressively partitioning datasets into well-balanced groups, performing an intuitive k-way partitioning at each stage in contrast to commonly used 2-way partitioning schemes. By making use of the Nyström approximation, it is possible to perform clustering on very large genomic datasets.
Key words: gene expression profiles, clustering analysis, spectral partitioning
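To make the Normalized Cut criterion named in the abstract more concrete, the following is a minimal sketch of the two-way NCut value as defined by Shi and Malik: the weight of the edges cut between two groups, normalized by each group's total association with all nodes. It is not taken from GENECUT; the function name `ncut_value` and the small affinity matrix are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Sequence


def ncut_value(weights: Sequence[Sequence[float]], group_a: Iterable[int]) -> float:
    """Two-way normalized cut: cut(A,B)/assoc(A,V) + cut(A,B)/assoc(B,V)."""
    n = len(weights)
    a = set(group_a)
    b = set(range(n)) - a
    # Total affinity on edges that cross between the two groups.
    cut = sum(weights[i][j] for i in a for j in b)
    # Total affinity from each group to every node in the graph.
    assoc_a = sum(weights[i][j] for i in a for j in range(n))
    assoc_b = sum(weights[i][j] for i in b for j in range(n))
    return cut / assoc_a + cut / assoc_b


# Hypothetical symmetric affinities among four genes:
# genes 0 and 1 are similar, genes 2 and 3 are similar, cross links are weak.
affinity = [
    [0.0, 0.9, 0.1, 0.0],
    [0.9, 0.0, 0.0, 0.1],
    [0.1, 0.0, 0.0, 0.8],
    [0.0, 0.1, 0.8, 0.0],
]

natural_split = ncut_value(affinity, [0, 1])  # separates the two tight pairs
mixed_split = ncut_value(affinity, [0, 2])    # breaks both tight pairs apart
```

A lower value indicates a better partition, so the split that keeps the similar genes together scores lower than the one that separates them.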
1. INTRODUCTION
DNA microarray technology empowers biologists to analyze thousands of mRNA transcripts in parallel, providing insights into the cellular states of tumor cells, the effects of mutations and knockouts, progression of the cell cycle, and reactions to environmental stresses or drug treatments. Gene expression profiles also provide the raw data needed to interrogate cellular transcription regulation networks. Efforts have been made to identify cis-acting elements based on the assumption that co-regulated genes have a higher probability of sharing transcription factor binding sites. There is a well-recognized need for tools that allow biologists to explore public-domain microarray datasets and integrate the insights gained into their own research. One important approach for structuring the exploration of gene expression data is to find coherent clusters of both genes and experimental conditions. The association of unknown genes with functionally well-characterized genes will guide the formation of hypotheses and suggest experiments to uncover the function of these unknown genes. Similarly, experimental conditions that cluster together may affect the same regulatory pathway.
Unsupervised clustering is a classical data analysis problem that remains an active area of research in the computer science and statistics communities (Ripley, 1996). Broadly speaking, the goal of clustering is to partition a set of feature vectors into k groups such that the partition is "good" according to some cost function. In the case of genes, the feature vector is usually the degree of induction or suppression over some set of experimental conditions. As yet, there is no clear consensus as to which algorithms are most suitable for gene expression data.
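As an illustration of what "good according to some cost function" can mean, here is a minimal sketch that scores a partition by its within-cluster sum of squared distances. This particular cost is an assumption chosen for simplicity, not one the chapter prescribes; the function name `within_cluster_cost` and the expression values are hypothetical.

```python
from typing import Dict, List, Sequence


def within_cluster_cost(vectors: Sequence[Sequence[float]],
                        labels: Sequence[int]) -> float:
    """Sum of squared Euclidean distances from each vector to its cluster mean."""
    groups: Dict[int, List[Sequence[float]]] = {}
    for vec, lab in zip(vectors, labels):
        groups.setdefault(lab, []).append(vec)

    cost = 0.0
    for members in groups.values():
        dim = len(members[0])
        # Mean profile of the cluster, one value per experimental condition.
        mean = [sum(v[d] for v in members) / len(members) for d in range(dim)]
        cost += sum((v[d] - mean[d]) ** 2 for v in members for d in range(dim))
    return cost


# Hypothetical induction/suppression profiles: four genes, three conditions.
profiles = [
    [2.0, 1.0, 0.0],
    [2.0, 1.0, 0.0],
    [-1.0, 0.0, 1.0],
    [-1.0, 0.0, 3.0],
]

good = within_cluster_cost(profiles, [0, 0, 1, 1])  # similar genes grouped
bad = within_cluster_cost(profiles, [0, 1, 0, 1])   # similar genes split up
```

Under this cost, the partition that groups similar profiles together receives the lower score; other cost functions may rank partitions differently.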
Clustering methods generally fall into one of two categories: central or pairwise (Buhmann, 1995). Central clustering is based on the idea of prototypes, wherein one finds a small number of prototypical feature vectors to serve as "cluster centers". Feature vectors are then assigned to the most similar cluster center. Pairwise methods are based directly on the distances between all pairs of feature vectors in the data set. Pairwise methods do not require one to solve for prototypes, which provides certain advantages over central methods. For example, when the clusters are not simple, compact clouds in feature space, central methods are ill-suited, while pairwise methods perform well because similarity is allowed to propagate in a transitive fashion from neighbor to neighbor. A family of genes related by a series of small mutations might well exhibit this sort of structure, particularly when features are based on sequence data. Clustering algorithms can also often be characterized as greedy or global in nature. The agglomerative clustering method used by Eisen et al. (1998) to order microarray data is an example of a greedy pairwise method: it starts with a full matrix of pairwise distances, locates the smallest value, merges the corresponding pair, and repeats until the whole dataset has been merged into a single cluster. Because this type of process considers only the closest pair of data points at each step, global structure present in the data may not be handled properly.
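The greedy pairwise procedure described above can be sketched as follows. This is an illustrative reconstruction, not the code of Eisen et al.: the Euclidean distance, the average-linkage rule for comparing merged clusters, and the names `euclidean` and `agglomerate` are assumptions made for the example.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Cluster = Tuple[int, ...]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def agglomerate(vectors: Sequence[Sequence[float]]) -> List[Tuple[Cluster, Cluster, float]]:
    """Greedy agglomerative clustering; returns the merge history."""
    clusters: List[Cluster] = [(i,) for i in range(len(vectors))]
    merges: List[Tuple[Cluster, Cluster, float]] = []

    def linkage(a: Cluster, b: Cluster) -> float:
        # Average pairwise distance between members (assumed linkage rule).
        total = sum(euclidean(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Greedy step: only the currently closest pair is considered.
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((a, b, linkage(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical one-dimensional expression values for four genes.
points = [[0.0], [1.0], [5.0], [6.5]]
history = agglomerate(points)
```

Each entry in `history` records which two clusters were merged and at what distance. Because every step commits to the locally closest pair, an early merge is never revisited, which is the limitation the paragraph above points out.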
Publication date (per publisher) | 8 May 2007 |
---|---|
Language | English |
Subject area | Non-fiction / Guides |
Computer Science ► Further Topics ► Bioinformatics | |
Mathematics / Computer Science ► Mathematics | |
Medical Studies ► 2nd Study Phase (Clinical) ► Human Genetics | |
Natural Sciences ► Biology ► Biochemistry | |
Natural Sciences ► Biology ► Genetics / Molecular Biology | |
Technology | |
ISBN-10 | 0-306-47598-7 / 0306475987 |
ISBN-13 | 978-0-306-47598-6 / 9780306475986 |
Size: 13.3 MB
DRM: Digital watermark
This eBook contains a digital watermark and is thereby personalized for you. If the eBook is improperly passed on to third parties, it can be traced back to the source.
File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly suitable for technical books with columns, tables, and figures. A PDF can be displayed on almost all devices, but is only of limited suitability for small displays (smartphone, eReader).
System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You need a PDF viewer, e.g. Adobe Reader or Adobe Digital Editions.
eReader: This eBook can be read on (almost) all eBook readers. However, it is not compatible with the Amazon Kindle.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You need a PDF viewer, e.g. the free Adobe Digital Editions app.
Buying eBooks from abroad
For tax reasons, we can sell eBooks only within Germany and Switzerland. Unfortunately, we cannot fulfill eBook orders from other countries.