Scientific Data Mining and Knowledge Discovery (eBook)

Principles and Foundations

Mohamed Medhat Gaber (Editor)

eBook Download: PDF

2009 | 2010
X, 400 pages
Springer Berlin (Publisher)
978-3-642-02788-8 (ISBN)

Mohamed Medhat Gaber

'It is not my aim to surprise or shock you - but the simplest way I can summarise is to say that there are now in the world machines that think, that learn and that create. Moreover, their ability to do these things is going to increase rapidly until - in a visible future - the range of problems they can handle will be coextensive with the range to which the human mind has been applied' by Herbert A. Simon (1916-2001)

1 Overview

This book suits both graduate students and researchers with a focus on discovering knowledge from scientific data. The use of computational power for data analysis and knowledge discovery in scientific disciplines has found its roots with the revolution of high-performance computing systems. Computational science in physics, chemistry, and biology represents the first step towards automation of data analysis tasks. The rationale behind the development of computational science in different areas was automating mathematical operations performed in those areas. There was no attention paid to the scientific discovery process. Automated Scientific Discovery (ASD) [1-3] represents the second natural step. ASD attempted to automate the process of theory discovery supported by studies in philosophy of science and cognitive sciences. Although early research articles have shown great successes, the area has not evolved due to many reasons. The most important reason was the lack of interaction between scientists and the automating systems.

Contents 6
Contributors 8
Introduction 10
1 Overview 10
2 Book Organization 11
3 Final Remarks 12
References 13
Part I Background 14
Machine Learning 15
1 Introduction 16
The Learner's Way of Interaction 16
2 General Preliminaries for Learning Concepts from Examples 17
2.1 Representing Training Data 18
2.2 Learning Algorithms 19
2.3 Objects, Concepts and Concept Classes 19
2.4 Consistent and Complete Concepts 20
3 Generalisation as Search 20
4 Learning of Classification Rules 24
4.1 Model-Based Learning Approaches: The AQ Family 24
The AQ algorithm 25
Learning a Single Class 26
4.2 Non-Boolean Attributes 27
4.3 Problems and Further Possibilities of the AQ Framework 29
Searching for Extensions of a Rule 29
Learning Multiple Classes 29
Learning Relational Concepts Using the AQ Approach 30
Extending AQ 30
5 Learning Decision Trees 30
5.1 Representing Functions in Decision Trees 30
5.2 The Learning Process 31
The Termination Condition: T(S) 33
The Evaluation Function: ev(a, S) 33
5.3 Representational Aspects in Decision Tree Learning 37
Continuous Attributes 37
Unknown Attribute Values 37
Splitting Strategies 37
5.4 Over-Fitting and Tree Pruning 38
6 Inductive Logic Programming 38
Relative Least General Generalisation 41
Inverse Resolution 42
Specialisation Techniques 45
6.1 Biases in ILP Systems 46
Language Bias 47
Search Bias 48
6.2 Discussion on Inductive Logic Programming 48
7 Association Rules 49
7.1 The Apriori Algorithm 51
7.2 Discussion on Association Rules 52
8 Naive Bayesian Classifiers 53
9 Improving Classifiers by Using Committee Machines 54
9.1 Bagging 55
9.2 AdaBoost 55
10 Discussion on Learning 55
References 57
Statistical Inference 61
1 Introduction 61
2 Approaches to Statistical Inference 62
2.1 Parametric Inference 62
Generalized Linear Model 63
Special Cases 64
Generalizing Symmetric Distribution 64
2.2 Nonparametric Inference 65
2.3 Prediction Analysis 65
3 Classical Inference 66
3.1 Inference for Multiple Regression Model 66
Improved Estimators 68
3.2 Estimation for Multivariate Non-Normal Models 69
3.3 Prediction Distribution for Elliptic Models 72
3.4 Tolerance Region for Elliptic Models 74
4 Bayesian Inference 76
4.1 Bayesian Philosophy 76
4.2 Estimation 77
4.3 Prediction Distribution 79
5 Nonparametric Methods 80
5.1 Estimation 80
5.2 Test of Hypothesis 82
References 83
The Philosophy of Science and its relation to Machine Learning 85
1 Introduction 85
2 What is the Philosophy of Science? 86
3 Hypothesis Choice and Model Selection 87
4 Inductivism Versus Falsificationism 88
5 Bayesian Epistemology and Causality 90
6 Evidence Integration 93
Example: Cancer Prognosis 94
7 Conclusion 96
References 97
Concept Formation in Scientific Knowledge Discovery from a Constructivist View 98
1 Introduction 98
2 Concept Formation from a Constructivist View 99
2.1 Knowledge as Generalization 99
2.2 Knowledge as Construction 100
2.3 Experiential Learning 102
2.4 Concept Formation in Constructive Memory 104
2.5 From First Person Construct to Third Person Knowledge 107
3 Concept Formation in a Hydrologic Modeling Scenario 108
4 Challenges for Designing Scientific Knowledge Discovery Tools 111
5 Conclusion 113
References 113
Knowledge Representation and Ontologies 117
1 Knowledge Representation 117
1.1 Principles of Representing Knowledge 117
Forms of Representing Knowledge 117
Reasoning about Knowledge 120
1.2 Logical Knowledge Representation Formalisms 122
Classical Model-Theoretic Semantics 122
Description Logics 123
Logic Programming 125
1.3 Knowledge Representation Paradigms 126
Open-World vs. Closed-World View 126
Clear-Cut Predication vs. Metamodeling 128
Conceptual Modeling vs. Rules 128
2 Ontologies 129
2.1 Notion of an Ontology 129
2.2 Ontologies in Information Systems 132
Appearance of Ontologies 132
Utilisation of Ontologies 135
2.3 Semantic Web and Ontology Languages 138
The Semantic Web Vision 138
Semantic Web Ontology Languages 139
3 Outlook 141
References 142
Part II Computational Science 144
Spatial Techniques 145
1 Introduction 145
2 Global Positioning System 145
2.1 What is GPS? 145
2.2 Method of Operation 147
2.3 Technical Description 148
Space Segment 148
Control Segment 149
User Segment 150
2.4 Navigation Signals 150
2.5 GPS Error Sources and Jamming 151
2.6 Improving the Accuracy of GPS Signals 154
3 Remote Sensing 155
3.1 What is Remote Sensing? 155
3.2 Principal Process of Remote Sensing 155
Energy Source 155
Radiation and Atmosphere 157
Interaction with the Target 157
Image 157
Analysis and Interpretation 158
3.3 Types of Remote Sensing 159
Optical and Infrared Remote Sensing 159
Microwave Remote Sensing 160
Airborne Remote Sensing 160
Spaceborne Remote Sensing 161
3.4 Image Processing 161
4 GIS: Geographic Information System 165
4.1 What is a GIS? 165
4.2 Data Capture and Creation 166
4.3 Types of GIS Data Models 167
4.4 Spatial Analysis with GIS 170
Spatial Overlay 171
Contiguity Analysis 172
Surface Analysis 173
Raster Analysis 173
Network Analysis 173
4.5 Data Output and Cartography 174
4.6 GIS Software 174
References 175
Computational Chemistry 177
1 Introduction 177
1.1 Molecular Multi-Center Integrals Over Exponential Type Functions 177
1.2 Nonlinear Transformations and Extrapolation Methods 179
1.3 Extrapolation Methods and Molecular Integrals over ETFs 180
2 General Definitions and Properties 181
3 Molecular Integrals and the Fourier Transform 185
3.1 Three-Center Nuclear Attraction Integrals 185
3.2 Hybrid and Three-Center Two-Electron Coulomb Integrals 185
3.3 Four-Center Two-Electron Coulomb Integrals 187
3.4 Analytic Development 188
4 Nonlinear Transformations and Extrapolation Methods 193
4.1 The Nonlinear Transformation 193
4.2 The S Transformation 194
4.3 Recurrence Relations for the S Transformation 199
5 Numerical Treatment of ETFs Molecular Multi-Center Integrals 201
6 Conclusion and Numerical Tables 203
References 209
String Mining in Bioinformatics 211
1 Introduction 211
2 Background 212
Intrinsic Strategy 212
Comparative Strategy 213
Data Structures and Algorithms for String Mining 214
3 Basic Definitions and Data Structures 214
3.1 Basic Notions 214
3.2 Look-Up Table 215
3.3 Automata and Tries 216
3.4 The Suffix Tree 217
4 Repeat-Related Problems 218
4.1 Identifying Dispersed Repeats 219
Identifying Exact Fixed Length Repeats 219
Finding Variable Length Repeats 220
4.2 Identifying Tandem Repeats 223
4.3 Identifying Unique Subsequences 224
4.4 Finding Absent Words 224
4.5 Approximate Repeats 225
4.6 Constraints on the Repeats 225
5 Sequence Comparison 227
5.1 Global and Semi-Global Exact Pattern Matching 227
5.2 Local Exact Matches 229
Finding Exact k-mers 230
Maximal Exact Matches 231
5.3 Approximate String Comparison and Sequence Alignment 232
The Global Alignment Algorithm 233
The Local Alignment Algorithm 235
Semi-Global Sequence Alignment 236
Variations of Approximate Matching 237
Heuristics for sequence alignment 239
6 Applications of String Mining in Other Areas 243
6.1 Applying the Trie to Finding Frequent Itemsets 243
6.2 Application of String Comparison to a Data Mining Technique: Computing String Kernels 246
6.3 String Mining in Unstructured and Semi-Structured Text Documents 247
7 Conclusions 248
References 249
Part III Data Mining and Knowledge Discovery 252
Knowledge Discovery and Reasoning in Geospatial Applications 253
1 Introduction 253
2 Data Characteristics in Geospatial Data Mining 254
3 Spatial Data Mining Techniques 256
3.1 Classical Data Mining Techniques 256
3.2 Specialized Spatial Data Mining Techniques 257
Clustering 257
Classification 258
Association Rules 259
Outlier Detection 260
4 Applications 261
4.1 Application of GDM in Business 262
4.2 Traffic Networks 263
4.3 Application to Earth Observation and Change Detection 263
5 Research Challenges 264
5.1 Geospatial Databases 264
5.2 Issues in Geospatial Algorithms 266
5.3 Geospatial Data Mining 266
6 Conclusion 267
References 268
Data Mining and Discovery of Chemical Knowledge 271
1 Data-Mining-Based Prediction on Formation of Ternary Intermetallic Compounds 271
1.1 Data-Mining-Based Prediction on Formation of Ternary Intermetallic Compounds Between Nontransition Elements 271
1.2 Methods of Investigation 272
Atomic Parameters Used for Data Mining 272
Methods of Computation 272
Data Files for Training and Model Building 273
1.3 Results and Discussion 273
Intermetallic Compounds Between Nontransition Elements 273
Intermetallic Compounds Between Transition Elements 277
Intermetallic Compounds Between One Nontransition Element and Two Transition Elements 282
Intermetallic Compounds Between Two Nontransition Elements and One Transition Element 286
Regularities of Formation of Ternary Compounds Containing Cu-Group Elements 289
2 Data-Mining-Based Prediction on Structure-Activity Relationships of Drugs 291
2.1 Methodology 292
Support Vector Classification 292
Support Vector Regression (SVR) 294
Implementation of Support Vector Machine (SVM) 295
2.2 Using Support Vector Classification for Anti-HIV-1 Activities of HEPT-Analog Compounds 295
Data set 296
Descriptors 296
Selection of the Kernel Function and the Capacity Parameter C 297
Modeling of SVC 299
Results of cross validation test 299
2.3 Predicting Aldose Reductase Inhibitory Activities of Flavones Using Support Vector Regression 300
Data Set 301
Selection of Kernel Function and the Molecular Descriptors 301
Optimization of SVR parameters 304
Results of LOOCV of SVR and Discussion 304
2.4 Conclusion 307
3 Data-Mining-Based Industrial Optimization System for Chemical Process 307
3.1 Methodology 309
3.2 DMOS: A Software System Based on Data Mining for Chemical Process Optimization and Monitoring 309
Description of the Off-line DMOS Version 310
Description of the on-line DMOS version 312
Case Study of DMOS: Optimization of Ammonia Synthesis Process 313
Discussions and Conclusions 316
References 317
Data Mining and Discovery of Astronomical Knowledge 320
1 Introduction 320
1.1 Problem Statement 323
1.2 Contributions 323
1.3 Chapter Outline 324
2 Basic Definitions and Concepts 324
3 Mining Complex Co-location Rules 326
3.1 Data Preparation 327
3.1.1 Data Extraction 327
Data Transformation 329
New Attributes Creation 329
Galaxies Categorization 330
3.2 Mining Maximal Cliques 330
GridClique Algorithm 330
GridClique Algorithm Analysis 334
3.3 Extracting Complex Relationships 335
3.4 Mining Interesting Complex Relationships 336
4 Experiments and Results 336
4.1 Experimental Setup 336
4.2 Results 337
Galaxy Types in Large Maximal Cliques 337
Cliques Cardinalities 337
GridClique Performance 337
Interesting Rules from SDSS 340
5 Summary 341
References 342
Part IV Future Trends 343
On-board Data Mining 344
1 Problems Encountered with On-Board Mining 345
1.1 Power 346
1.2 Bandwidth 347
1.3 Computation 348
1.4 Storage and Memory 349
1.5 Georeferencing 349
2 The Use of Standards for On-Board Data Mining 350
2.1 SensorML 350
2.2 Describing Sensors in SensorML 351
2.3 Sensor Platforms 352
2.4 Measurements and Response 353
3 The Use of FPGAs for On-Board Systems 353
4 Applications for On-Board Data Mining 355
4.1 Autonomy 356
4.2 EO-1 356
4.3 Mars Odyssey 358
4.4 Mars Rover 360
4.5 Deep Space 362
5 Unmanned Vehicles 363
6 Biometrics 365
7 Sensor Networks 370
8 Conclusion 371
References 371
Data Streams: An Overview and Scientific Applications 376
1 Introduction 376
2 Stream Management Issues 377
3 Stream Mining Algorithms 378
3.1 Data Stream Clustering 378
3.2 Data Stream Classification 380
3.3 Frequent Pattern Mining 381
3.4 Change Detection in Data Streams 383
3.5 Synopsis Construction in Data Streams 384
3.6 Dimensionality Reduction and Forecasting in Data Streams 390
3.7 Distributed Mining of Data Streams 390
4 Scientific Applications of Data Streams 391
4.1 Network Monitoring 391
4.2 Intrusion Detection 391
4.3 Sensor Network Analysis 392
4.4 Cosmological Applications 392
4.5 Mobile Applications 393
4.6 Environmental and Weather Data 393
5 Conclusions and Research Directions 393
References 394
Index 397

Publication date (per publisher)	19 September 2009
Additional info	X, 400 p.
Place of publication	Berlin
Language	English
Subject areas	Mathematics / Computer Science ► Computer Science ► Databases
	Computer Science ► Theory / Studies ► Artificial Intelligence / Robotics
	Natural Sciences ► Chemistry
	Engineering
Keywords	Bioinformatics • Data Analysis • Data Mining • Data streams • Evolution • Knowledge • Knowledge Discovery • Knowledge Representation • learning • machine learning • Ontologies • Ontology • Philosophy • Science • Scientific Computation • Statistical Inference • Statistics
ISBN-10	3-642-02788-1 / 3642027881
ISBN-13	978-3-642-02788-8 / 9783642027888

PDF (watermark)
Size: 8.1 MB

DRM: Digital watermark
This eBook contains a digital watermark and is thereby personalized for you. If the eBook is improperly passed on to third parties, it can be traced back to the source.

File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly well suited to specialist books with columns, tables, and figures. A PDF can be displayed on almost any device, but is of limited suitability for small displays (smartphone, eReader).

System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You need a PDF viewer, e.g. Adobe Reader or Adobe Digital Editions.
eReader: This eBook can be read on (almost) all eBook readers. However, it is not compatible with the Amazon Kindle.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You need a PDF viewer, e.g. the free Adobe Digital Editions app.

Buying eBooks from abroad
For tax law reasons, we can sell eBooks only within Germany and Switzerland. Regrettably, we cannot fulfill eBook orders from other countries.

Print edition

Book | Softcover

CHF 149.75