Data Mining (eBook)

Practical Machine Learning Tools and Techniques, Second Edition

Eibe Frank, Ian H. Witten (Authors)

eBook Download: PDF

2005 | 2nd edition
560 pages
Elsevier Science (Publisher)
978-0-08-047702-2 (ISBN)

As with any burgeoning technology that enjoys commercial attention, the use of data mining is surrounded by a great deal of hype. Exaggerated reports tell of secrets that can be uncovered by setting algorithms loose on oceans of data. But there is no magic in machine learning, no hidden power, no alchemy. Instead there is an identifiable body of practical techniques that can extract useful information from raw data. This book describes these techniques and shows how they work.

The book is a major revision of the first edition that appeared in 1999. While the basic core remains the same, it has been updated to reflect the changes that have taken place over five years, and now has nearly double the references. The highlights of the new edition include thirty new technique sections; an enhanced Weka machine learning workbench, which now features an interactive interface; comprehensive information on neural networks; a new section on Bayesian networks; and much more.

* Algorithmic methods at the heart of successful data mining, including tried and true techniques as well as leading edge methods
* Performance improvement techniques that work by transforming the input or output
* Downloadable Weka, a collection of machine learning algorithms for data mining tasks, including tools for data pre-processing, classification, regression, clustering, association rules, and visualization, in a new, interactive interface

This text is designed for information systems practitioners, programmers, consultants, developers, information technology managers, and specification writers, as well as professors and students of graduate-level data mining and machine learning courses.

Cover
Foreword
Table of contents
List of Figures
List of Tables
Preface
Updated and revised content
Acknowledgments
Part I Machine Learning Tools and Techniques
1 What's It All About?
1.1 Data mining and machine learning
1.2 Simple examples: The weather problem and others
1.3 Fielded applications
1.4 Machine learning and statistics
1.5 Generalization as search
1.6 Data mining and ethics
1.7 Further reading
2 Input: Concepts, Instances, and Attributes
2.1 What's a concept?
2.2 What's in an example?
2.3 What's in an attribute?
2.4 Preparing the input
2.5 Further reading
3 Output: Knowledge Representation
3.1 Decision tables
3.2 Decision trees
3.3 Classification rules
3.4 Association rules
3.5 Rules with exceptions
3.6 Rules involving relations
3.7 Trees for numeric prediction
3.8 Instance-based representation
3.9 Clusters
3.10 Further reading
4 Algorithms: The Basic Methods
4.1 Inferring rudimentary rules
4.2 Statistical modeling
4.3 Divide-and-conquer: Constructing decision trees
4.4 Covering algorithms: Constructing rules
4.5 Mining association rules
4.6 Linear models
4.7 Instance-based learning
4.8 Clustering
4.9 Further reading
5 Credibility: Evaluating What's Been Learned
5.1 Training and testing
5.2 Predicting performance
5.3 Cross-validation
5.4 Other estimates
5.5 Comparing data mining methods
5.6 Predicting probabilities
5.7 Counting the cost
5.8 Evaluating numeric prediction
5.9 The minimum description length principle
5.10 Applying the MDL principle to clustering
5.11 Further reading
6 Implementations: Real Machine Learning Schemes
6.1 Decision trees
6.2 Classification rules
6.3 Extending linear models
6.4 Instance-based learning
6.5 Numeric prediction
6.6 Clustering
6.7 Bayesian networks
7 Transformations: Engineering the input and output
7.1 Attribute selection
7.2 Discretizing numeric attributes
7.3 Some useful transformations
7.4 Automatic data cleansing
7.5 Combining multiple models
7.6 Using unlabeled data
7.7 Further reading
8 Moving on: Extensions and Applications
8.1 Learning from massive datasets
8.2 Incorporating domain knowledge
8.3 Text and Web mining
8.4 Adversarial situations
8.5 Ubiquitous data mining
8.6 Further reading
Part II The Weka Machine Learning Workbench
9 Introduction to Weka
9.1 What's in Weka?
9.2 How do you use it?
9.3 What else can you do?
9.4 How do you get it?
10 The Explorer
10.1 Getting started
10.2 Exploring the Explorer
10.3 Filtering algorithms
10.4 Learning algorithms
10.5 Metalearning algorithms
10.6 Clustering algorithms
10.7 Association-rule learners
10.8 Attribute selection
11 The Knowledge Flow Interface
11.1 Getting started
11.2 The Knowledge Flow components
11.3 Configuring and connecting the components
11.4 Incremental learning
12 The Experimenter
12.1 Getting started
12.2 Simple setup
12.3 Advanced setup
12.4 The Analyze panel
12.5 Distributing processing over several machines
13 The Command-line Interface
13.1 Getting started
13.2 The structure of Weka
13.3 Command-line options
14 Embedded Machine Learning
14.1 A simple data mining application
14.2 Going through the code
15 Writing New Learning Schemes
15.1 An example classifier
15.2 Conventions for implementing classifiers
References
Index
About the Authors

Preface

The convergence of computing and communication has produced a society that feeds on information. Yet most of the information is in its raw form: data. If data is characterized as recorded facts, then information is the set of patterns, or expectations, that underlie the data. There is a huge amount of information locked up in databases—information that is potentially important but has not yet been discovered or articulated. Our mission is to bring it forth.

Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. The idea is to build computer programs that sift through databases automatically, seeking regularities or patterns. Strong patterns, if found, will likely generalize to make accurate predictions on future data. Of course, there will be problems. Many patterns will be banal and uninteresting. Others will be spurious, contingent on accidental coincidences in the particular dataset used. In addition, real data is imperfect: some parts will be garbled, and some will be missing. Anything discovered will be inexact: there will be exceptions to every rule and cases not covered by any rule. Algorithms need to be robust enough to cope with imperfect data and to extract regularities that are inexact but useful.

Machine learning provides the technical basis of data mining. It is used to extract information from the raw data in databases—information that is expressed in a comprehensible form and can be used for a variety of purposes. The process is one of abstraction: taking the data, warts and all, and inferring whatever structure underlies it. This book is about the tools and techniques of machine learning used in practical data mining for finding, and describing, structural patterns in data.

As with any burgeoning new technology that enjoys intense commercial attention, the use of data mining is surrounded by a great deal of hype in the technical—and sometimes the popular—press. Exaggerated reports appear of the secrets that can be uncovered by setting learning algorithms loose on oceans of data. But there is no magic in machine learning, no hidden power, no alchemy. Instead, there is an identifiable body of simple and practical techniques that can often extract useful information from raw data. This book describes these techniques and shows how they work.

We interpret machine learning as the acquisition of structural descriptions from examples. The kind of descriptions found can be used for prediction, explanation, and understanding. Some data mining applications focus on prediction: forecasting what will happen in new situations from data that describe what happened in the past, often by guessing the classification of new examples. But we are equally—perhaps more—interested in applications in which the result of “learning” is an actual description of a structure that can be used to classify examples. This structural description supports explanation, understanding, and prediction. In our experience, insights gained by the applications' users are of most interest in the majority of practical data mining applications; indeed, this is one of machine learning's major advantages over classical statistical modeling.
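To make the idea of a "structural description" concrete, here is a minimal sketch in Java of a learner that produces a small, readable rule set rather than an opaque predictor. It follows the general idea of inferring rudimentary one-attribute rules (the topic of Section 4.1 in the table of contents): for each attribute, predict the majority class for each value, then keep the attribute with the fewest training errors. The class name `OneRSketch`, its methods, and the tie-breaking choice are hypothetical and are not Weka's implementation. The rows are assumed to match the small 14-instance nominal weather dataset used as the book's running example; treat them as illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OneRSketch {
    // Attribute names; the last column of each row is the class ("play").
    static final String[] ATTRIBUTES = {"outlook", "temperature", "humidity", "windy"};

    // Assumed copy of the small nominal weather dataset (illustrative only).
    static final String[][] WEATHER = {
        {"sunny", "hot", "high", "false", "no"},
        {"sunny", "hot", "high", "true", "no"},
        {"overcast", "hot", "high", "false", "yes"},
        {"rainy", "mild", "high", "false", "yes"},
        {"rainy", "cool", "normal", "false", "yes"},
        {"rainy", "cool", "normal", "true", "no"},
        {"overcast", "cool", "normal", "true", "yes"},
        {"sunny", "mild", "high", "false", "no"},
        {"sunny", "cool", "normal", "false", "yes"},
        {"rainy", "mild", "normal", "false", "yes"},
        {"sunny", "mild", "normal", "true", "yes"},
        {"overcast", "mild", "high", "true", "yes"},
        {"overcast", "hot", "normal", "false", "yes"},
        {"rainy", "mild", "high", "true", "no"},
    };

    /** Rules on a single attribute: value -> predicted class, plus training errors. */
    static final class Rule {
        final int attribute;
        final Map<String, String> prediction;
        final int errors;

        Rule(int attribute, Map<String, String> prediction, int errors) {
            this.attribute = attribute;
            this.prediction = prediction;
            this.errors = errors;
        }
    }

    static Rule learn(String[][] data) {
        int classIndex = data[0].length - 1;
        Rule best = null;
        for (int a = 0; a < classIndex; a++) {
            // Count how often each class occurs with each value of attribute a.
            Map<String, Map<String, Integer>> counts = new LinkedHashMap<>();
            for (String[] row : data) {
                counts.computeIfAbsent(row[a], k -> new LinkedHashMap<>())
                      .merge(row[classIndex], 1, Integer::sum);
            }
            // Predict the majority class per value; the rest count as errors.
            Map<String, String> prediction = new LinkedHashMap<>();
            int errors = 0;
            for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
                String majority = null;
                int majorityCount = 0;
                int total = 0;
                for (Map.Entry<String, Integer> c : e.getValue().entrySet()) {
                    total += c.getValue();
                    if (c.getValue() > majorityCount) {
                        majority = c.getKey();
                        majorityCount = c.getValue();
                    }
                }
                prediction.put(e.getKey(), majority);
                errors += total - majorityCount;
            }
            // Strict '<' keeps the earliest attribute when error counts tie
            // (a choice made for this sketch).
            if (best == null || errors < best.errors) {
                best = new Rule(a, prediction, errors);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Rule rule = learn(WEATHER);
        System.out.println("Rules on " + ATTRIBUTES[rule.attribute] + " ("
            + rule.errors + " of " + WEATHER.length + " training errors):");
        for (Map.Entry<String, String> e : rule.prediction.entrySet()) {
            System.out.println("  if " + ATTRIBUTES[rule.attribute] + " = "
                + e.getKey() + " then play = " + e.getValue());
        }
    }
}
```

The output of such a learner is a handful of if-then rules that a person can read, question, and act on, which is the sense in which a learned structure supports explanation and understanding as well as prediction.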

The book explains a variety of machine learning methods. Some are pedagogically motivated: simple schemes designed to explain clearly how the basic ideas work. Others are practical: real systems used in applications today. Many are contemporary and have been developed only in the last few years.

A comprehensive software resource, written in the Java language, has been created to illustrate the ideas in the book. Called the Waikato Environment for Knowledge Analysis, or Weka for short, it is available as source code on the World Wide Web at http://www.cs.waikato.ac.nz/ml/weka. It is a full, industrial-strength implementation of essentially all the techniques covered in this book. It includes illustrative code and working implementations of machine learning methods. It offers clean, spare implementations of the simplest techniques, designed to aid understanding of the mechanisms involved. It also provides a workbench that includes full, working, state-of-the-art implementations of many popular learning schemes that can be used for practical data mining or for research. Finally, it contains a framework, in the form of a Java class library, that supports applications that use embedded machine learning and even the implementation of new learning schemes.

The objective of this book is to introduce the tools and techniques for machine learning that are used in data mining. After reading it, you will understand what these techniques are and appreciate their strengths and applicability. If you wish to experiment with your own data, you will be able to do this easily with the Weka software.

The book spans the gulf between the intensely practical approach taken by trade books that provide case studies on data mining and the more theoretical, principle-driven exposition found in current textbooks on machine learning. (A brief description of these books appears in the Further reading section at the end of Chapter 1.) This gulf is rather wide. To apply machine learning techniques productively, you need to understand something about how they work; this is not a technology that you can apply blindly and expect to get good results. Different problems yield to different techniques, but it is rarely obvious which techniques are suitable for a given situation: you need to know something about the range of possible solutions. We cover an extremely wide range of techniques. We can do this because, unlike many trade books, this volume does not promote any particular commercial software or approach. We include a large number of examples, but they use illustrative datasets that are small enough to allow you to follow what is going on. Real datasets are far too large to show this (and in any case are usually company confidential). Our datasets are chosen not to illustrate actual large-scale practical problems but to help you understand what the different techniques do, how they work, and what their range of application is.

The book is aimed at the technically aware general reader interested in the principles and ideas underlying the current practice of data mining. It will also be of interest to information professionals who need to become acquainted with this new technology and to all those who wish to gain a detailed technical understanding of what machine learning involves. It is written for an eclectic audience of information systems practitioners, programmers, consultants, developers, information technology managers, specification writers, patent examiners, and curious laypeople—as well as students and professors—who need an easy-to-read book with lots of illustrations that describes what the major machine learning techniques are, what they do, how they are used, and how they work. It is practically oriented, with a strong “how to” flavor, and includes algorithms, code, and implementations. All those involved in practical data mining will benefit directly from the techniques described. The book is aimed at people who want to cut through to the reality that underlies the hype about machine learning and who seek a practical, nonacademic, unpretentious approach. We have avoided requiring any specific theoretical or mathematical knowledge except in some sections marked by a light gray bar in the margin. These contain optional material, often for the more technical or theoretically inclined reader, and may be skipped without loss of continuity.

The book is organized in layers that make the ideas accessible to readers who are interested in grasping the basics and to those who would like more depth of treatment, along with full details on the techniques covered. We believe that consumers of machine learning need to have some idea of how the algorithms they use work. It is often observed that data models are only as good as the person who interprets them, and that person needs to know something about how the models are produced to appreciate the strengths, and limitations, of the technology. However, it is not necessary for all data model users to have a deep understanding of the finer details of the algorithms.

We address this situation by describing machine learning methods at successive levels of detail. You will learn the basic ideas, the topmost level, by reading the first three chapters. Chapter 1 describes, through examples, what machine learning is and where it can be used; it also provides actual practical applications. Chapters 2 and 3 cover the kinds of input and output—or knowledge representation—involved. Different kinds of output dictate different styles of algorithm, and at the next level Chapter 4 describes the basic methods of machine learning, simplified to make them easy to comprehend. Here the principles involved are conveyed in a variety of algorithms without getting into intricate details or tricky implementation issues. To make progress in the application of machine learning techniques to particular data mining problems, it is essential to be able to measure how well you are doing. Chapter 5, which can be read out of sequence, equips you to evaluate the results obtained from machine learning, addressing the sometimes complex issues involved in performance evaluation.

At the lowest and most detailed level, Chapter 6 exposes in naked detail the nitty-gritty issues of implementing a spectrum of machine learning algorithms, including the complexities necessary for them to work well in practice. Although many readers may...

Publication date (per publisher)	13.7.2005
Language	English
Subject area	Nonfiction / Guides
	Computer Science ► Databases ► Data Warehouse / Data Mining
	Computer Science ► Theory / Studies ► Artificial Intelligence / Robotics
ISBN-10	0-08-047702-X / 008047702X
ISBN-13	978-0-08-047702-2 / 9780080477022

PDF (Adobe DRM)

Copy protection: Adobe DRM
Adobe DRM is a copy-protection scheme intended to protect the eBook against misuse. The eBook is authorized to your personal Adobe ID at download time. You can then read it only on devices that are also registered to your Adobe ID.

File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly well suited to technical books with columns, tables, and figures. A PDF can be displayed on almost any device but is of limited use on small displays (smartphone, eReader).

System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You need an Adobe ID and the Adobe Digital Editions software (free). We advise against using the OverDrive Media Console; in our experience it frequently causes problems with Adobe DRM.
eReader: This eBook can be read on (almost) all eBook readers. It is not compatible with the Amazon Kindle, however.
Smartphone/Tablet: You can read this eBook on either Apple or Android devices. You need an Adobe ID and a free app.

Buying eBooks from abroad
For tax reasons, we can sell eBooks only within Germany and Switzerland. Unfortunately, we cannot fulfill eBook orders from other countries.

Print edition

Book | Softcover

CHF 79,95