Text Mining (eBook)
XII, 237 pages
Springer New York (publisher)
978-0-387-34555-0 (ISBN)
Data mining is a mature technology. The prediction problem, looking for predictive patterns in data, has been widely studied. Strong methods are available to the practitioner. These methods process structured numerical information, where uniform measurements are taken over a sample of data. Text is often described as unstructured information. So, it would seem, text and numerical data are different, requiring different methods. Or are they? In our view, a prediction problem can be solved by the same methods, whether the data are structured numerical measurements or unstructured text. Text and documents can be transformed into measured values, such as the presence or absence of words, and the same methods that have proven successful for predictive data mining can be applied to text. Yet, there are key differences. Evaluation techniques must be adapted to the chronological order of publication and to alternative measures of error. Because the data are documents, more specialized analytical methods may be preferred for text. Moreover, the methods must be modified to accommodate very high dimensions: tens of thousands of words and documents. Still, the central themes are similar.
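As a rough illustration of turning documents into measured values, the following Python sketch encodes each document as a vector of 0/1 flags marking the presence or absence of each vocabulary word. It is not taken from the book or its supplementary software; the function names (`tokenize`, `build_vocabulary`, `to_binary_vector`), the simple regular-expression tokenizer, and the two sample documents are all assumptions made for this example.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and return its alphabetic word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents: List[str]) -> List[str]:
    """Collect every distinct token in the collection, sorted for a stable column order."""
    return sorted({token for doc in documents for token in tokenize(doc)})


def to_binary_vector(text: str, vocabulary: List[str]) -> List[int]:
    """Return 1 for each vocabulary word present in the text, and 0 otherwise."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocabulary]


# Hypothetical two-document collection, used only to show the shape of the output.
docs = ["Stock prices rise", "Prices fall as stock sells"]
vocab = build_vocabulary(docs)
vectors = [to_binary_vector(doc, vocab) for doc in docs]

for doc, vec in zip(docs, vectors):
    print(vec, doc)
```

Each row of `vectors` is an ordinary numerical record, so it could in principle be passed to a standard prediction method. With a real collection the vocabulary grows to tens of thousands of words, which is the high-dimensionality issue noted in the description.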
Preface
Audience
Supplementary Web Software
Acknowledgements
Contents
1 Overview of Text Mining
1.1 What’s Special about Text Mining?
1.2 What Types of Problems Can Be Solved?
1.3 Document Classification
1.4 Information Retrieval
1.5 Clustering and Organizing Documents
1.6 Information Extraction
1.7 Prediction and Evaluation
1.8 The Next Chapters
1.9 Historical and Bibliographical Remarks
2 From Textual Information to Numerical Vectors
2.1 Collecting Documents
2.2 Document Standardization
2.3 Tokenization
2.4 Lemmatization
2.5 Vector Generation for Prediction
2.6 Sentence Boundary Determination
2.7 Part-Of-Speech Tagging
2.8 Word Sense Disambiguation
2.9 Phrase Recognition
2.10 Named Entity Recognition
2.11 Parsing
2.12 Feature Generation
2.13 Historical and Bibliographical Remarks
3 Using Text for Prediction
3.1 Recognizing that Documents Fit a Pattern
3.2 How Many Documents Are Enough?
3.3 Document Classification
3.4 Learning to Predict from Text
3.5 Evaluation of Performance
3.6 Applications
3.7 Historical and Bibliographical Remarks
4 Information Retrieval and Text Mining
4.1 Is Information Retrieval a Form of Text Mining?
4.2 Key Word Search
4.3 Nearest-Neighbor Methods
4.4 Measuring Similarity
4.5 Web-Based Document Search
4.6 Document Matching
4.7 Inverted Lists
4.8 Evaluation of Performance
4.9 Historical and Bibliographical Remarks
5 Finding Structure in a Document Collection
5.1 Clustering Documents by Similarity
5.2 Similarity of Composite Documents
5.3 What Do a Cluster’s Labels Mean?
5.4 Applications
5.5 Evaluation of Performance
5.6 Historical and Bibliographical Remarks
6 Looking for Information in Documents
6.1 Goals of Information Extraction
6.2 Finding Patterns and Entities from Text
6.3 Coreference and Relationship Extraction
6.4 Template Filling and Database Construction
6.5 Applications
6.6 Historical and Bibliographical Remarks
7 Case Studies
7.1 Market Intelligence from the Web
7.2 Lightweight Document Matching for Digital Libraries
7.3 Generating Model Cases for Help Desk Applications
7.4 Assigning Topics to News Articles
7.5 E-mail Filtering
7.6 Search Engines
7.7 Extracting Named Entities from Documents
7.8 Customized Newspapers
7.9 Historical and Bibliographical Remarks
8 Emerging Directions
8.1 Summarization
8.2 Active Learning
8.3 Learning with Unlabeled Data
8.4 Different Ways of Collecting Samples
8.5 Question Answering
8.6 Historical and Bibliographical Remarks
Appendix: Software Notes
A.1 Summary of Software
A.2 Requirements
A.3 Download Instructions
References
Author Index
Subject Index
Publication date (per publisher) | 8 January 2010 |
---|---|
Additional info | XII, 237 p. |
Place of publication | New York |
Language | English |
Subject area | Computer Science ► Databases ► Data Warehouse / Data Mining; Mathematics / Computer Science ► Computer Science ► Web / Internet |
Keywords | Active learning • classification • Clustering • Clustering and matching • Data Mining • Document classification and correction • extraction • Information Retrieval • Retrieval • Summarization • Text Mining |
ISBN-10 | 0-387-34555-8 / 0387345558 |
ISBN-13 | 978-0-387-34555-0 / 9780387345550 |
DRM: Digital watermark
This eBook contains a digital watermark and is therefore personalized for you. If the eBook is improperly passed on to third parties, it can be traced back to the source.
File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly well suited to technical books with columns, tables, and figures. A PDF can be displayed on almost all devices, but is only of limited suitability for small displays (smartphone, eReader).
System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You need a PDF viewer, e.g. Adobe Reader or Adobe Digital Editions.
eReader: This eBook can be read on (almost) all eBook readers. It is not compatible with the Amazon Kindle, however.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You need a PDF viewer, e.g. the free Adobe Digital Editions app.
Buying eBooks from abroad
For tax reasons, we can sell eBooks only within Germany and Switzerland. Unfortunately, we cannot fulfill eBook orders from other countries.