Text Mining (eBook)
XIII, 373 pages
Springer International Publishing (publisher)
978-3-319-91815-0 (ISBN)
This book discusses text mining and the ways this type of data mining can be used to extract implicit knowledge from text collections. The author provides guidelines for implementing text mining systems in Java, along with the underlying concepts and approaches. The book starts with detailed text preprocessing techniques and then covers the concepts, techniques, implementation, and evaluation of text categorization. It then moves on to more advanced topics, including text summarization, text segmentation, topic mapping, and automatic text management.
Dr. Taeho Jo is a faculty member in the School of Games at Hongik University, South Korea. He received his PhD from the University of Ottawa in 2006. His research spans text mining, neural networks, machine learning, and information retrieval. He has four years' experience working for industrial organizations and ten years' experience working in academia. He has published almost 150 research papers and has been recognized twice in the worldwide biographical dictionary 'Marquis Who's Who in the World'.
Preface
Contents
Part I Foundation
1 Introduction
1.1 Definition of Text Mining
1.2 Texts
1.2.1 Text Components
1.2.2 Text Formats
1.3 Data Mining Tasks
1.3.1 Classification
1.3.2 Clustering
1.3.3 Association
1.4 Data Mining Types
1.4.1 Relational Data Mining
1.4.2 Web Mining
1.4.3 Big Data Mining
1.5 Summary
2 Text Indexing
2.1 Overview of Text Indexing
2.2 Steps of Text Indexing
2.2.1 Tokenization
2.2.2 Stemming
2.2.3 Stop-Word Removal
2.2.4 Term Weighting
2.3 Text Indexing: Implementation
2.3.1 Class Definition
2.3.2 Stemming Rule
2.3.3 Method Implementations
2.4 Additional Steps
2.4.1 Index Filtering
2.4.2 Index Expansion
2.4.3 Index Optimization
2.5 Summary
3 Text Encoding
3.1 Overview of Text Encoding
3.2 Feature Selection
3.2.1 Wrapper Approach
3.2.2 Principal Component Analysis
3.2.3 Independent Component Analysis
3.2.4 Singular Value Decomposition
3.3 Feature Value Assignment
3.3.1 Assignment Schemes
3.3.2 Similarity Computation
3.4 Issues of Text Encoding
3.4.1 Huge Dimensionality
3.4.2 Sparse Distribution
3.4.3 Poor Transparency
3.5 Summary
4 Text Association
4.1 Overview of Text Association
4.2 Data Association
4.2.1 Functional View
4.2.2 Support and Confidence
4.2.3 Apriori Algorithm
4.3 Word Association
4.3.1 Word Text Matrix
4.3.2 Functional View
4.3.3 Simple Example
4.4 Text Association
4.4.1 Functional View
4.4.2 Simple Example
4.5 Overall Summary
Part II Text Categorization
5 Text Categorization: Conceptual View
5.1 Definition of Text Categorization
5.2 Data Classification
5.2.1 Binary Classification
5.2.2 Multiple Classification
5.2.3 Classification Decomposition
5.2.4 Regression
5.3 Classification Types
5.3.1 Hard vs Soft Classification
5.3.2 Flat vs Hierarchical Classification
5.3.3 Single vs Multiple Viewed Classification
5.3.4 Independent vs Dependent Classification
5.4 Variants of Text Categorization
5.4.1 Spam Mail Filtering
5.4.2 Sentimental Analysis
5.4.3 Information Filtering
5.4.4 Topic Routing
5.5 Summary and Further Discussions
6 Text Categorization: Approaches
6.1 Machine Learning
6.2 Lazy Learning
6.2.1 K Nearest Neighbor
6.2.2 Radius Nearest Neighbor
6.2.3 Distance-Based Nearest Neighbor
6.2.4 Attribute Discriminated Nearest Neighbor
6.3 Probabilistic Learning
6.3.1 Bayes Rule
6.3.2 Bayes Classifier
6.3.3 Naive Bayes
6.3.4 Bayesian Learning
6.4 Kernel Based Classifier
6.4.1 Perceptron
6.4.2 Kernel Functions
6.4.3 Support Vector Machine
6.4.4 Optimization Constraints
6.5 Summary and Further Discussions
7 Text Categorization: Implementation
7.1 System Architecture
7.2 Class Definitions
7.2.1 Classes: Word, Text, and PlainText
7.2.2 Interface and Class: Classifier and KNearestNeighbor
7.2.3 Class: TextClassificationAPI
7.3 Method Implementations
7.3.1 Class: Word
7.3.2 Class: PlainText
7.3.3 Class: KNearestNeighbor
7.3.4 Class: TextClassificationAPI
7.4 Graphic User Interface and Demonstration
7.4.1 Class: TextClassificationGUI
7.4.2 Preliminary Tasks and Encoding
7.4.3 Classification Process
7.4.4 System Upgrading
7.5 Summary and Further Discussions
8 Text Categorization: Evaluation
8.1 Evaluation Overview
8.2 Text Collections
8.2.1 NewsPage.com
8.2.2 20NewsGroups
8.2.3 Reuter21578
8.2.4 OSHUMED
8.3 F1 Measure
8.3.1 Contingency Table
8.3.2 Micro-Averaged F1
8.3.3 Macro-Averaged F1
8.3.4 Example
8.4 Statistical t-Test
8.4.1 Student's t-Distribution
8.4.2 Unpaired Difference Inference
8.4.3 Paired Difference Inference
8.4.4 Example
8.5 Summary and Further Discussions
Part III Text Clustering
9 Text Clustering: Conceptual View
9.1 Definition of Text Clustering
9.2 Data Clustering
9.2.1 SubSubsectionTitle
9.2.2 Association vs Clustering
9.2.3 Classification vs Clustering
9.2.4 Constraint Clustering
9.3 Clustering Types
9.3.1 Static vs Dynamic Clustering
9.3.2 Crisp vs Fuzzy Clustering
9.3.3 Flat vs Hierarchical Clustering
9.3.4 Single vs Multiple Viewed Clustering
9.4 Derived Tasks from Text Clustering
9.4.1 Cluster Naming
9.4.2 Subtext Clustering
9.4.3 Automatic Sampling for Text Categorization
9.4.4 Redundant Project Detection
9.5 Summary and Further Discussions
10 Text Clustering: Approaches
10.1 Unsupervised Learning
10.2 Simple Clustering Algorithms
10.2.1 AHC Algorithm
10.2.2 Divisive Clustering Algorithm
10.2.3 Single Pass Algorithm
10.2.4 Growing Algorithm
10.3 K Means Algorithm
10.3.1 Crisp K Means Algorithm
10.3.2 Fuzzy K Means Algorithm
10.3.3 Gaussian Mixture
10.3.4 K Medoid Algorithm
10.4 Competitive Learning
10.4.1 Kohonen Networks
10.4.2 Learning Vector Quantization
10.4.3 Two-Dimensional Self-Organizing Map
10.4.4 Neural Gas
10.5 Summary and Further Discussions
11 Text Clustering: Implementation
11.1 System Architecture
11.2 Class Definitions
11.2.1 Classes in Text Categorization System
11.2.2 Class: Cluster
11.2.3 Interface: ClusterAnalyzer
11.2.4 Class: AHCAlgorithm
11.3 Method Implementations
11.3.1 Methods in Previous Classes
11.3.2 Class: Cluster
11.3.3 Class: AHC Algorithm
11.4 Class: ClusterAnalysisAPI
11.4.1 Class: ClusterAnalysisAPI
11.4.2 Class: ClusterAnalyzerGUI
11.4.3 Demonstration
11.4.4 System Upgrading
11.5 Summary and Further Discussions
12 Text Clustering: Evaluation
12.1 Introduction
12.2 Cluster Validations
12.2.1 Intra-Cluster and Inter-Cluster Similarities
12.2.2 Internal Validation
12.2.3 Relative Validation
12.2.4 External Validation
12.3 Clustering Index
12.3.1 Computation Process
12.3.2 Evaluation of Crisp Clustering
12.3.3 Evaluation of Fuzzy Clustering
12.3.4 Evaluation of Hierarchical Clustering
12.4 Parameter Tuning
12.4.1 Clustering Index for Unlabeled Documents
12.4.2 Simple Clustering Algorithm with Parameter Tuning
12.4.3 K Means Algorithm with Parameter Tuning
12.4.4 Evolutionary Clustering Algorithm
12.5 Summary and Further Discussions
Part IV Advanced Topics
13 Text Summarization
13.1 Definition of Text Summarization
13.2 Text Summarization Types
13.2.1 Manual vs Automatic Text Summarization
13.2.2 Single vs Multiple Text Summarization
13.2.3 Flat vs Hierarchical Text Summarization
13.2.4 Abstraction vs Query-Based Summarization
13.3 Approaches to Text Summarization
13.3.1 Heuristic Approaches
13.3.2 Mapping into Classification Task
13.3.3 Sampling Schemes
13.3.4 Application of Machine Learning Algorithms
13.4 Combination with Other Text Mining Tasks
13.4.1 Summary-Based Classification
13.4.2 Summary-Based Clustering
13.4.3 Topic-Based Summarization
13.4.4 Text Expansion
13.5 Summary and Further Discussions
14 Text Segmentation
14.1 Definition of Text Segmentation
14.2 Text Segmentation Type
14.2.1 Spoken vs Written Text Segmentation
14.2.2 Ordered vs Unordered Text Segmentation
14.2.3 Exclusive vs Overlapping Segmentation
14.2.4 Flat vs Hierarchical Text Segmentation
14.3 Machine Learning-Based Approaches
14.3.1 Heuristic Approaches
14.3.2 Mapping into Classification
14.3.3 Encoding Adjacent Paragraph Pairs
14.3.4 Application of Machine Learning
14.4 Derived Tasks
14.4.1 Temporal Topic Analysis
14.4.2 Subtext Retrieval
14.4.3 Subtext Synthesization
14.4.4 Virtual Text
14.5 Summary and Further Discussions
15 Taxonomy Generation
15.1 Definition of Taxonomy Generation
15.2 Relevant Tasks to Taxonomy Generation
15.2.1 Keyword Extraction
15.2.2 Word Categorization
15.2.3 Word Clustering
15.2.4 Topic Routing
15.3 Taxonomy Generation Schemes
15.3.1 Index-Based Scheme
15.3.2 Clustering-Based Scheme
15.3.3 Association-Based Scheme
15.3.4 Link Analysis-Based Scheme
15.4 Taxonomy Governance
15.4.1 Taxonomy Maintenance
15.4.2 Taxonomy Growth
15.4.3 Taxonomy Integration
15.4.4 Ontology
15.5 Summary and Further Discussions
16 Dynamic Document Organization
16.1 Definition of Dynamic Document Organization
16.2 Online Clustering
16.2.1 Online Clustering in Functional View
16.2.2 Online K Means Algorithm
16.2.3 Online Unsupervised KNN Algorithm
16.2.4 Online Fuzzy Clustering
16.3 Dynamic Organization
16.3.1 Execution Process
16.3.2 Maintenance Mode
16.3.3 Creation Mode
16.3.4 Additional Tasks
16.4 Issues of Dynamic Document Organization
16.4.1 Text Representation
16.4.2 Binary Decomposition
16.4.3 Transition into Creation Mode
16.4.4 Variants of DDO System
16.5 Summary and Further Discussions
References
Index
Publication date (per publisher) | 7 June 2018 |
---|---|
Series | Studies in Big Data |
Additional info | XIII, 373 p. 236 illus., 148 illus. in color. |
Place of publication | Cham |
Language | English |
Subject areas | Mathematics / Computer Science ► Computer Science ► Databases; Computer Science ► Theory / Studies ► Artificial Intelligence / Robotics; Engineering ► Electrical Engineering / Energy Technology; Business and Economics |
Keywords | Automatic text management • Data Mining • Text mining systems in Java • text categorization • Text Clustering • Text Mining • Text Segmentation • Text Summarization |
ISBN-10 | 3-319-91815-X / 331991815X |
ISBN-13 | 978-3-319-91815-0 / 9783319918150 |
Size: 12.6 MB
DRM: Digital watermark
This eBook contains a digital watermark and is thereby personalized for you. If the eBook is improperly passed on to third parties, it can be traced back to the source.
File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly well suited to technical books with columns, tables, and figures. A PDF can be displayed on almost all devices but is only of limited suitability for small displays (smartphone, eReader).
System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You need a PDF viewer for this, e.g. Adobe Reader or Adobe Digital Editions.
eReader: This eBook can be read on (almost) all eBook readers. It is, however, not compatible with the Amazon Kindle.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You need a PDF viewer for this, e.g. the free Adobe Digital Editions app.
Additional feature: Online reading
In addition to downloading it, you can also read this eBook online in your web browser.
Buying eBooks from abroad
For tax reasons, we can sell eBooks only within Germany and Switzerland. Regrettably, we cannot fulfill eBook orders from other countries.