Text Mining (eBook)

From Ontology Learning to Automated Text Processing Applications

Chris Biemann, Alexander Mehler (Editors)

eBook Download: PDF

2014
X, 238 pages
Springer International Publishing (publisher)
978-3-319-12655-5 (ISBN)

This book comprises a set of articles that specify the methodology of text mining, describe the creation of lexical resources in the framework of text mining, and use text mining for various tasks in natural language processing (NLP). The analysis of large amounts of textual data is a prerequisite for building lexical resources such as dictionaries and ontologies, and it also has direct applications in automated text processing in fields such as history, healthcare and mobile applications, to name just a few. This volume gives an update on recent advances in text mining methods and reflects the most recent achievements in the automatic construction of large lexical resources. It addresses researchers who already perform text mining and those who want to enrich their battery of methods. Selected articles can be used to support graduate-level teaching.

The book is suitable for all readers who have completed undergraduate studies in computational linguistics, quantitative linguistics, computer science or computational humanities. It assumes basic knowledge of computer science, corpus processing and statistics.

After completing his doctoral dissertation with Gerhard Heyer at the University of Leipzig (Germany), Chris Biemann joined the semantic search startup Powerset (San Francisco) in 2008, which was acquired to become part of Microsoft's Bing in the same year. In 2011, he joined TU Darmstadt (Germany) as an assistant professor (W1) for Language Technology. His interests lie in statistical semantics, unsupervised and knowledge-free natural language processing, and in leveraging the wisdom of the crowds for language data acquisition.

Alexander Mehler is a professor (W3) for Computational Humanities / Text Technology at the Goethe University Frankfurt am Main, where he heads the Text Technology Lab as part of the Institute of Informatics. His research interests focus on the empirical analysis and simulative synthesis of discourse units in spoken and written communication. He aims at a quantitative theory of networking in linguistic systems to enable multi-agent simulations of their life cycle. He integrates models of semantic spaces with simulation models of language evolution and topological models of network theory to capture the complexity of linguistic information systems. Currently, he is heading several research projects on the analysis of linguistic networks in historical semantics. Most recently, he started a research project on kinetic text technologies that integrates the paradigm of games with a purpose with the wiki way of collaborative writing and kinetic HCI.

Foreword
List of Reviewers
Contents
Part I Text Mining Techniques and Methodologies
Building Large Resources for Text Mining: The Leipzig Corpora Collection
1 Introduction: The Need for Large Resources
1.1 What is the Right Size of a Corpus?
1.2 How Much Text is There for a Certain Language?
2 Standardization and Availability
2.1 Standardized Processing
2.1.1 Crawling
2.1.2 Pre-processing
2.2 Standardization in Distributed Infrastructures
3 The Leipzig Corpora Collection
3.1 Evolution of the LCC
3.2 Deep Processing
3.2.1 Word Co-occurrences
3.2.2 POS Tagging
3.2.3 Word Similarities
3.2.4 Sentence Similarities
3.3 Language and Corpus Statistics
3.3.1 Quality
3.3.2 Corpus Timeline
3.3.3 Language Description
3.3.4 Application to Typology
3.4 Multiword Units
3.5 Recent Developments and Future Trends
References
Learning Textologies: Networks of Linked Word Clusters
1 Introduction
2 Related Work
3 Building Textologies
3.1 Word Association Graph
3.2 Algorithms
3.2.1 The Cluster Expansion Algorithm
3.2.2 Semantic Context Learning
3.2.3 Link Detection Algorithm
4 Using Textologies
4.1 From Textologies to Ontologies
4.2 Grammar Generation
5 Experiments and Evaluation
5.1 Building a Textology
5.2 Generating Grammars
6 Conclusion
References
Simple, Fast and Accurate Taxonomy Learning
1 Introduction
2 Related Work
3 Taxonomy Term Extraction
3.1 Hyponym Extraction and Filtering
3.2 Hypernym Extraction and Filtering
3.3 Concept Positioning Test
4 Taxonomy Induction
4.1 Positioning Intermediate Concepts
4.2 Graph-Based Concept Reordering
5 Taxonomy Enrichment with Verb-Based Relations
5.1 Problem Formulation
5.2 Learning Verb Relations
5.3 Learning Verb–Preposition Relations
6 Data Collection and Experimental Set Up
6.1 Experiment 1: Hyponym Extraction
6.2 Experiment 2: Hypernym Extraction
6.3 Experiment 3: IS-A Taxonomic Relations
6.4 Experiment 4: Reconstructing WordNet's Taxonomy
6.5 Experiment 5: Taxonomy Verb-Based Enrichment
7 Conclusion
References
A Topology-Based Approach to Visualize the Thematic Composition of Document Collections
1 Introduction
2 Related Work
2.1 Visualization of High-Dimensional Point Data
2.2 Representation and Visualization of Textual Data
3 Pitfalls of Distance-Based Analysis and Projective Visualization
3.1 Distance-Based Analysis in High-Dimensional Spaces
3.2 Projections to Visualize High-Dimensional Clusterings
3.3 Rethinking: How to Present What to the User
4 Topological Representation of Clustering Structure
4.1 From Point Data to a High-Dimensional Density Function
4.2 The Topology of the Density Function
4.3 Cluster Properties and Algorithm Parameters
5 Visualization of High-Dimensional Point Cloud Structure
5.1 Topological Landscape Metaphor
5.1.1 Atoll-Like Flattened Topological Landscape
5.2 Topological Landscape Profile
5.3 Feature Selection and Local Data Analysis
5.4 Parameter Widgets
6 Conclusion
References
Towards a Network Model of the Coreness of Texts: An Experiment in Classifying Latin Texts Using the TTLab Latin Tagger
1 Introduction
2 Processing Latin Texts with the TTLab Latin Tagger
2.1 Linguistic Rules
2.2 Statistical PoS-Tagging with Conditional Random Fields
2.3 Evaluation
3 Extending the Frankfurt Latin Lexicon (FLL)
4 From Tagging Latin Texts to Lexical Text Networks
4.1 Approaching Lexical Text Structures by Means of k-Cores
5 Experimentation
6 Conclusion
References
Part II Text Mining Applications
A Structuralist Approach for Personal Knowledge Exploration Systems on Mobile Devices
1 Introduction
1.1 The Structuralist Approach and Personal Data
1.2 Mobile Devices
2 Our Solution
2.1 Pledge for Additional "Language" Layers
3 Text Similarity Measurement
3.1 Evaluation Method
3.2 Experiments on News and Email Text Collections
3.3 (Unofficial) Semantic Text Similarity Experiments
4 Information Extraction
4.1 (Named) Entity Recognition
4.2 Integration of Personal Resources
4.2.1 Address Book
4.2.2 Exploiting the Personal Corpus
4.2.3 Combining Precomputed NER Models with Personal Models
5 Conclusions
References
Natural Language Processing Supporting Interoperability in Healthcare
1 Introduction
2 Methods
2.1 The Interoperability Challenge
2.2 The Language Challenge
2.3 The Natural Language Processing Challenge
3 Towards NLP in Healthcare
3.1 Round Trip
3.2 Phases of NLP
3.3 From Speech to Text Fields
3.4 From Text to Codes
3.5 From Codes to Structured Data
3.6 Importing Data
3.7 Data Exchange
3.8 Semantic Translations
4 Discussion
References
Deception Detection Within and Across Cultures
1 Introduction
1.1 Related Work
2 Datasets
3 Experiments
3.1 What is the Performance for Deception Classifiers Built for Different Cultures?
3.2 Can We Use Information Drawn from One Culture to Build a Deception Classifier in Another Culture?
3.3 What are the Psycholinguistic Classes Most Strongly Associated with Deception/Truth?
4 Deception Detection Using Short Sentences
5 Conclusions
References
Sentiment Analysis: What's Your Opinion?
1 Introduction
2 The Counterpart of Sentiment in Linguistics and Psychology
2.1 Subjectivity
2.1.1 The 'Private State'
2.1.2 Emotions and Their Reflection in Language
2.1.3 Intersubjectivity
2.2 Factuality
2.2.1 The Semantic Viewpoint: Evidentiality and Veridicity
2.2.2 Interpretation
3 Sentiment Analysis in Computational Linguistics
3.1 Resources: Lexicons and Corpora
3.2 Rule-Based Approaches
3.3 Aspect Analysis
3.4 Machine Learning Approaches
4 What Is Your Opinion, What Is Ours?
4.1 Terminology
4.2 Issues (1): Polarity and Lexicons
4.3 Issues (2): Context
5 Summary
References
Multi-perspective Event Detection in Texts Documenting the 1944 Battle of Arnhem
1 Introduction
2 Synthesizing Computational and Historical Research Practices
3 About MERIT
3.1 Proof of Concept Study: The Battle of Arnhem
3.2 Methodology
4 A Pilot Study
4.1 Step 1: Text Selection
4.2 Step 2a: Named Entity Recognition
4.3 Step 2b: Regular Expressions for Street Names
4.4 Step 3: Visualization of Relations Between Texts
5 Step 4: Information Processing
6 Conclusion
References
Towards a Historical Text Re-use Detection
1 Introduction
2 Data: Investigated Corpus and Initial Setup
3 Related Work
4 Algorithms: Text Re-use Techniques
5 Initial Setup
6 Results
6.1 Evaluation of Text Re-use Techniques for Paraphrase Detection
6.2 Extraction and Typing of Paradigmatic Relations
7 Further Work
8 Conclusion
References

Publication date (per publisher)	19 December 2014
Series	Theory and Applications of Natural Language Processing
Additional info	X, 238 p. 50 illus., 23 illus. in color.
Place of publication	Cham
Language	English
Subject area	Mathematics / Computer Science ► Computer Science
Keywords	Big Data • Corpus processing • Dictionary acquisition • Natural Language Processing • Text Mining
ISBN-10	3-319-12655-5 / 3319126555
ISBN-13	978-3-319-12655-5 / 9783319126555

PDF (watermark)
Size: 4.7 MB

DRM: Digital watermark
This eBook contains a digital watermark and is thereby personalized for you. If the eBook is improperly passed on to third parties, it can be traced back to the source.

File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly well suited to specialist books with columns, tables and figures. A PDF can be displayed on almost all devices but is of only limited suitability for small displays (smartphone, eReader).

System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You need a PDF viewer, e.g. Adobe Reader or Adobe Digital Editions.
eReader: This eBook can be read on (almost) all eBook readers. It is not compatible with the Amazon Kindle, however.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You need a PDF viewer, e.g. the free Adobe Digital Editions app.

Additional feature: Online reading
In addition to downloading it, you can also read this eBook online in your web browser.

Buying eBooks from abroad
For tax law reasons we can sell eBooks only within Germany and Switzerland. Regrettably, we cannot fulfill eBook orders from other countries.

Print edition

Book | Hardcover

CHF 179.70