Text Analytics with Python - Dipanjan Sarkar

Text Analytics with Python (eBook)

A Practitioner's Guide to Natural Language Processing

Dipanjan Sarkar (Author)

eBook Download: PDF

2019 | 2nd ed.
XXIV, 674 pages
Apress (Publisher)
978-1-4842-4354-1 (ISBN)

You'll see how to use the latest state-of-the-art frameworks in NLP, coupled with machine learning and deep learning models for supervised sentiment analysis powered by Python to solve actual case studies. Start by reviewing Python for NLP fundamentals on strings and text data and move on to engineering representation methods for text data, including both traditional statistical models and newer deep learning-based embedding models. Improved techniques and new methods around parsing and processing text are discussed as well.

Text summarization and topic models have been overhauled, so the book showcases how to build, tune, and interpret topic models in the context of an interesting dataset on NIPS conference papers. Additionally, the book covers text similarity techniques with a real-world example of movie recommenders, along with sentiment analysis using supervised and unsupervised techniques.

There is also a chapter dedicated to semantic analysis where you'll see how to build your own named entity recognition (NER) system from scratch. While the overall structure of the book remains the same, the entire code base, modules, and chapters have been updated to the latest Python 3.x release.

What You'll Learn

• Understand NLP and text syntax, semantics and structure

• Discover text cleaning and feature engineering

• Review text classification and text clustering

• Assess text summarization and topic models

• Study deep learning for NLP

Who This Book Is For

IT professionals, data analysts, developers, linguistic experts, data scientists, engineers, and anyone else with a keen interest in linguistics, analytics, and generating insights from textual data.

Dipanjan Sarkar is a Data Scientist at Intel, the world's largest silicon company, which is on a mission to make the world more connected and productive. He primarily works on analytics, business intelligence, application development, and building large-scale intelligent systems. He received his master's degree in Information Technology from the International Institute of Information Technology, Bangalore, with a focus on data science and software engineering. He is also an avid supporter of self-learning, especially Massive Open Online Courses, and holds a Data Science Specialization from Johns Hopkins University on Coursera.

He has been an analytics practitioner for over six years, specializing in statistical, predictive, and text analytics. He has also authored books on R and machine learning, occasionally reviews technical books, and acts as a course beta tester for Coursera. Dipanjan's interests include learning about new technology, financial markets, disruptive start-ups, data science and, more recently, artificial intelligence and deep learning. In his spare time he loves reading, gaming, and watching popular sitcoms and football.

Leverage Natural Language Processing (NLP) in Python and learn how to set up your own robust environment for performing text analytics. This second edition has gone through a major revamp and introduces several significant changes and new topics based on the recent trends in NLP.

Table of Contents
About the Author
About the Technical Reviewer
Foreword
Acknowledgments
Introduction
Chapter 1: Natural Language Processing Basics
Natural Language
What Is Natural Language?
The Philosophy of Language
Language Acquisition and Usage
Language Acquisition and Cognitive Learning
Language Usage
Linguistics
Language Syntax and Structure
Words
Phrases
Clauses
Grammar
Dependency Grammar
Constituency Grammar
Word-Order Typology
Language Semantics
Lexical Semantic Relations
Lemmas and Wordforms
Homonyms, Homographs, and Homophones
Heteronyms and Heterographs
Polysemes
Capitonyms
Synonyms and Antonyms
Hyponyms and Hypernyms
Semantic Networks and Models
Representation of Semantics
Propositional Logic
First Order Logic
Text Corpora
Corpora Annotation and Utilities
Popular Corpora
Accessing Text Corpora
Accessing the Brown Corpus
Accessing the Reuters Corpus
Accessing the WordNet Corpus
Natural Language Processing
Machine Translation
Speech Recognition Systems
Question Answering Systems
Contextual Recognition and Resolution
Text Summarization
Text Categorization
Text Analytics
Machine Learning
Deep Learning
Summary
Chapter 2: Python for Natural Language Processing
Getting to Know Python
The Zen of Python
Applications: When Should You Use Python?
Drawbacks: When Should You Not Use Python?
Python Implementations and Versions
Setting Up a Robust Python Environment
Which Python Version?
Which Operating System?
Integrated Development Environments
Environment Setup
Package Management
Virtual Environments
Python Syntax and Structure
Working with Text Data
String Literals
Representing Strings
String Operations and Methods
Basic Operations
Indexing and Slicing
Methods
Formatting
Regular Expressions
Basic Text Processing and Analysis: Putting It All Together
Natural Language Processing Frameworks
Summary
Chapter 3: Processing and Understanding Text
Text Preprocessing and Wrangling
Removing HTML Tags
Text Tokenization
Sentence Tokenization
Default Sentence Tokenizer
Pretrained Sentence Tokenizer Models
PunktSentenceTokenizer
RegexpTokenizer
Word Tokenization
Default Word Tokenizer
TreebankWordTokenizer
TokTokTokenizer
RegexpTokenizer
Inherited Tokenizers from RegexpTokenizer
Building Robust Tokenizers with NLTK and spaCy
Removing Accented Characters
Expanding Contractions
Removing Special Characters
Case Conversions
Text Correction
Correcting Repeating Characters
Correcting Spellings
Stemming
Lemmatization
Removing Stopwords
Bringing It All Together: Building a Text Normalizer
Understanding Text Syntax and Structure
Installing Necessary Dependencies
Important Machine Learning Concepts
Parts of Speech Tagging
Building POS Taggers
Shallow Parsing or Chunking
Building Shallow Parsers
Dependency Parsing
Building Dependency Parsers
Constituency Parsing
Building Constituency Parsers
Summary
Chapter 4: Feature Engineering for Text Representation
Understanding Text Data
Building a Text Corpus
Preprocessing Our Text Corpus
Traditional Feature Engineering Models
Bag of Words Model
Bag of N-Grams Model
TF-IDF Model
Using TfidfTransformer
Using TfidfVectorizer
Understanding the TF-IDF Model
Extracting Features for New Documents
Document Similarity
Document Clustering with Similarity Features
Topic Models
Advanced Feature Engineering Models
Loading the Bible Corpus
Word2Vec Model
The Continuous Bag of Words (CBOW) Model
Implementing the Continuous Bag of Words (CBOW) Model
Build the Corpus Vocabulary
Build a CBOW (Context, Target) Generator
Build the CBOW Model Architecture
Train the Model
Get Word Embeddings
The Skip-Gram Model
Implementing the Skip-Gram Model
Build the Corpus Vocabulary
Build a Skip-Gram [(target, context), relevancy] Generator
Build the Skip-Gram Model Architecture
Train the Model
Get Word Embeddings
Robust Word2Vec Models with Gensim
Applying Word2Vec Features for Machine Learning Tasks
Strategy for Getting Document Embeddings
The GloVe Model
Applying GloVe Features for Machine Learning Tasks
The FastText Model
Applying FastText Features to Machine Learning Tasks
Summary
Chapter 5: Text Classification
What Is Text Classification?
Formal Definition
Major Text Classification Variants
Automated Text Classification
Formal Definition
Text Classification Task Variants
Text Classification Blueprint
Data Retrieval
Data Preprocessing and Normalization
Building Train and Test Datasets
Feature Engineering Techniques
Traditional Feature Engineering Models
Advanced Feature Engineering Models
Classification Models
Multinomial Naïve Bayes
Logistic Regression
Support Vector Machines
Ensemble Models
Random Forest
Gradient Boosting Machines
Evaluating Classification Models
Confusion Matrix
Understanding the Confusion Matrix
Performance Metrics
Building and Evaluating Our Text Classifier
Bag of Words Features with Classification Models
TF-IDF Features with Classification Models
Comparative Model Performance Evaluation
Word2Vec Embeddings with Classification Models
GloVe Embeddings with Classification Models
FastText Embeddings with Classification Models
Model Tuning
Model Performance Evaluation
Applications
Summary
Chapter 6: Text Summarization and Topic Models
Text Summarization and Information Extraction
Keyphrase Extraction
Topic Modeling
Automated Document Summarization
Important Concepts
Keyphrase Extraction
Collocations
Weighted Tag-Based Phrase Extraction
Topic Modeling
Topic Modeling on Research Papers
The Main Objective
Data Retrieval
Load and View Dataset
Basic Text Wrangling
Topic Models with Gensim
Text Representation with Feature Engineering
Latent Semantic Indexing
Implementing LSI Topic Models from Scratch
Latent Dirichlet Allocation
LDA Models with MALLET
LDA Tuning: Finding the Optimal Number of Topics
Interpreting Topic Model Results
Dominant Topics Distribution Across Corpus
Dominant Topics in Specific Research Papers
Relevant Research Papers per Topic Based on Dominance
Predicting Topics for New Research Papers
Topic Models with Scikit-Learn
Text Representation with Feature Engineering
Latent Semantic Indexing
Latent Dirichlet Allocation
Non-Negative Matrix Factorization
Predicting Topics for New Research Papers
Visualizing Topic Models
Automated Document Summarization
Text Wrangling
Text Representation with Feature Engineering
Latent Semantic Analysis
TextRank
Summary
Chapter 7: Text Similarity and Clustering
Essential Concepts
Information Retrieval (IR)
Feature Engineering
Similarity Measures
Unsupervised Machine Learning Algorithms
Text Similarity
Analyzing Term Similarity
Hamming Distance
Manhattan Distance
Euclidean Distance
Levenshtein Edit Distance
Cosine Distance and Similarity
Analyzing Document Similarity
Building a Movie Recommender
Load and View Dataset
Text Preprocessing
Extract TF-IDF Features
Cosine Similarity for Pairwise Document Similarity
Find Top Similar Movies for a Sample Movie
Find Movie ID
Get Movie Similarities
Get Top Five Similar Movie IDs
Get Top Five Similar Movies
Build a Movie Recommender
Get a List of Popular Movies
Okapi BM25 Ranking for Pairwise Document Similarity
Document Clustering
Clustering Movies
Feature Engineering
K-Means Clustering
Affinity Propagation
Ward's Agglomerative Hierarchical Clustering
Summary
Chapter 8: Semantic Analysis
Semantic Analysis
Exploring WordNet
Understanding Synsets
Analyzing Lexical Semantic Relationships
Entailments
Homonyms and Homographs
Synonyms and Antonyms
Hyponyms and Hypernyms
Holonyms and Meronyms
Semantic Relationships and Similarity
Word Sense Disambiguation
Named Entity Recognition
Building an NER Tagger from Scratch
Building an End-to-End NER Tagger with Our Trained NER Model
Analyzing Semantic Representations
Propositional Logic
First Order Logic
Summary
Chapter 9: Sentiment Analysis
Problem Statement
Setting Up Dependencies
Getting the Data
Text Preprocessing and Normalization
Unsupervised Lexicon-Based Models
Bing Liu's Lexicon
MPQA Subjectivity Lexicon
Pattern Lexicon
TextBlob Lexicon
AFINN Lexicon
SentiWordNet Lexicon
VADER Lexicon
Classifying Sentiment with Supervised Learning
Traditional Supervised Machine Learning Models
Newer Supervised Deep Learning Models
Advanced Supervised Deep Learning Models
Analyzing Sentiment Causation
Interpreting Predictive Models
Analyzing Topic Models
Summary
Chapter 10: The Promise of Deep Learning
Why Are We Crazy for Embeddings?
Trends in Word-Embedding Models
Trends in Universal Sentence-Embedding Models
Understanding Our Text Classification Problem
Universal Sentence Embeddings in Action
Load Up Dependencies
Load and View the Dataset
Building Train, Validation, and Test Datasets
Basic Text Wrangling
Build Data Ingestion Functions
Build Deep Learning Model with Universal Sentence Encoder
Model Training
Model Evaluation
Bonus: Transfer Learning with Different Universal Sentence Embeddings
Summary and Future Scope
Index

Publication date (per publisher)	21 May 2019
Additional info	XXIV, 674 p. 189 illus.
Language	English
Subject areas	Mathematics / Computer Science ► Computer Science ► Databases
	Mathematics / Computer Science ► Computer Science ► Programming Languages / Tools
	Computer Science ► Theory / Studies ► Artificial Intelligence / Robotics
	Mathematics / Computer Science ► Computer Science ► Web / Internet
Keywords	Deep Learning in Text Analysis • Natural Language Basics • Python • sentiment analysis • text classification • Text Clustering • Text Mining
ISBN-10	1-4842-4354-4 / 1484243544
ISBN-13	978-1-4842-4354-1 / 9781484243541

PDF (Watermark)
Size: 17.8 MB

DRM: Digital watermark
This eBook contains a digital watermark and is therefore personalized to you. If the eBook is improperly passed on to third parties, it can be traced back to the source.

File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly well suited to technical books with columns, tables, and figures. A PDF can be displayed on almost any device but is only of limited use on small displays (smartphone, eReader).

System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You need a PDF viewer, e.g. Adobe Reader or Adobe Digital Editions.
eReader: This eBook can be read on (almost) all eBook readers. However, it is not compatible with the Amazon Kindle.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You need a PDF viewer, e.g. the free Adobe Digital Editions app.

Additional feature: Read online
In addition to downloading it, you can also read this eBook online in your web browser.

Buying eBooks from abroad
For tax law reasons, we can sell eBooks only within Germany and Switzerland. Regrettably, we cannot fulfill eBook orders from other countries.

Print edition

Book | Softcover

CHF 67,35