Machine Learning Using R (eBook)
XXIII, 566 pages
Apress (publisher)
978-1-4842-2334-5 (ISBN)
This book is inspired by the Machine Learning Model Building Process Flow, which gives the reader the ability to understand an ML algorithm and apply the entire process of building an ML model from raw data.
This new paradigm of teaching machine learning will bring about a radical change in perception for many of those who think this subject is difficult to learn. Though the theory can look difficult, especially when heavy mathematics is involved, the seamless flow from the theoretical aspects to example-driven learning provided in this book makes it easy to connect the dots.
For every machine learning algorithm covered in this book, a 3-D approach of theory, case study, and practice will be given. And where appropriate, the mathematics will be explained through visualization in R.
All practical demonstrations will be explored in R, a powerful programming language and software environment for statistical computing and graphics. The various packages and methods available in R will be used to explain the topics. In the end, readers will learn some of the latest technological advancements in building a scalable machine learning model with Big Data.
Karthik Ramasubramanian works for one of the largest and fastest-growing technology unicorns in India, Hike Messenger. He brings the best of business analytics and data science experience to his role at Hike Messenger. In his 7 years of research and industry experience, he has worked on cross-industry data science problems in retail, e-commerce, and technology, developing and prototyping data-driven solutions. In his previous role at Snapdeal, one of the largest e-commerce retailers in India, he led core statistical modelling initiatives for customer growth and pricing analytics. Prior to Snapdeal, he was part of the central database team managing the data warehouses for global business applications of Reckitt Benckiser (RB). He has rich experience working with scalable machine learning solutions for industry, including sophisticated graph networks and self-learning neural networks. He holds a Master's in Theoretical Computer Science from PSG College of Technology, Anna University, and is a certified big data professional. He is passionate about teaching and mentoring future data scientists through different online and public forums. He enjoys writing poems in his leisure time and is an avid traveler.
All the images are available in color and hi-res as part of the code download.

What You'll Learn
Use the model building process flow
Apply theoretical aspects of machine learning
Review industry-based case studies
Understand ML algorithms using R
Build machine learning models using Apache Hadoop and Spark

Who This Book Is For
Data scientists, data science professionals, and researchers in academia who want to understand the nuances of machine learning approaches/algorithms along with ways to see them in practice using R. The book will also benefit readers who want to understand the technology behind implementing a scalable machine learning model using Apache Hadoop, Hive, Pig, and Spark.
Abhishek Singh is based in Ireland as a Data Scientist in the Advanced Data Science team at Prudential Financial Inc. He has 5 years of professional and academic experience in the data science field. At Deloitte Advisory, he led risk analytics initiatives for top US banks in their regulatory risk, credit risk, and balance sheet modelling requirements. In his current role, he is working on scalable machine learning algorithms for the Individual Life Insurance branch of Prudential. He was also a trainer at Deloitte Professional University, leading development initiatives for professionals in the areas of statistics, economics, financial risk, and data science tools (SAS and R). Abhishek holds a B.Tech. in Mathematics and Computing from the Indian Institute of Technology, Guwahati and an MBA from the Indian Institute of Management, Bangalore. He speaks at public events on data science and is working with leading universities to bring data science skills to graduates.
He also holds a Post Graduate Diploma in Cyber Law from NALSAR University. He enjoys cooking and photography in his free time.
Contents at a Glance 5
Contents 6
About the Authors 17
About the Technical Reviewer 19
Acknowledgments 20
Chapter 1: Introduction to Machine Learning and R 21
1.1 Understanding the Evolution 22
1.1.1 Statistical Learning 22
1.1.2 Machine Learning (ML) 23
1.1.3 Artificial Intelligence (AI) 23
1.1.4 Data Mining 24
1.1.5 Data Science 25
1.2 Probability and Statistics 26
1.2.1 Counting and Probability Definition 27
1.2.2 Events and Relationships 29
1.2.2.1 Independent Events 29
1.2.2.2 Conditional Independence 30
1.2.2.3 Bayes Theorem 30
1.2.3 Randomness, Probability, and Distributions 32
1.2.4 Confidence Interval and Hypothesis Testing 33
1.2.4.1 Confidence Interval 34
1.2.4.2 Hypothesis Testing 35
1.3 Getting Started with R 38
1.3.1 Basic Building Blocks 38
1.3.1.1 Calculations 38
1.3.1.2 Statistics with R 39
1.3.1.3 Packages 39
1.3.2 Data Structures in R 39
1.3.2.1 Vectors 40
1.3.2.2 List 40
1.3.2.3 Matrix 40
1.3.2.4 Data Frame 41
1.3.3 Subsetting 41
1.3.3.1 Vectors 41
1.3.3.2 Lists 42
1.3.3.3 Matrices 42
1.3.3.4 Data Frames 43
1.3.4 Functions and Apply Family 43
1.4 Machine Learning Process Flow 46
1.4.1 Plan 46
1.4.2 Explore 46
1.4.3 Build 47
1.4.4 Evaluate 47
1.5 Other Technologies 48
1.6 Summary 48
1.7 References 48
Chapter 2: Data Preparation and Exploration 50
2.1 Planning the Gathering of Data 51
2.1.1 Variable Types 51
2.1.1.1 Categorical Variables 51
2.1.1.2 Continuous Variables 52
2.1.2 Data Formats 52
2.1.2.1 Comma-Separated Values 53
2.1.2.2 Microsoft Excel 53
2.1.2.3 Extensible Markup Language: XML 53
2.1.2.4 Hypertext Markup Language: HTML 55
2.1.2.5 JSON 57
2.1.2.6 Other Formats 59
2.1.3 Data Sources 59
2.1.3.1 Structured 59
2.1.3.2 Semi-Structured 59
2.1.3.3 Unstructured 59
2.2 Initial Data Analysis (IDA) 60
2.2.1 Discerning a First Look 60
2.2.1.1 Function str() 60
2.2.1.2 Naming Convention: make.names() 61
2.2.1.3 Table(): Pattern or Trend 62
2.2.2 Organizing Multiple Sources of Data into One 62
2.2.2.1 Merge and dplyr Joins 62
2.2.2.1.1 Using merge 63
2.2.2.1.2 dplyr 64
2.2.3 Cleaning the Data 65
2.2.3.1 Correcting Factor Variables 65
2.2.3.2 Dealing with NAs 66
2.2.3.3 Dealing with Dates and Times 67
2.2.3.3.1 Time Zone 68
2.2.3.3.2 Daylight Saving Time 68
2.2.4 Supplementing with More Information 68
2.2.4.1 Derived Variables 69
2.2.4.2 n-day Averages 69
2.2.5 Reshaping 69
2.3 Exploratory Data Analysis 70
2.3.1 Summary Statistics 71
2.3.1.1 Quantile 71
2.3.1.2 Mean 72
2.3.1.3 Frequency Plot 73
2.3.1.4 Boxplot 73
2.3.2 Moment 74
2.3.2.1 Variance 75
2.3.2.2 Skewness 76
2.3.2.3 Kurtosis 78
2.4 Case Study: Credit Card Fraud 80
2.4.1 Data Import 80
2.4.2 Data Transformation 81
2.4.3 Data Exploration 82
2.5 Summary 84
2.6 References 84
Chapter 3: Sampling and Resampling Techniques 85
3.1 Introduction to Sampling 86
3.2 Sampling Terminology 87
3.2.1 Sample 87
3.2.2 Sampling Distribution 88
3.2.3 Population Mean and Variance 88
3.2.4 Sample Mean and Variance 88
3.2.5 Pooled Mean and Variance 88
3.2.6 Sample Point 89
3.2.7 Sampling Error 89
3.2.8 Sampling Fraction 90
3.2.9 Sampling Bias 90
3.2.10 Sampling Without Replacement (SWOR) 90
3.2.11 Sampling with Replacement (SWR) 90
3.3 Credit Card Fraud: Population Statistics 91
3.3.1 Data Description 91
3.3.2 Population Mean 92
3.3.3 Population Variance 92
3.3.4 Pooled Mean and Variance 93
3.4 Business Implications of Sampling 96
3.4.1 Features of Sampling 97
3.4.2 Shortcomings of Sampling 97
3.5 Probability and Non-Probability Sampling 97
3.5.1 Types of Non-Probability Sampling 98
3.5.1.1 Convenience Sampling 98
3.5.1.2 Purposive Sampling 99
3.5.1.3 Quota Sampling 99
3.6 Statistical Theory on Sampling Distributions 99
3.6.1 Law of Large Numbers: LLN 99
3.6.1.1 Weak Law of Large Numbers 100
3.6.1.2 Strong Law of Large Numbers 100
3.6.1.3 Steps in Simulation with R Code 101
3.6.2 Central Limit Theorem 103
3.6.2.1 Steps in Simulation with R Code 103
3.7 Probability Sampling Techniques 107
3.7.1 Population Statistics 107
3.7.2 Simple Random Sampling 111
3.7.3 Systematic Random Sampling 118
3.7.4 Stratified Random Sampling 122
3.7.5 Cluster Sampling 129
3.7.6 Bootstrap Sampling 135
3.8 Monte Carlo Method: Acceptance-Rejection Method 142
3.9 A Qualitative Account of Computational Savings by Sampling 144
3.10 Summary 145
Chapter 4: Data Visualization in R 146
4.1 Introduction to the ggplot2 Package 147
4.2 World Development Indicators 149
4.3 Line Chart 149
4.4 Stacked Column Charts 155
4.5 Scatterplots 161
4.6 Boxplots 162
4.7 Histograms and Density Plots 165
4.8 Pie Charts 169
4.9 Correlation Plots 171
4.10 HeatMaps 173
4.11 Bubble Charts 175
4.12 Waterfall Charts 179
4.13 Dendrogram 182
4.14 Wordclouds 184
4.15 Sankey Plots 186
4.16 Time Series Graphs 187
4.17 Cohort Diagrams 189
4.18 Spatial Maps 191
4.19 Summary 195
4.20 References 196
Chapter 5: Feature Engineering 197
5.1 Introduction to Feature Engineering 198
5.1.1 Filter Methods 200
5.1.2 Wrapper Methods 200
5.1.3 Embedded Methods 200
5.2 Understanding the Working Data 201
5.2.1 Data Summary 202
5.2.2 Properties of Dependent Variable 202
5.2.3 Features Availability: Continuous or Categorical 205
5.2.4 Setting Up Data Assumptions 207
5.3 Feature Ranking 207
5.4 Variable Subset Selection 211
5.4.1 Filter Method 211
5.4.2 Wrapper Methods 215
5.4.3 Embedded Methods 222
5.5 Dimensionality Reduction 226
5.6 Feature Engineering Checklist 231
5.7 Summary 233
5.8 References 233
Chapter 6: Machine Learning Theory and Practices 234
6.1 Machine Learning Types 237
6.1.1 Supervised Learning 237
6.1.2 Unsupervised Learning 238
6.1.3 Semi-Supervised Learning 238
6.1.4 Reinforcement Learning 238
6.2 Groups of Machine Learning Algorithms 239
6.3 Real-World Datasets 244
6.3.1 House Sale Prices 244
6.3.2 Purchase Preference 245
6.3.3 Twitter Feeds and Article 246
6.3.4 Breast Cancer 246
6.3.5 Market Basket 247
6.3.6 Amazon Food Review 247
6.4 Regression Analysis 248
6.5 Correlation Analysis 250
6.5.1 Linear Regression 253
6.5.1.2 Best Linear Predictors 254
6.5.2 Simple Linear Regression 256
6.5.3 Multiple Linear Regression 259
6.5.4 Model Diagnostics: Linear Regression 262
6.5.4.1 Influential Point Analysis 263
6.5.4.2 Normality of Residuals 267
6.5.4.3 Multicollinearity 269
6.5.4.4 Residual Autocorrelation 271
6.5.4.5 Homoscedasticity 273
6.5.5 Polynomial Regression 276
6.5.6 Logistic Regression 280
6.5.7 Logit Transformation 281
6.5.8 Odds Ratio 282
6.5.8.1 Binomial Logistic Model 284
6.5.9 Model Diagnostics: Logistic Regression 290
6.5.9.1 Wald Test 290
6.5.9.2 Deviance 291
6.5.9.3 Pseudo R-Square 292
6.5.9.4 Bivariate Plots 293
6.5.9.5 Cumulative Gains and Lift Charts 296
6.5.9.6 Concordance and Discordant Ratios 299
6.5.10 Multinomial Logistic Regression 300
6.5.11 Generalized Linear Models 304
6.5.12 Conclusion 305
6.6 Support Vector Machine (SVM) 305
6.6.1 Linear SVM 307
6.6.1.1 Hard Margins 307
6.6.1.2 Soft Margins 307
6.6.2 Binary SVM Classifier 308
6.6.3 Multi-Class SVM 310
6.6.4 Conclusion 312
6.7 Decision Trees 312
6.7.1 Types of Decision Trees 313
6.7.1.1 Regression Trees 314
6.7.1.2 Classification Tree 315
6.7.2 Decision Measures 315
6.7.2.1 Gini Index 315
6.7.2.2 Entropy 316
6.7.2.3 Information Gain 317
6.7.3 Decision Tree Learning Methods 317
6.7.3.1 Iterative Dichotomizer 3 319
6.7.3.2 C5.0 Algorithm 322
6.7.3.3 Classification and Regression Tree: CART 327
6.7.3.4 Chi-Square Automated Interaction Detection: CHAID 330
6.7.4 Ensemble Trees 336
6.7.4.1 Boosting 336
6.7.4.2 Bagging 338
6.7.4.2.1 Bagging CART 339
6.7.4.2.2 Random Forest 341
6.7.5 Conclusion 344
6.8 The Naive Bayes Method 345
6.8.1 Conditional Probability 345
6.8.2 Bayes Theorem 345
6.8.3 Prior Probability 346
6.8.4 Posterior Probability 346
6.8.5 Likelihood and Marginal Likelihood 346
6.8.6 Naive Bayes Methods 347
6.8.7 Conclusion 352
6.9 Cluster Analysis 352
6.9.1 Introduction to Clustering 353
6.9.2 Clustering Algorithms 354
6.9.2.1 Hierarchical Clustering 356
6.9.2.2 Centroid-Based Clustering 359
6.9.2.3 Distribution-Based Clustering 362
6.9.2.4 Density-Based Clustering 364
6.9.3 Internal Evaluation 366
6.9.3.1 Dunn Index 366
6.9.3.2 Silhouette Coefficient 367
6.9.4 External Evaluation 368
6.9.4.1 Rand Measure 368
6.9.4.2 Jaccard Index 369
6.9.5 Conclusion 369
6.10 Association Rule Mining 369
6.10.1 Introduction to Association Concepts 370
6.10.1.1 Support 370
6.10.1.2 Confidence 371
6.10.1.3 Lift 371
6.10.2 Rule-Mining Algorithms 372
6.10.2.1 Apriori 375
6.10.2.2 Eclat 377
6.10.3 Recommendation Algorithms 379
6.10.3.1 User-Based Collaborative Filtering (UBCF) 380
6.10.3.2 Item-Based Collaborative Filtering (IBCF) 381
6.10.4 Conclusion 387
6.11 Artificial Neural Networks 387
6.11.1 Human Cognitive Learning 387
6.11.2 Perceptron 389
6.11.3 Sigmoid Neuron 392
6.11.4 Neural Network Architecture 392
6.11.5 Supervised versus Unsupervised Neural Nets 394
6.11.6 Neural Network Learning Algorithms 395
6.11.6.1 Evolutionary Methods 396
6.11.6.2 Gene Expression Programming 396
6.11.6.3 Simulated Annealing 396
6.11.6.4 Expectation Maximization 397
6.11.6.5 Non-Parametric Methods 397
6.11.6.6 Particle Swarm Optimization 397
6.11.7 Feed-Forward Back-Propagation 397
6.11.7.1 Purchase Prediction: Neural Network-Based Classification 399
6.11.8 Deep Learning 404
6.11.9 Conclusion 411
6.12 Text-Mining Approaches 411
6.12.1 Introduction to Text Mining 412
6.12.2 Text Summarization 413
6.12.3 TF-IDF 415
6.12.4 Part-of-Speech (POS) Tagging 417
6.12.5 Word Cloud 421
6.12.6 Text Analysis: Microsoft Cognitive Services 422
6.12.7 Conclusion 432
6.13 Online Machine Learning Algorithms 432
6.13.1 Fuzzy C-Means Clustering 434
6.13.2 Conclusion 437
6.14 Model Building Checklist 437
6.15 Summary 438
6.16 References 438
Chapter 7: Machine Learning Model Evaluation 440
7.1 Dataset 441
7.1.1 House Sale Prices 441
7.1.2 Purchase Preference 443
7.2 Introduction to Model Performance and Evaluation 445
7.3 Objectives of Model Performance Evaluation 446
7.4 Population Stability Index 447
7.5 Model Evaluation for Continuous Output 452
7.5.1 Mean Absolute Error 454
7.5.2 Root Mean Square Error 456
7.5.3 R-Square 457
7.6 Model Evaluation for Discrete Output 460
7.6.1 Classification Matrix 461
7.6.2 Sensitivity and Specificity 466
7.6.3 Area Under ROC Curve 467
7.7 Probabilistic Techniques 470
7.7.1 K-Fold Cross Validation 471
7.7.2 Bootstrap Sampling 473
7.8 The Kappa Error Metric 474
7.9 Summary 478
7.10 References 479
Chapter 8: Model Performance Improvement 480
8.1 Machine Learning and Statistical Modeling 481
8.2 Overview of the Caret Package 483
8.3 Introduction to Hyper-Parameters 485
8.4 Hyper-Parameter Optimization 489
8.4.1 Manual Search 490
8.4.2 Manual Grid Search 492
8.4.3 Automatic Grid Search 494
8.4.4 Optimal Search 496
8.4.5 Random Search 498
8.4.6 Custom Searching 500
8.5 The Bias and Variance Tradeoff 503
8.5.1 Bagging or Bootstrap Aggregation 507
8.5.2 Boosting 508
8.6 Introduction to Ensemble Learning 508
8.6.1 Voting Ensembles 509
8.6.2 Advanced Methods in Ensemble Learning 510
8.6.2.1 Bagging 510
8.6.2.2 Boosting 512
8.7 Ensemble Techniques Illustration in R 513
8.7.1 Bagging Trees 513
8.7.2 Gradient Boosting with a Decision Tree 515
8.7.3 Blending KNN and Rpart 520
8.7.4 Stacking Using caretEnsemble 521
8.8 Advanced Topic: Bayesian Optimization of Machine Learning Models 526
8.9 Summary 531
8.10 References 532
Chapter 9: Scalable Machine Learning and Related Technologies 533
9.1 Distributed Processing and Storage 534
9.1.1 Google File System (GFS) 534
9.1.2 MapReduce 536
9.1.3 Parallel Execution in R 537
9.1.3.1 Setting the Cores 537
9.1.3.2 Problem Statement 538
9.1.3.3 Building the Model: Serial 539
9.1.3.4 Building the Model: Parallel 539
9.1.3.5 Stopping the Clusters 540
9.2 The Hadoop Ecosystem 540
9.2.1 MapReduce 541
9.2.1.1 MapReduce Example: Word Count 541
9.2.2 Hive 545
9.2.2.1 Creating Tables 546
9.2.2.2 Describing Tables 546
9.2.2.3 Generating Data and Storing it in a Local File 547
9.2.2.4 Loading the Data into the Hive Table 547
9.2.2.5 Selecting a Query 548
9.2.3 Apache Pig 549
9.2.3.1 Connecting to Pig 549
9.2.3.2 Loading the Data 550
9.2.3.3 Tokenizing Each Line 550
9.2.3.4 Flattening the Tokens 551
9.2.3.5 Grouping the Words 551
9.2.3.6 Counting and Sorting 552
9.2.4 HBase 552
9.2.4.1 Starting HBase 553
9.2.4.2 Creating the Table and Putting Data 553
9.2.4.3 Scanning the Data 554
9.2.5 Spark 554
9.3 Machine Learning in R with Spark 555
9.3.1 Setting the Environment Variable 556
9.3.2 Initializing the Spark Session 556
9.3.3 Loading the Data and Running the Pre-Process 556
9.3.4 Creating SparkDataFrame 557
9.3.5 Building the ML Model 558
9.3.6 Predicting the Test Data 559
9.3.7 Stopping the SparkR Session 560
9.4 Machine Learning in R with H2O 560
9.4.1 Installation of Packages 561
9.4.2 Initialization of H2O Clusters 561
9.4.3 Deep Learning Demo in R with H2O 562
9.4.3.1 Running the Demo 563
9.4.3.2 Loading the Testing Data 563
9.5 Summary 567
9.6 References 568
Index 569
Publication date (per publisher) | 22.12.2016
Additional information | XXIII, 566 p. 209 illus., 155 illus. in color.
Place of publication | Berkeley
Language | English
Subject area | Mathematics / Computer Science ► Computer Science ► Databases
Mathematics / Computer Science ► Computer Science ► Programming Languages / Tools
Mathematics / Computer Science ► Computer Science ► Software Development
Computer Science ► Theory / Study ► Artificial Intelligence / Robotics
Keywords | Data Exploration • Data Visualization • feature engineering • machine learning • Machine Learning Models • Sampling Techniques • scalable machine learning
ISBN-10 | 1-4842-2334-9 / 1484223349
ISBN-13 | 978-1-4842-2334-5 / 9781484223345
Size: 11.9 MB
DRM: Digital watermark
This eBook contains a digital watermark and is therefore personalized for you. If the eBook is improperly passed on to third parties, it can be traced back to the source.
File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly well suited to reference books with columns, tables, and figures. A PDF can be displayed on almost all devices, but is only suitable to a limited extent for small displays (smartphone, eReader).
System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You will need a PDF viewer, e.g. Adobe Reader or Adobe Digital Editions.
eReader: This eBook can be read on (almost) all eBook readers. However, it is not compatible with the Amazon Kindle.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You will need a PDF viewer, e.g. the free Adobe Digital Editions app.
Additional feature: Online reading
In addition to downloading this eBook, you can also read it online in your web browser.
Buying eBooks from abroad
For tax law reasons we can sell eBooks just within Germany and Switzerland. Regrettably we cannot fulfill eBook-orders from other countries.