Learn the fundamental aspects of business statistics, data mining, and machine learning techniques required to understand the huge amount of data generated by your organization. This book explains practical business analytics through examples, covers the steps involved in using it correctly, and shows you the contexts in which a particular technique does not make sense. Further, Practical Business Analytics using R helps you understand specific issues faced by organizations and how business analytics can facilitate solutions to these issues.
This book will discuss and explore the following through examples and case studies:
- An introduction to R: data management and R functions
- The architecture, framework, and life cycle of a business analytics project
- Descriptive analytics using R: descriptive statistics and data cleaning
- Data mining: classification, association rules, and clustering
- Predictive analytics: simple regression, multiple regression, and logistic regression
This book includes case studies on important business analytics techniques, such as classification, association, clustering, and regression. The R language is the statistical tool used to demonstrate the concepts throughout the book.
What You Will Learn
• Write R programs to handle data
• Build analytical models and draw useful inferences from them
• Discover the basic concepts of data mining and machine learning
• Carry out predictive modeling
• Define a business issue as an analytical problem
Who This Book Is For
Beginners who want to understand and learn the fundamentals of analytics using R. Students, managers, executives, strategy and planning professionals, software professionals, and BI/DW professionals.
Umesh R. Hodeghatta, Ph.D.
Dr. Umesh Rao Hodeghatta is an acclaimed professional in the field of machine learning, NLP, and business analytics. He holds a master's degree in EE from Oklahoma State University, USA, and a Ph.D. from the Indian Institute of Technology (IIT), Kharagpur, with a specialization in machine learning and NLP. Dr. Hodeghatta is currently working as a data scientist in the United States, serving multiple clients. He has more than 20 years of work experience and has held technical and senior management positions at XIM-Bhubaneswar, McAfee, Cisco Systems, and AT&T Bell Laboratories, USA. He recently established the IBM Big Data Analytics Lab and the HP Research Lab at Xavier University. Dr. Hodeghatta has published many articles in international journals and conference proceedings; among the better known are 'Understanding Twitter as e-WOM', 'Sentiment Analysis of Hollywood Movies on Twitter', and 'PCI DSS - Penalty of not being Compliant'. In addition, he has authored a book titled 'The InfoSec Handbook: An Introduction to Information Security', published by Springer Apress, USA. Dr. Hodeghatta has contributed his services to many professional organizations and regulatory bodies. He was an Executive Committee member of the IEEE Computer Society (India); an academic advisory member for the Information and Security Audit Association (ISACA), USA; IT advisor for the government of Odisha, India; Technical Advisory Member of the International Neural Network Society (INNS) India; and an advisory member of the Task Force on Business Intelligence & Knowledge Management. Owing to these achievements, he was listed in 'World's Who's Who' for the years 2012, 2013, 2014, and 2015, published by Marquis Who's Who, USA. He is also a senior member of the IEEE, USA. Further details about Dr. Hodeghatta are available at http://www.mytechnospeak.com
Umesha Nayak is a director and principal consultant of MUSA Software Engineering Pvt. Ltd., which focuses on systems, process, and management consulting. He has 33 years of experience, of which 12 years are in providing consulting to IT, manufacturing, and other organizations across the globe. He holds a Master of Science in Software Systems and a Master of Arts in Economics, and his credentials include CAIIB; Certified Information Systems Auditor (CISA) and Certified in Risk and Information Systems Control (CRISC) from ISACA, US; PGDFM; Certified Ethical Hacker from EC Council; Certified Lead Auditor for many standards; and Certified Coach, among others. He has worked extensively in banking, software development, product design and development, project management, program management, information technology audits, information application audits, quality assurance, coaching, product reliability, human resource management, and consultancy. He was Vice President and Corporate Executive Council member at Polaris Software Lab, Chennai, prior to his current assignment. He also held various roles at Polaris Software Lab, such as Head of Quality, Head of SEPG, and Head of Strategic Practice Unit - Risks & Treasury. He started his journey with computers in 1981 with ICL mainframes and continued with minis and PCs. He was one of the founding members of information systems auditing in the banking industry in India. He has guided many organizations through successful ISO 9001, ISO 27001, CMMI, and other certifications and process/product improvements. He coauthored the book 'The InfoSec Handbook: An Introduction to Information Security', published by Apress Open.
Contents at a Glance
Contents
About the Authors
About the Technical Reviewer
Chapter 1: Overview of Business Analytics
1.1 Objectives of This Book
1.2 Confusing Terminology
1.3 Drivers for Business Analytics
1.3.1 Growth of Computer Packages and Applications
1.3.2 Feasibility to Consolidate Data from Various Sources
1.3.3 Growth of Infinite Storage and Computing Capability
1.3.4 Easy-to-Use Programming Tools and Platforms
1.3.5 Survival and Growth in the Highly Competitive World
1.3.6 Business Complexity Growing out of Globalization
1.4 Applications of Business Analytics
1.4.1 Marketing and Sales
1.4.2 Human Resources
1.4.3 Product Design
1.4.4 Service Design
1.4.5 Customer Service and Support Areas
1.5 Skills Required for a Business Analyst
1.5.1 Understanding the Business and Business Problems
1.5.2 Understanding Data Analysis Techniques and Algorithms
1.5.3 Having Good Computer Programming Knowledge
1.5.4 Understanding Data Structures and Data Storage/Warehousing Techniques
1.5.5 Knowing Relevant Statistical and Mathematical Concepts
1.6 Life Cycle of a Business Analytics Project
1.7 The Framework for Business Analytics
1.8 Summary
Chapter 2: Introduction to R
2.1 Data Analysis Tools
2.2 R Installation
2.2.1 Installing R
2.2.2 Installing RStudio
2.2.3 Exploring the RStudio Interface
2.3 Basics of R Programming
2.3.1 Assigning Values
2.3.2 Creating Vectors
2.4 R Object Types
2.5 Data Structures in R
2.5.1 Matrices
2.5.2 Arrays
2.5.3 Data Frames
2.5.4 Lists
2.5.5 Factors
2.6 Summary
Chapter 3: R for Data Analysis
3.1 Reading and Writing Data
3.1.1 Reading Data from a Text File
3.1.2 Reading Data from a Microsoft Excel File
3.1.3 Reading Data from the Web
3.2 Using Control Structures in R
3.2.1 if-else
3.2.2 for loops
3.2.3 while loops
3.2.4 Looping Functions
3.2.4.1 apply( )
3.2.4.2 lapply( )
3.2.4.3 sapply( )
3.2.4.4 tapply( )
3.2.4.5 cut( )
3.2.4.6 split( )
3.2.5 Writing Your Own Functions in R
3.3 Working with R Packages and Libraries
3.4 Summary
Chapter 4: Introduction to descriptive analytics
4.1 Descriptive analytics
4.2 Population and sample
4.3 Statistical parameters of interest
4.3.1 Mean
4.3.2 Median
4.3.3 Mode
4.3.4 Range
4.3.5 Quantiles
4.3.6 Standard deviation
4.3.7 Variance
4.3.8 “Summary” command in R
4.4 Graphical description of the data
4.4.1 Plots in R
4.4.2 Histogram
4.4.3 Bar plot
4.4.4 Boxplots
4.5 Computations on data frames
4.5.1 Scatter plot
4.6 Probability
4.6.1 Probability of mutually exclusive events
4.6.2 Probability of mutually independent events
4.6.3 Probability of mutually non-exclusive events
4.6.4 Probability distributions
4.6.4.1 Normal distribution
4.6.4.2 Binomial distribution
4.6.4.3 Poisson distribution
4.7 Chapter summary
Chapter 5: Business Analytics Process and Data Exploration
5.1 Business Analytics Life Cycle
5.1.1 Phase 1: Understand the Business Problem
5.1.2 Phase 2: Collect and Integrate the Data
5.1.3 Phase 3: Preprocess the Data
5.1.4 Phase 4: Explore and Visualize the Data
5.1.5 Phase 5: Choose Modeling Techniques and Algorithms
5.1.6 Phase 6: Evaluate the Model
5.1.7 Phase 7: Report to Management and Review
5.1.8 Phase 8: Deploy the Model
5.2 Understanding the Business Problem
5.3 Collecting and Integrating the Data
5.3.1 Sampling
5.3.2 Variable Selection
5.4 Preprocessing the Data
5.4.1 Data Types
5.4.2 Data Preparation
5.4.2.1 Handling Missing Values
5.4.2.2 Handling Duplicates, Junk, and Null Values
5.4.3 Data Preprocessing with R
5.5 Exploring and Visualizing the Data
5.5.1 Tables
5.5.2 Summary Tables
5.5.3 Graphs
5.5.3.1 Box plots
5.5.3.2 Scatter plots
5.5.4 Scatter Plot Matrices
5.5.4.1 Trellis Plot
5.5.4.2 Correlation plot
5.5.4.3 Density by Class
5.5.5 Data Transformation
5.5.5.1 Normalization
5.6 Using Modeling Techniques and Algorithms
5.6.1 Descriptive Analytics
5.6.2 Predictive Analytics
5.6.3 Machine Learning
5.6.3.1 Supervised Machine Learning
5.6.3.2 Unsupervised Machine Learning
5.7 Evaluating the Model
5.7.1 Training Data Partition
5.7.2 Test Data Partition
5.7.3 Validation Data Partition
5.7.4 Cross-Validation
5.7.5 Classification Model Evaluation
5.7.5.1 Confusion Matrix
5.7.5.2 Lift Chart
5.7.5.3 ROC Chart
5.7.6 Regression Model Evaluation
5.7.6.1 Root-Mean-Square Error
5.8 Presenting a Management Report and Review
5.8.1 Problem Description
5.8.2 Data Set Used
5.8.3 Data Cleaning Carried Out
5.8.4 Method Used to Create the Model
5.8.5 Model Deployment Prerequisites
5.8.6 Model Deployment and Usage
5.8.7 Issues Handling
5.9 Deploying the Model
5.10 Summary
Chapter 6: Supervised Machine Learning—Classification
6.1 What Is Classification? What Is Prediction?
6.2 Probabilistic Models for Classification
6.2.1 Example
6.2.2 Naïve Bayes Classifier Using R
6.2.3 Advantages and Limitations of the Naïve Bayes Classifier
6.3 Decision Trees
6.3.1 Recursive Partitioning Decision-Tree Algorithm
6.3.2 Information Gain
6.3.3 Example of a Decision Tree
6.3.4 Induction of a Decision Tree
6.3.5 Classification Rules from Tree
6.3.6 Overfitting and Underfitting
6.3.7 Bias and Variance
6.3.8 Avoiding Overfitting Errors and Setting the Size of Tree Growth
6.3.8.1 Limiting Tree Growth
6.3.8.2 Pruning the Tree
6.4 Other Classifier Types
6.4.1 K-Nearest Neighbor
6.4.2 Random Forests
6.5 Classification Example Using R
6.6 Summary
Chapter 7: Unsupervised Machine Learning
7.1 Clustering - Overview
7.2 What Is Clustering?
7.2.1 Measures Between Two Records
7.2.1.1 Euclidean Distance and Manhattan Distance
7.2.1.2 Pearson Product Correlation (Statistical Measurement)
7.2.2 Distance Measures for Categorical Variables
7.2.3 Distance Measures for Mixed Data Types
7.2.4 Distance Between Two Clusters
7.2.4.1 Single Linkage (Minimum Distance)
7.2.4.2 Complete Linkage (Maximum Distance)
7.2.4.3 Average Linkage (Average Distance)
7.2.4.4 Centroid Distance
7.3 Hierarchical Clustering
7.3.1 Dendrograms
7.3.2 Limitations of Hierarchical Clustering
7.4 Nonhierarchical Clustering
7.4.1 K-Means Algorithm
7.4.2 Limitations of K-Means Clustering
7.5 Clustering Case Study
7.5.1 Retain Only Relevant Variables in the Data Set
7.5.2 Remove Any Outliers from the Data Set
7.5.3 Standardize the Data
7.5.4 Calculate the Distance Between the Data Points
7.5.4.1 Use the Selected Approaches to Carry Out the Clustering
7.5.4.2 Hierarchical Clustering Approach
7.5.4.3 Partition Clustering Approach
7.6 Association Rule
7.6.1 Choosing Rules
7.6.1.1 Support and Confidence
7.6.1.2 Lift
7.6.2 Example of Generating Association Rules
7.6.3 Interpreting Results
7.7 Summary
Chapter 8: Simple Linear Regression
8.1 Introduction
8.2 Correlation
8.2.1 Correlation Coefficient
8.3 Hypothesis Testing
8.4 Simple Linear Regression
8.4.1 Assumptions of Regression
8.4.2 Simple Linear Regression Equation
8.4.3 Creating Simple Regression Equation in R
8.4.4 Testing the Assumptions of Regression
8.4.4.1 Test of Linearity
8.4.4.2 Test of Independence of Errors Around the Regression Line
8.4.4.3 Test of Normality
8.4.4.4 Equal variance of the distribution of the response variable
8.4.4.5 Other ways of validating the assumptions to be fulfilled by a Regression model
8.4.4.5.1 Using gvlma library
8.4.4.5.2 Using the Scale-Location Plot
8.4.4.5.3 Using crPlots(model name) function from library(car)
8.4.5 Conclusion
8.4.6 Predicting the Response Variable
8.4.7 Additional Notes
8.5 Chapter Summary
Chapter 9: Multiple Linear Regression
9.1 Using Multiple Linear Regression
9.1.1 The Data
9.1.2 Correlation
9.1.3 Arriving at the Model
9.1.4 Validation of the Assumptions of Regression
9.1.5 Multicollinearity
9.1.6 Stepwise Multiple Linear Regression
9.1.7 All Subsets Approach to Multiple Linear Regression
9.1.8 Multiple Linear Regression Equation
9.1.9 Conclusion
9.2 Using an Alternative Method in R
9.3 Predicting the Response Variable
9.4 Training and Testing the Model
9.5 Cross Validation
9.6 Summary
Chapter 10: Logistic Regression
10.1 Logistic Regression
10.1.1 The Data
10.1.2 Creating the Model
10.1.3 Model Fit Verification
10.1.4 General Words of Caution
10.1.5 Multicollinearity
10.1.6 Dispersion
10.1.7 Conclusion for Logistic Regression
10.2 Training and Testing the Model
10.2.1 Predicting the Response Variable
10.2.2 Alternative Way of Validating the Logistic Regression Model
10.3 Multinomial Logistic Regression
10.4 Regularization
10.5 Summary
Chapter 11: Big Data Analysis—Introduction and Future Trends
11.1 Big Data Ecosystem
11.2 Future Trends in Big Data Analytics
11.2.1 Growth of Social Media
11.2.2 Creation of Data Lakes
11.2.3 Visualization Tools at the Hands of Business Users
11.2.4 Prescriptive Analytics
11.2.5 Internet of Things
11.2.6 Artificial Intelligence
11.2.7 Whole Data Processing
11.2.8 Vertical and Horizontal Applications
11.2.9 Real-Time Analytics
11.2.10 Putting the Analytics in the Hands of Business Users
11.2.11 Migration of Solutions from One Tool to Another
11.2.12 Cloud, Cloud, Everywhere the Cloud
11.2.13 In-Database Analytics
11.2.14 In-Memory Analytics
11.2.15 Autonomous Services for Machine Learning
11.2.16 Addressing Security and Compliance
11.2.17 Healthcare
References
Index
Publication date (per publisher) | 27.12.2016 |
---|---|
Additional info | XVII, 280 p. 278 illus. |
Place of publication | Berkeley |
Language | English |
Subject areas | Computer Science ► Databases ► Data Warehouse / Data Mining |
 | Mathematics / Computer Science ► Computer Science ► Networks |
 | Mathematics / Computer Science ► Computer Science ► Programming Languages / Tools |
 | Mathematics / Computer Science ► Computer Science ► Theory / Studies |
 | Mathematics / Computer Science ► Mathematics ► Applied Mathematics |
 | Natural Sciences |
Keywords | Analytics • Business Analytics • Business • Data Mining • Descriptive Analytics • linear regression • Logistic Regression • predictive analytics • R |
ISBN-10 | 1-4842-2514-7 / 1484225147 |
ISBN-13 | 978-1-4842-2514-1 / 9781484225141 |
Size: 14.9 MB
DRM: Digital watermark
This eBook contains a digital watermark and is therefore personalized for you. If the eBook is improperly passed on to third parties, it can be traced back to the source.
File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly suitable for technical books with columns, tables, and figures. A PDF can be displayed on almost all devices but is only of limited suitability for small displays (smartphone, eReader).
System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You need a PDF viewer, e.g. Adobe Reader or Adobe Digital Editions.
eReader: This eBook can be read on (almost) all eBook readers. However, it is not compatible with the Amazon Kindle.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You need a PDF viewer, e.g. the free Adobe Digital Editions app.
Additional feature: Online reading
In addition to downloading it, you can also read this eBook online in a web browser.
Buying eBooks from abroad
For tax reasons, we can sell eBooks only within Germany and Switzerland. Regrettably, we cannot fulfill eBook orders from other countries.
Copy protection: Adobe DRM
Adobe DRM is a copy protection scheme intended to protect the eBook against misuse. The eBook is authorized to your personal Adobe ID at download time. You can then read the eBook only on devices that are also registered to your Adobe ID.
File format: EPUB (Electronic Publication)
EPUB is an open standard for eBooks and is particularly suitable for fiction and non-fiction. The running text adapts dynamically to the display and font size, so EPUB also works well on mobile reading devices.
System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You need a
eReader: This eBook can be read on (almost) all eBook readers. However, it is not compatible with the Amazon Kindle.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You need a