Data Analytics with Spark Using Python

Jeffrey Aven (Author)

Book | Softcover

320 pages

2018
Pearson Education (US) (Publisher)
978-0-13-484601-9 (ISBN)

Solve Data Analytics Problems with Spark, PySpark, and Related Open Source Tools

Spark is at the heart of today’s Big Data revolution, helping data professionals supercharge efficiency and performance in a wide range of data processing and analytics tasks. In this guide, Big Data expert Jeffrey Aven covers all you need to know to leverage Spark, together with its extensions, subprojects, and wider ecosystem.

Aven combines a language-agnostic introduction to foundational Spark concepts with extensive programming examples utilizing the popular and intuitive PySpark development environment. This guide’s focus on Python makes it widely accessible to large audiences of data professionals, analysts, and developers—even those with little Hadoop or Spark experience.

Aven’s broad coverage ranges from basic to advanced Spark programming, and Spark SQL to machine learning. You’ll learn how to efficiently manage all forms of data with Spark: streaming, structured, semi-structured, and unstructured. Throughout, concise topic overviews quickly get you up to speed, and extensive hands-on exercises prepare you to solve real problems.

Coverage includes:
• Understand Spark’s evolving role in the Big Data and Hadoop ecosystems
• Create Spark clusters using various deployment modes
• Control and optimize the operation of Spark clusters and applications
• Master Spark Core RDD API programming techniques
• Extend, accelerate, and optimize Spark routines with advanced API platform constructs, including shared variables, RDD storage, and partitioning
• Efficiently integrate Spark with both SQL and nonrelational data stores
• Perform stream processing and messaging with Spark Streaming and Apache Kafka
• Implement predictive modeling with SparkR and Spark MLlib

Jeffrey Aven is an independent Big Data, open source software, and cloud computing professional based in Melbourne, Australia. Jeffrey is a highly regarded consultant and instructor and has authored several other books, including Teach Yourself Apache Spark in 24 Hours and Teach Yourself Hadoop in 24 Hours.

Preface     xi
Introduction     1

PART I: SPARK FOUNDATIONS
Chapter 1 Introducing Big Data, Hadoop, and Spark     5
Introduction to Big Data, Distributed Computing, and Hadoop     5
     A Brief History of Big Data and Hadoop     6
     Hadoop Explained     7
Introduction to Apache Spark     13
     Apache Spark Background     13
     Uses for Spark     14
     Programming Interfaces to Spark     14
     Submission Types for Spark Programs     14
     Input/Output Types for Spark Applications     16
     The Spark RDD     16
     Spark and Hadoop     16
Functional Programming Using Python     17
     Data Structures Used in Functional Python Programming     17
     Python Object Serialization     20
     Python Functional Programming Basics     23
Summary     25
Chapter 2 Deploying Spark     27
Spark Deployment Modes     27
     Local Mode     28
     Spark Standalone     28
     Spark on YARN     29
     Spark on Mesos     30
Preparing to Install Spark     30
Getting Spark     31
Installing Spark on Linux or Mac OS X     32
Installing Spark on Windows     34
Exploring the Spark Installation     36
Deploying a Multi-Node Spark Standalone Cluster     37
Deploying Spark in the Cloud     39
     Amazon Web Services (AWS)     39
     Google Cloud Platform (GCP)     41
     Databricks     42
Summary     43
Chapter 3 Understanding the Spark Cluster Architecture     45
Anatomy of a Spark Application     45
     Spark Driver     46
     Spark Workers and Executors     49
     The Spark Master and Cluster Manager     51
Spark Applications Using the Standalone Scheduler     53
     Spark Applications Running on YARN     53
Deployment Modes for Spark Applications Running on YARN     53
     Client Mode     54
     Cluster Mode     55
     Local Mode Revisited     56
Summary     57
Chapter 4 Learning Spark Programming Basics     59
Introduction to RDDs     59
Loading Data into RDDs     61
     Creating an RDD from a File or Files     61
     Methods for Creating RDDs from a Text File or Files     63
     Creating an RDD from an Object File     66
     Creating an RDD from a Data Source     66
     Creating RDDs from JSON Files     69
     Creating an RDD Programmatically     71
Operations on RDDs     72
     Key RDD Concepts     72
     Basic RDD Transformations     77
     Basic RDD Actions     81
     Transformations on PairRDDs     85
     MapReduce and Word Count Exercise     92
     Join Transformations     95
     Joining Datasets in Spark     100
     Transformations on Sets     103
     Transformations on Numeric RDDs     105
Summary     108

PART II: BEYOND THE BASICS
Chapter 5 Advanced Programming Using the Spark Core API     111
Shared Variables in Spark     111
     Broadcast Variables     112
     Accumulators     116
     Exercise: Using Broadcast Variables and Accumulators     119
Partitioning Data in Spark     120
     Partitioning Overview     120
     Controlling Partitions     121
     Repartitioning Functions     123
     Partition-Specific or Partition-Aware API Methods     125
RDD Storage Options     127
     RDD Lineage Revisited     127
     RDD Storage Options     128
     RDD Caching     131
     Persisting RDDs     131
     Choosing When to Persist or Cache RDDs     134
     Checkpointing RDDs     134
     Exercise: Checkpointing RDDs     136
Processing RDDs with External Programs     138
Data Sampling with Spark     139
Understanding Spark Application and Cluster Configuration     141
     Spark Environment Variables     141
     Spark Configuration Properties     145
Optimizing Spark     148
     Filter Early, Filter Often     149
     Optimizing Associative Operations     149
     Understanding the Impact of Functions and Closures     151
     Considerations for Collecting Data     152
     Configuration Parameters for Tuning and Optimizing Applications     152
     Avoiding Inefficient Partitioning     153
     Diagnosing Application Performance Issues     155
Summary     159
Chapter 6 SQL and NoSQL Programming with Spark     161
Introduction to Spark SQL     161
     Introduction to Hive     162
     Spark SQL Architecture     166
     Getting Started with DataFrames     168
     Using DataFrames     179
     Caching, Persisting, and Repartitioning DataFrames     187
     Saving DataFrame Output     188
     Accessing Spark SQL     191
     Exercise: Using Spark SQL     194
Using Spark with NoSQL Systems     195
     Introduction to NoSQL     196
     Using Spark with HBase     197
     Exercise: Using Spark with HBase     200
     Using Spark with Cassandra     202
     Using Spark with DynamoDB     204
     Other NoSQL Platforms     206
Summary     206
Chapter 7 Stream Processing and Messaging Using Spark     209
Introducing Spark Streaming     209
     Spark Streaming Architecture     210
     Introduction to DStreams     211
     Exercise: Getting Started with Spark Streaming     218
     State Operations     219
     Sliding Window Operations     221
Structured Streaming     223
     Structured Streaming Data Sources     224
     Structured Streaming Data Sinks     225
     Output Modes     226
     Structured Streaming Operations     227
Using Spark with Messaging Platforms     228
     Apache Kafka     229
     Exercise: Using Spark with Kafka     234
     Amazon Kinesis     237
Summary     240
Chapter 8 Introduction to Data Science and Machine Learning Using Spark     243
Spark and R     243
     Introduction to R     244
     Using Spark with R     250
     Exercise: Using RStudio with SparkR     257
Machine Learning with Spark     259
     Machine Learning Primer     259
     Machine Learning Using Spark MLlib     262
     Exercise: Implementing a Recommender Using Spark MLlib     267
     Machine Learning Using Spark ML     271
Using Notebooks with Spark     275
     Using Jupyter (IPython) Notebooks with Spark     275
     Using Apache Zeppelin Notebooks with Spark     278
Summary     279
Index     281

Publication date	13.09.2018
Series	Addison-Wesley Data & Analytics Series
Place of publication	Upper Saddle River
Language	English
Dimensions	100 x 100 mm
Weight	100 g
Subject area	Computer Science ► Databases ► Data Warehouse / Data Mining
Subject area	Mathematics / Computer Science ► Computer Science ► Web / Internet
ISBN-10	0-13-484601-X / 013484601X
ISBN-13	978-0-13-484601-9 / 9780134846019
Condition	New