Apache Spark 2: Data Processing and Real-Time Analytics

Master complex big data processing, stream analytics, and machine learning with Apache Spark

Romeo Kienzler, Md. Rezaul Karim, Sridhar Alla, Siamak Amirghodsi, Meenakshi Rajendran (Authors)

Book | Softcover

616 pages

2018
Packt Publishing Limited (Publisher)
978-1-78995-920-8 (ISBN)

Build efficient data flow and machine learning programs with this flexible, multi-functional open-source cluster-computing framework

Key Features

Master the art of real-time big data processing and machine learning
Explore a wide range of use cases for analyzing large datasets
Discover ways to optimize your work by using many features of Spark 2.x and Scala

Book Description

Apache Spark is an in-memory, cluster-based data processing system that provides a wide range of functionality, including big data processing, analytics, and machine learning. With this Learning Path, you can take your knowledge of Apache Spark to the next level by learning how to expand Spark's functionality and build your own data flow and machine learning programs on this platform.

You will work with the different modules in Apache Spark, such as interactive querying with Spark SQL, using DataFrames and datasets, implementing streaming analytics with Spark Streaming, and applying machine learning and deep learning techniques on Spark using MLlib and various external tools.

By the end of this elaborately designed Learning Path, you will have all the knowledge you need to master Apache Spark, and build your own big data processing and analytics pipeline quickly and without any hassle.

This Learning Path includes content from the following Packt products:

Mastering Apache Spark 2.x by Romeo Kienzler
Scala and Spark for Big Data Analytics by Md. Rezaul Karim, Sridhar Alla
Apache Spark 2.x Machine Learning Cookbook by Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, Shuen Mei

What you will learn

Get to grips with all the features of Apache Spark 2.x
Perform highly optimized real-time big data processing
Use ML and DL techniques with Spark MLlib and third-party tools
Analyze structured and unstructured data using Spark SQL and GraphX
Understand tuning, debugging, and monitoring of big data applications
Build scalable and fault-tolerant streaming applications
Develop scalable recommendation engines

Who this book is for

If you are an intermediate-level Spark developer looking to master the advanced capabilities and use cases of Apache Spark 2.x, this Learning Path is ideal for you. Big data professionals who want to learn how to integrate and use the features of Apache Spark and build a strong big data pipeline will also find this Learning Path useful. To grasp the concepts explained in this Learning Path, you must know the fundamentals of Apache Spark and Scala.

Romeo Kienzler works as the chief data scientist in the IBM Watson IoT worldwide team, helping clients to apply advanced machine learning at scale on their IoT sensor data. He holds a Master's degree in computer science from the Swiss Federal Institute of Technology, Zurich, with a specialization in information systems, bioinformatics, and applied statistics.

Md. Rezaul Karim is a Research Scientist at Fraunhofer FIT, Germany. He is also a PhD candidate at RWTH Aachen University, Aachen, Germany. He has more than 8 years' experience in the area of research and development with a solid understanding of algorithms and data structures in C, C++, Java, Scala, R, and Python.

Sridhar Alla is a big data expert helping companies solve complex problems in distributed computing, large-scale data science, and analytics practice. He holds a bachelor's in computer science from JNTU, India. He loves writing code in Python, Scala, and Java. He also has extensive hands-on knowledge of several Hadoop-based technologies, TensorFlow, NoSQL, IoT, and deep learning.

Siamak Amirghodsi (Sammy) is interested in building advanced technical teams, executive management, Spark, Hadoop, big data analytics, AI, deep learning nets, TensorFlow, cognitive models, swarm algorithms, real-time streaming systems, quantum computing, financial risk management, trading signal discovery, econometrics, long-term financial cycles, IoT, blockchain, probabilistic graphical models, cryptography, and NLP.

Meenakshi Rajendran is experienced in the end-to-end delivery of data analytics and data science products for leading financial institutions. Meenakshi holds a master's degree in business administration and is a certified PMP with over 13 years of experience in global software delivery environments. Her areas of research and interest are Apache Spark, cloud, regulatory data governance, machine learning, Cassandra, and managing global data teams at scale.

Broderick Hall is a hands-on big data analytics expert and holds a master’s degree in computer science with 20 years of experience in designing and developing complex enterprise-wide software applications with real-time and regulatory requirements at a global scale. He is a deep learning early adopter and is currently working on a large-scale cloud-based data platform with deep learning net augmentation.

Shuen Mei is a big data analytic platforms expert with 15+ years of experience in designing, building, and executing large-scale, enterprise-distributed financial systems with mission-critical low-latency requirements. He holds certifications in Apache Spark and the Cloudera Big Data platform, including Developer, Admin, and HBase. He is also a certified AWS solutions architect with an emphasis on petabyte-range real-time data platform systems.

Table of Contents

A First Taste and What's New in Apache Spark V2
Apache Spark Streaming
Structured Streaming
Apache Spark MLlib
Apache SparkML
Apache SystemML
Apache Spark GraphX
Spark Tuning
Testing and Debugging Spark
Practical Machine Learning with Spark Using Scala
Spark's Three Data Musketeers for Machine Learning - Perfect Together
Common Recipes for Implementing a Robust Machine Learning System
Recommendation Engine that Scales with Spark
Unsupervised Clustering with Apache Spark 2.0
Implementing Text Analytics with Spark 2.0 ML Library
Spark Streaming and Machine Learning Library

Publication date	18 January 2019
Place of publication	Birmingham
Language	English
Dimensions	75 x 93 mm
Subject area	Computer Science ► Databases ► Data Warehouse / Data Mining
Subject area	Mathematics / Computer Science ► Computer Science ► Theory / Studies
ISBN-10	1-78995-920-9 / 1789959209
ISBN-13	978-1-78995-920-8 / 9781789959208
Condition	New