Big Data Analytics with Microsoft HDInsight in 24 Hours, Sams Teach Yourself
Sams Publishing (publisher)
978-0-672-33727-7 (ISBN)
- Title is out of print; no new edition
In just 24 lessons of one hour or less, Sams Teach Yourself Big Data Analytics with Microsoft HDInsight in 24 Hours helps you leverage Hadoop’s power on a flexible, scalable cloud platform using Microsoft’s newest business intelligence, visualization, and productivity tools.
This book’s straightforward, step-by-step approach shows you how to provision, configure, monitor, and troubleshoot HDInsight and use Hadoop cloud services to solve real analytics problems. You’ll gain more of Hadoop’s benefits, with less complexity, even if you’re completely new to Big Data analytics. Every lesson builds on what you’ve already learned, giving you a rock-solid foundation for real-world success.
Practical, hands-on examples show you how to apply what you learn
Quizzes and exercises help you test your knowledge and stretch your skills
Notes and tips point out shortcuts and solutions
Learn how to…
· Master core Big Data and NoSQL concepts, value propositions, and use cases
· Work with key Hadoop features, such as HDFS2 and YARN
· Quickly install, configure, and monitor Hadoop (HDInsight) clusters in the cloud
· Automate provisioning, customize clusters, install additional Hadoop projects, and administer clusters
· Integrate, analyze, and report with Microsoft BI and Power BI
· Automate workflows for data transformation, integration, and other tasks
· Use Apache HBase on HDInsight
· Use Sqoop or SSIS to move data to or from HDInsight
· Perform R-based statistical computing on HDInsight datasets
· Accelerate analytics with Apache Spark
· Run real-time analytics on high-velocity data streams
· Write MapReduce, Hive, and Pig programs
Register your book at informit.com/register for convenient access to downloads, updates, and corrections as they become available.
Arshad Ali has more than 13 years of experience in the computer industry. As a DB/DW/BI consultant in an end-to-end delivery role, he has worked on several enterprise-scale data warehousing and analytics projects, enabling and developing business intelligence and analytic solutions. He specializes in database, data warehousing, and business intelligence/analytics application design, development, and deployment at the enterprise level. He frequently works with SQL Server, Microsoft Analytics Platform System (APS, formerly known as SQL Server Parallel Data Warehouse [PDW]), HDInsight (Hadoop, Hive, Pig, HBase, and so on), SSIS, SSRS, SSAS, Service Broker, MDS, DQS, SharePoint, and PPS. In the past, he has also handled performance optimization for several projects, with significant performance gains. Arshad is a Microsoft Certified Solutions Expert (MCSE)–SQL Server 2012 Data Platform, and Microsoft Certified IT Professional (MCITP) in Microsoft SQL Server 2008–Database Development, Data Administration, and Business Intelligence. He is also certified on ITIL 2011 foundation. He has developed applications in VB, ASP, .NET, ASP.NET, and C#. He is a Microsoft Certified Application Developer (MCAD) and Microsoft Certified Solution Developer (MCSD) for the .NET platform in Web, Windows, and Enterprise. Arshad has presented at several technical events and has written more than 200 articles related to DB, DW, BI, and BA technologies, best practices, processes, and performance optimization techniques on SQL Server, Hadoop, and related technologies. His articles have been published on several prominent sites. On the educational front, Arshad holds a Master in Computer Applications degree and a Master in Business Administration in IT degree. Arshad can be reached at arshad.ali@live.in or through his blog at http://arshadali.blogspot.in/.
Manpreet Singh is a consultant and author with extensive expertise in architecture, design, and implementation of business intelligence and Big Data analytics solutions. He is passionate about enabling businesses to derive valuable insights from their data. Manpreet has been working on Microsoft technologies for more than 8 years, with a strong focus on Microsoft Business Intelligence Stack, SharePoint BI, and Microsoft’s Big Data Analytics Platforms (Analytics Platform System and HDInsight). He also specializes in Mobile Business Intelligence solution development and has helped businesses deliver a consolidated view of their data to their mobile workforces. Manpreet has coauthored books and technical articles on Microsoft technologies, focusing on the development of data analytics and visualization solutions with the Microsoft BI Stack and SharePoint. He holds a degree in computer science and engineering from Panjab University, India. Manpreet can be reached at manpreet.singh3@hotmail.com.
Introduction
Part I: Understanding Big Data, Hadoop 1.0, and 2.0
Hour 1: Introduction of Big Data, NoSQL, and Business Value Proposition
Types of Analysis
Types of Data
Big Data
Managing Big Data
NoSQL Systems
Big Data, NoSQL Systems, and the Business Value Proposition
Application of Big Data and Big Data Solutions
Summary
Q&A
Hour 2: Introduction to Hadoop, Its Architecture, Ecosystem, and Microsoft Offerings
What Is Apache Hadoop?
Architecture of Hadoop and Hadoop Ecosystems
What’s New in Hadoop 2.0
Architecture of Hadoop 2.0
Tools and Technologies Needed with Big Data Analytics
Major Players and Vendors for Hadoop
Deployment Options for Microsoft Big Data Solutions
Summary
Q&A
Hour 3: Hadoop Distributed File System Versions 1.0 and 2.0
Introduction to HDFS
HDFS Architecture
Rack Awareness
WebHDFS
Accessing and Managing HDFS Data
What’s New in HDFS 2.0
Summary
Q&A
Hour 4: The MapReduce Job Framework and Job Execution Pipeline
Introduction to MapReduce
MapReduce Architecture
MapReduce Job Execution Flow
Summary
Q&A
Hour 5: MapReduce–Advanced Concepts and YARN
DistributedCache
Hadoop Streaming
MapReduce Joins
Bloom Filter
Performance Improvement
Handling Failures
Counter
YARN
Uber-Tasking Optimization
Failures in YARN
Resource Manager High Availability and Automatic Failover in YARN
Summary
Q&A
Part II: Getting Started with HDInsight and Understanding Its Different Components
Hour 6: Getting Started with HDInsight, Provisioning Your HDInsight Service Cluster, and Automating HDInsight Cluster Provisioning
Introduction to Microsoft Azure
Understanding HDInsight Service
Provisioning HDInsight on the Azure Management Portal
Automating HDInsight Provisioning with PowerShell
Managing and Monitoring HDInsight Cluster and Job Execution
Summary
Q&A
Exercise
Hour 7: Exploring Typical Components of HDFS Cluster
HDFS Cluster Components
HDInsight Cluster Architecture
High Availability in HDInsight
Summary
Q&A
Hour 8: Storing Data in Microsoft Azure Storage Blob
Understanding Storage in Microsoft Azure
Benefits of Azure Storage Blob over HDFS
Azure Storage Explorer Tools
Summary
Q&A
Hour 9: Working with Microsoft Azure HDInsight Emulator
Getting Started with HDInsight Emulator
Setting Up Microsoft Azure Emulator for Storage
Summary
Q&A
Part III: Programming MapReduce and HDInsight Script Action
Hour 10: Programming MapReduce Jobs
MapReduce Hello World!
Analyzing Flight Delays with MapReduce
Serialization Frameworks for Hadoop
Hadoop Streaming
Summary
Q&A
Hour 11: Customizing the HDInsight Cluster with Script Action
Identifying the Need for Cluster Customization
Developing Script Action
Consuming Script Action
Running a Giraph Job on a Customized HDInsight Cluster
Testing Script Action with HDInsight Emulator
Summary
Q&A
Part IV: Querying and Processing Big Data in HDInsight
Hour 12: Getting Started with Apache Hive and Apache Tez in HDInsight
Introduction to Apache Hive
Getting Started with Apache Hive in HDInsight
Azure HDInsight Tools for Visual Studio
Programmatically Using the HDInsight .NET SDK
Introduction to Apache Tez
Summary
Q&A
Exercise
Hour 13: Programming with Apache Hive, Apache Tez in HDInsight, and Apache HCatalog
Programming with Hive in HDInsight
Using Tables in Hive
Serialization and Deserialization
Data Load Processes for Hive Tables
Querying Data from Hive Tables
Indexing in Hive
Apache Tez in Action
Apache HCatalog
Summary
Q&A
Exercise
Hour 14: Consuming HDInsight Data from Microsoft BI Tools over Hive ODBC Driver: Part 1
Introduction to Hive ODBC Driver
Introduction to Microsoft Power BI
Accessing Hive Data from Microsoft Excel
Summary
Q&A
Hour 15: Consuming HDInsight Data from Microsoft BI Tools over Hive ODBC Driver: Part 2
Accessing Hive Data from PowerPivot
Accessing Hive Data from SQL Server
Accessing HDInsight Data from Power Query
Summary
Q&A
Exercise
Hour 16: Integrating HDInsight with SQL Server Integration Services
The Need for Data Movement
Introduction to SSIS
Analyzing On-time Flight Departure with SSIS
Provisioning HDInsight Cluster
Summary
Q&A
Hour 17: Using Pig for Data Processing
Introduction to Pig Latin
Using Pig to Count Cancelled Flights
Using HCatalog in a Pig Latin Script
Submitting Pig Jobs with PowerShell
Summary
Q&A
Hour 18: Using Sqoop for Data Movement Between RDBMS and HDInsight
What Is Sqoop?
Using Sqoop Import and Export Commands
Using Sqoop with PowerShell
Summary
Q&A
Part V: Managing Workflow and Performing Statistical Computing
Hour 19: Using Oozie Workflows and Job Orchestration with HDInsight
Introduction to Oozie
Determining On-time Flight Departure Percentage with Oozie
Submitting an Oozie Workflow with HDInsight .NET SDK
Coordinating Workflows with Oozie
Oozie Compared to SSIS
Summary
Q&A
Hour 20: Performing Statistical Computing with R
Introduction to R
Integrating R with Hadoop
Enabling R on HDInsight
Summary
Q&A
Part VI: Performing Interactive Analytics and Machine Learning
Hour 21: Performing Big Data Analytics with Spark
Introduction to Spark
Spark Programming Model
Blending SQL Querying with Functional Programs
Summary
Q&A
Hour 22: Microsoft Azure Machine Learning
History of Traditional Machine Learning
Introduction to Azure ML
Azure ML Workspace
Processes to Build Azure ML Solutions
Getting Started with Azure ML
Creating Predictive Models with Azure ML
Publishing Azure ML Models as Web Services
Summary
Q&A
Exercise
Part VII: Performing Real-time Analytics
Hour 23: Performing Stream Analytics with Storm
Introduction to Storm
Using SCP.NET to Develop Storm Solutions
Analyzing Speed Limit Violation Incidents with Storm
Summary
Q&A
Hour 24: Introduction to Apache HBase on HDInsight
Introduction to Apache HBase
HBase Architecture
Creating HDInsight Cluster with HBase
Summary
Q&A
Publication date (per publisher) | 19.11.2015 |
---|---|
Place of publication | Indianapolis |
Language | English |
Dimensions | 181 x 230 mm |
Weight | 916 g |
Subject area | Mathematics / Computer Science ► Computer Science ► Databases |
 | Computer Science ► Further Topics ► Certification |
ISBN-10 | 0-672-33727-4 / 0672337274 |
ISBN-13 | 978-0-672-33727-7 / 9780672337277 |
Condition | New |