Professional Hadoop Solutions

Boris Lublinsky, Kevin T. Smith, Alexey Yakubovich (Authors)

Book | Softcover

504 pages

2013
John Wiley & Sons Inc (Publisher)
978-1-118-61193-7 (ISBN)

This title is out of print;
no new edition

The go-to guidebook for deploying Big Data solutions with Hadoop.

Today's enterprise architects need to understand how the Hadoop frameworks and APIs fit together, and how they can be integrated to deliver real-world solutions. This book is a practical, detailed guide to building and implementing those solutions, with code-level instruction in the popular Wrox tradition. It covers storing data with HDFS and HBase, processing data with MapReduce, and automating data processing with Oozie. Hadoop security, running Hadoop with Amazon Web Services, best practices, and automating Hadoop processes in real time are also covered in depth.

With in-depth code examples in Java and XML and the latest on recent additions to the Hadoop ecosystem, this complete resource also covers the use of APIs, exposing their inner workings and allowing architects and developers to better leverage and customize them.

The ultimate guide for developers, designers, and architects who need to build and deploy Hadoop applications
Covers storing and processing data with various technologies, automating data processing, Hadoop security, and delivering real-time solutions
Includes detailed, real-world examples and code-level guidelines
Explains when, why, and how to use these tools effectively
Written by a team of Hadoop experts in the programmer-to-programmer Wrox style

"Professional Hadoop Solutions" is the reference enterprise architects and developers need to maximize the power of Hadoop.

Boris Lublinsky is principal architect at Nokia and an author of more than 70 publications, including "Applied SOA: Service-Oriented Architecture and Design Strategies".

Kevin T. Smith is Director of Technology Solutions for the AMS division of Novetta Solutions, where he builds highly secure, data-oriented solutions for customers.

Alexey Yakubovich is a system architect at Hortonworks and a member of the Object Management Group SIG on SOA governance and model-driven architecture.

Introduction
Chapter 1: Big Data and the Hadoop Ecosystem
Big Data Meets Hadoop
Hadoop: Meeting the Big Data Challenge
Data Science in the Business World
The Hadoop Ecosystem
Hadoop Core Components
Hadoop Distributions
Developing Enterprise Applications with Hadoop
Summary
Chapter 2: Storing Data in Hadoop
HDFS
HDFS Architecture
Using HDFS Files
Hadoop-Specific File Types
HDFS Federation and High Availability
HBase
HBase Architecture
HBase Schema Design
Programming for HBase
New HBase Features
Combining HDFS and HBase for Effective Data Storage
Using Apache Avro
Managing Metadata with HCatalog
Choosing an Appropriate Hadoop Data Organization for Your Applications
Summary
Chapter 3: Processing Your Data with MapReduce
Getting to Know MapReduce
MapReduce Execution Pipeline
Runtime Coordination and Task Management in MapReduce
Your First MapReduce Application
Building and Executing MapReduce Programs
Designing MapReduce Implementations
Using MapReduce as a Framework for Parallel Processing
Simple Data Processing with MapReduce
Building Joins with MapReduce
Building Iterative MapReduce Applications
To MapReduce or Not to MapReduce?
Common MapReduce Design Gotchas
Summary
Chapter 4: Customizing MapReduce Execution
Controlling MapReduce Execution with InputFormat
Implementing InputFormat for Compute-Intensive Applications
Implementing InputFormat to Control the Number of Maps
Implementing InputFormat for Multiple HBase Tables
Reading Data Your Way with Custom RecordReaders
Implementing a Queue-Based RecordReader
Implementing RecordReader for XML Data
Organizing Output Data with Custom Output Formats
Implementing OutputFormat for Splitting MapReduce Job’s Output into Multiple Directories
Writing Data Your Way with Custom RecordWriters
Implementing a RecordWriter to Produce Output tar Files
Optimizing Your MapReduce Execution with a Combiner
Controlling Reducer Execution with Partitioners
Implementing a Custom Partitioner for One-to-Many Joins
Using Non-Java Code with Hadoop
Pipes
Hadoop Streaming
Using JNI
Summary
Chapter 5: Building Reliable MapReduce Apps
Unit Testing MapReduce Applications
Testing Mappers
Testing Reducers
Integration Testing
Local Application Testing with Eclipse
Using Logging for Hadoop Testing
Processing Applications Logs
Reporting Metrics with Job Counters
Defensive Programming in MapReduce
Summary
Chapter 6: Automating Data Processing with Oozie
Getting to Know Oozie
Oozie Workflow
Executing Asynchronous Activities in Oozie Workflow
Oozie Recovery Capabilities
Oozie Workflow Job Life Cycle
Oozie Coordinator
Oozie Bundle
Oozie Parameterization with Expression Language
Workflow Functions
Coordinator Functions
Bundle Functions
Other EL Functions
Oozie Job Execution Model
Accessing Oozie
Oozie SLA
Summary
Chapter 7: Using Oozie
Validating Information about Places Using Probes
Designing Place Validation Based on Probes
Designing Oozie Workflows
Implementing Oozie Workflow Applications
Implementing the Data Preparation Workflow
Implementing Attendance Index and Cluster Strands Workflows
Implementing Workflow Activities
Populating the Execution Context from a java Action
Using MapReduce Jobs in Oozie Workflows
Implementing Oozie Coordinator Applications
Implementing Oozie Bundle Applications
Deploying, Testing, and Executing Oozie Applications
Deploying Oozie Applications
Using the Oozie CLI for Execution of an Oozie Application
Passing Arguments to Oozie Jobs
Using the Oozie Console to Get Information about Oozie Applications
Getting to Know the Oozie Console Screens
Getting Information about a Coordinator Job
Summary
Chapter 8: Advanced Oozie Features
Building Custom Oozie Workflow Actions
Implementing a Custom Oozie Workflow Action
Deploying Oozie Custom Workflow Actions
Adding Dynamic Execution to Oozie Workflows
Overall Implementation Approach
A Machine Learning Model, Parameters, and Algorithm
Defining a Workflow for an Iterative Process
Dynamic Workflow Generation
Using the Oozie Java API
Using Uber Jars with Oozie Applications
Data Ingestion Conveyer
Summary
Chapter 9: Real-Time Hadoop
Real-Time Applications in the Real World
Using HBase for Implementing Real-Time Applications
Using HBase as a Picture Management System
Using HBase as a Lucene Back End
Using Specialized Real-Time Hadoop Query Systems
Apache Drill
Impala
Comparing Real-Time Queries to MapReduce
Using Hadoop-Based Event-Processing Systems
HFlame
Storm
Comparing Event Processing to MapReduce
Summary
Chapter 10: Hadoop Security
A Brief History: Understanding Hadoop Security Challenges
Authentication
Kerberos Authentication
Delegated Security Credentials
Authorization
HDFS File Permissions
Service-Level Authorization
Job Authorization
Oozie Authentication and Authorization
Network Encryption
Security Enhancements with Project Rhino
HDFS Disk-Level Encryption
Token-Based Authentication and Unified Authorization Framework
HBase Cell-Level Security
Putting it All Together: Best Practices for Securing Hadoop
Authentication
Authorization
Network Encryption
Stay Tuned for Hadoop Enhancements
Summary
Chapter 11: Running Hadoop Applications on AWS
Getting to Know AWS
Options for Running Hadoop on AWS
Custom Installation using EC2 Instances
Elastic MapReduce
Additional Considerations before Making Your Choice
Understanding the EMR-Hadoop Relationship
EMR Architecture
Using S3 Storage
Maximizing Your Use of EMR
Utilizing CloudWatch and Other AWS Components
Accessing and Using EMR
Using AWS S3
Understanding the Use of Buckets
Content Browsing with the Console
Programmatically Accessing Files in S3
Using MapReduce to Upload Multiple Files to S3
Automating EMR Job Flow Creation and Job Execution
Orchestrating Job Execution in EMR
Using Oozie on an EMR Cluster
AWS Simple Workflow
AWS Data Pipeline
Summary
Chapter 12: Building Enterprise Security Solutions for Hadoop Implementations
Security Concerns for Enterprise Applications
Authentication
Authorization
Confidentiality
Integrity
Auditing
What Hadoop Security Doesn’t Natively Provide for Enterprise Applications
Data-Oriented Access Control
Differential Privacy
Encrypted Data at Rest
Enterprise Security Integration
Approaches for Securing Enterprise Applications Using Hadoop
Access Control Protection with Accumulo
Encryption at Rest
Network Isolation and Separation Approaches
Summary
Chapter 13: Hadoop’s Future
Simplifying MapReduce Programming with DSLs
What Are DSLs?
DSLs for Hadoop
Faster, More Scalable Processing
Apache YARN
Tez
Security Enhancements
Emerging Trends
Summary
Appendix: Useful Reading
Index

Place of publication	New York
Language	English
Dimensions	187 x 234 mm
Weight	848 g
Binding	Paperback
Subject area	Mathematics / Computer Science ► Computer Science ► Databases
	Mathematics / Computer Science ► Computer Science ► Networks
	Computer Science ► Other Topics ► Hardware
ISBN-10	1-118-61193-4 / 1118611934
ISBN-13	978-1-118-61193-7 / 9781118611937
Condition	New