Multicore and GPU Programming - Gerassimos Barlas

Multicore and GPU Programming (eBook)

An Integrated Approach
eBook Download: PDF | EPUB
2014 | 1st edition
698 pages
Elsevier Science (publisher)
978-0-12-417140-4 (ISBN)
€56.99 incl. VAT
(CHF 55.65)
eBook sales are handled by Lehmanns Media GmbH (Berlin) at the price in euros incl. VAT.
  • Download available immediately
Multicore and GPU Programming offers broad coverage of the key parallel computing skillsets: multicore CPU programming and manycore 'massively parallel' computing. Using threads, OpenMP, MPI, and CUDA, it teaches the design and development of software capable of taking advantage of today's computing platforms incorporating CPU and GPU hardware and explains how to transition from sequential programming to a parallel computing paradigm. Presenting material refined over more than a decade of teaching parallel computing, author Gerassimos Barlas minimizes the challenge with multiple examples, extensive case studies, and full source code. Using this book, you can develop programs that run over distributed memory machines using MPI, create multi-threaded applications with either libraries or directives, write optimized applications that balance the workload between available computing resources, and profile and debug programs targeting multicore machines.
  • Comprehensive coverage of all major multicore programming tools, including threads, OpenMP, MPI, and CUDA
  • Demonstrates parallel programming design patterns and examples of how different tools and paradigms can be integrated for superior performance
  • Particular focus on the emerging area of divisible load theory and its impact on load balancing and distributed systems
  • Download source code, examples, and instructor support materials on the book's companion website

Gerassimos Barlas is a Professor with the Computer Science & Engineering Department, American University of Sharjah, Sharjah, UAE. His research interests include parallel algorithms; development, analysis, and modeling frameworks for load balancing; and distributed Video on Demand. Prof. Barlas has taught parallel computing for more than 12 years, has been involved with parallel computing since the early 90s, and is active in the emerging field of Divisible Load Theory for parallel and distributed systems.

Front Cover 1
Multicore and GPU Programming: An Integrated Approach 4
Copyright 5
Dedication 6
Contents 8
List of Tables 14
Preface 16
What Is in This Book 16
Using This Book as a Textbook 18
Software and Hardware Requirements 19
Sample Code 20
Chapter 1: Introduction 22
1.1 The era of multicore machines 22
1.2 A taxonomy of parallel machines 24
1.3 A glimpse of contemporary computing machines 26
1.3.1 The Cell BE processor 27
1.3.2 Nvidia's Kepler 28
1.3.3 AMD's APUs 31
1.3.4 Multicore to many-core: Tilera's TILE-Gx8072 and Intel's Xeon Phi 32
1.4 Performance metrics 35
1.5 Predicting and measuring parallel program performance 39
1.5.1 Amdahl's law 42
1.5.2 Gustafson-Barsis's rebuttal 45
Exercises 46
Chapter 2: Multicore and parallel program design 48
2.1 Introduction 48
2.2 The PCAM methodology 49
2.3 Decomposition patterns 53
2.3.1 Task parallelism 54
2.3.2 Divide-and-conquer decomposition 55
2.3.3 Geometric decomposition 57
2.3.4 Recursive data decomposition 60
2.3.5 Pipeline decomposition 63
2.3.6 Event-based coordination decomposition 67
2.4 Program structure patterns 68
2.4.1 Single-program, multiple-data 69
2.4.2 Multiple-program, multiple-data 69
2.4.3 Master-worker 70
2.4.4 Map-reduce 71
2.4.5 Fork/join 72
2.4.6 Loop parallelism 74
2.5 Matching decomposition patterns with program structure patterns 74
Exercises 75
Chapter 3: Shared-memory programming: threads 76
3.1 Introduction 76
3.2 Threads 79
3.2.1 What is a thread? 79
3.2.2 What are threads good for? 80
3.2.3 Thread creation and initialization 80
3.2.3.1 Implicit thread creation 84
3.2.4 Sharing data between threads 86
3.3 Design concerns 89
3.4 Semaphores 91
3.5 Applying semaphores in classical problems 96
3.5.1 Producers-consumers 96
3.5.2 Dealing with termination 100
3.5.2.1 Termination using a shared data item 100
3.5.2.2 Termination using messages 106
3.5.3 The barbershop problem: introducing fairness 111
3.5.4 Readers-writers 116
3.5.4.1 A solution favoring the readers 116
3.5.4.2 Giving priority to the writers 117
3.5.4.3 A fair solution 119
3.6 Monitors 120
3.6.1 Design approach 1: critical section inside the monitor 124
3.6.2 Design approach 2: monitor controls entry to critical section 125
3.7 Applying monitors in classical problems 128
3.7.1 Producers-consumers revisited 128
3.7.1.1 Producers-consumers: buffer manipulation within the monitor 128
3.7.1.2 Producers-consumers: buffer insertion/extraction exterior to the monitor 131
3.7.2 Readers-writers 134
3.7.2.1 A solution favoring the readers 134
3.7.2.2 Giving priority to the writers 136
3.7.2.3 A fair solution 137
3.8 Dynamic vs. static thread management 141
3.8.1 Qt's thread pool 141
3.8.2 Creating and managing a pool of threads 142
3.9 Debugging multithreaded applications 151
3.10 Higher-level constructs: multithreaded programming without threads 156
3.10.1 Concurrent map 157
3.10.2 Map-reduce 159
3.10.3 Concurrent filter 161
3.10.4 Filter-reduce 163
3.10.5 A case study: multithreaded sorting 164
3.10.6 A case study: multithreaded image matching 173
Exercises 180
Chapter 4: Shared-memory programming: OpenMP 186
4.1 Introduction 186
4.2 Your First OpenMP Program 187
4.3 Variable Scope 190
4.3.1 OpenMP Integration V.0: Manual Partitioning 192
4.3.2 OpenMP Integration V.1: Manual Partitioning Without a Race Condition 194
4.3.3 OpenMP Integration V.2: Implicit Partitioning with Locking 196
4.3.4 OpenMP Integration V.3: Implicit Partitioning with Reduction 197
4.3.5 Final Words on Variable Scope 199
4.4 Loop-Level Parallelism 200
4.4.1 Data Dependencies 202
4.4.1.1 Flow Dependencies 204
4.4.1.2 Antidependencies 211
4.4.1.3 Output Dependencies 211
4.4.2 Nested Loops 212
4.4.3 Scheduling 213
4.5 Task Parallelism 216
4.5.1 The sections Directive 217
4.5.1.1 Producers-Consumers in OpenMP 218
4.5.2 The task Directive 223
4.6 Synchronization Constructs 229
4.7 Correctness and Optimization Issues 237
4.7.1 Thread Safety 237
4.7.2 False Sharing 241
4.8 A Case Study: Sorting in OpenMP 247
4.8.1 Bottom-Up Mergesort in OpenMP 248
4.8.2 Top-Down Mergesort in OpenMP 251
4.8.3 Performance Comparison 256
Exercises 257
Chapter 5: Distributed memory programming 260
5.1 Communicating Processes 260
5.2 MPI 261
5.3 Core concepts 262
5.4 Your first MPI program 263
5.5 Program architecture 267
5.5.1 SPMD 267
5.5.2 MPMD 267
5.6 Point-to-Point communication 269
5.7 Alternative Point-to-Point communication modes 273
5.7.1 Buffered Communications 274
5.8 Non-blocking communications 276
5.9 Point-to-Point Communications: Summary 280
5.10 Error reporting and handling 280
5.11 Collective communications 282
5.11.1 Scattering 287
5.11.2 Gathering 293
5.11.3 Reduction 295
5.11.4 All-to-All Gathering 300
5.11.5 All-to-All Scattering 304
5.11.6 All-to-All Reduction 309
5.11.7 Global Synchronization 310
5.12 Communicating objects 310
5.12.1 Derived Datatypes 311
5.12.2 Packing/Unpacking 318
5.13 Node management: communicators and groups 321
5.13.1 Creating Groups 321
5.13.2 Creating Intra-Communicators 323
5.14 One-sided communications 326
5.14.1 RMA Communication Functions 328
5.14.2 RMA Synchronization Functions 329
5.15 I/O considerations 338
5.16 Combining MPI processes with threads 346
5.17 Timing and Performance Measurements 349
5.18 Debugging and profiling MPI programs 350
5.19 The Boost.MPI library 354
5.19.1 Blocking and Non-Blocking Communications 356
5.19.2 Data Serialization 361
5.19.3 Collective Operations 364
5.20 A case study: diffusion-limited aggregation 368
5.21 A case study: brute-force encryption cracking 373
5.21.1 Version #1: "plain-vanilla" MPI 373
5.21.2 Version #2: combining MPI and OpenMP 379
5.22 A Case Study: MPI Implementation of the Master-Worker Pattern 383
5.22.1 A Simple Master-Worker Setup 384
5.22.2 A Multithreaded Master-Worker Setup 392
Exercises 407
Chapter 6: GPU programming 412
6.1 GPU Programming 412
6.2 CUDA's programming model: threads, blocks, and grids 415
6.3 CUDA's execution model: streaming multiprocessors and warps 421
6.4 CUDA compilation process 424
6.5 Putting together a CUDA project 428
6.6 Memory hierarchy 431
6.6.1 Local Memory/Registers 437
6.6.2 Shared Memory 438
6.6.3 Constant Memory 446
6.6.4 Texture and Surface Memory 453
6.7 Optimization techniques 453
6.7.1 Block and Grid Design 453
6.7.2 Kernel Structure 463
6.7.3 Shared Memory Access 467
6.7.4 Global Memory Access 475
6.7.5 Page-Locked and Zero-Copy Memory 479
6.7.6 Unified Memory 482
6.7.7 Asynchronous Execution and Streams 485
6.7.7.1 Stream Synchronization: Events and Callbacks 488
6.8 Dynamic parallelism 492
6.9 Debugging CUDA programs 496
6.10 Profiling CUDA programs 497
6.11 CUDA and MPI 501
6.12 Case studies 506
6.12.1 Fractal Set Calculation 507
6.12.1.1 Version #1: One thread per pixel 508
6.12.1.2 Version #2: Pinned host and pitched device memory 511
6.12.1.3 Version #3: Multiple pixels per thread 513
6.12.1.4 Evaluation 515
6.12.2 Block Cipher Encryption 517
6.12.2.1 Version #1: The case of a standalone GPU machine 524
6.12.2.2 Version #2: Overlapping GPU communication and computation 531
6.12.2.3 Version #3: Using a cluster of GPU machines 533
6.12.2.4 Evaluation 540
Exercises 544
Chapter 7: The Thrust template library 548
7.1 Introduction 548
7.2 First steps in Thrust 549
7.3 Working with Thrust datatypes 553
7.4 Thrust algorithms 556
7.4.1 Transformations 557
7.4.2 Sorting and searching 561
7.4.3 Reductions 567
7.4.4 Scans/prefix sums 569
7.4.5 Data management and manipulation 571
7.5 Fancy iterators 574
7.6 Switching device back ends 580
7.7 Case studies 582
7.7.1 Monte Carlo integration 582
7.7.2 DNA Sequence alignment 585
Exercises 592
Chapter 8: Load balancing 596
8.1 Introduction 596
8.2 Dynamic load balancing: the Linda legacy 597
8.3 Static Load Balancing: The Divisible Load Theory Approach 599
8.3.1 Modeling Costs 600
8.3.2 Communication Configuration 607
8.3.3 Analysis 610
8.3.3.1 N-Port, Block-Type, Single-Installment Solution 611
8.3.3.2 One-Port, Block-Type, Single-Installment Solution 613
8.3.4 Summary - Short Literature Review 619
8.4 DLTlib: A library for partitioning workloads 622
8.5 Case studies 625
8.5.1 Hybrid Computation of a Mandelbrot Set "Movie": A Case Study in Dynamic Load Balancing 625
8.5.2 Distributed Block Cipher Encryption: A Case Study in Static Load Balancing 638
Appendix A: Compiling Qt programs 650
A.1 Using an IDE 650
A.2 The qmake Utility 650
Appendix B: Running MPI programs 652
B.1 Preparatory Steps 652
B.2 Computing Nodes discovery for MPI Program Deployment 653
B.2.1 Host Discovery with the nmap Utility 653
B.2.2 Automatic Generation of a Hostfile 654
Appendix C: Time measurement 656
C.1 Introduction 656
C.2 POSIX High-Resolution Timing 656
C.3 Timing in Qt 658
C.4 Timing in OpenMP 659
C.5 Timing in MPI 659
C.6 Timing in CUDA 659
Appendix D: Boost.MPI 662
D.1 Mapping from MPI C to Boost.MPI 662
Appendix E: Setting up CUDA 664
E.1 Installation 664
E.2 Issues with GCC 664
E.3 Running CUDA without an Nvidia GPU 665
E.4 Running CUDA on Optimus-Equipped Laptops 666
E.5 Combining CUDA with Third-Party Libraries 667
Appendix F: DLTlib 670
F.1 DLTlib Functions 670
F.1.1 Class Network: Generic Methods 671
F.1.2 Class Network: Query Processing 673
F.1.3 Class Network: Image Processing 674
F.1.4 Class Network: Image Registration 675
F.2 DLTlib Files 678
Glossary 680
Bibliography 682
Index 686

Preface


Parallel computing has been given a fresh breath of life by the emergence of multicore architectures in the first decade of the new century. The new platforms demand a new approach to software development, one that blends the tools and established practices of the network-of-workstations era with emerging software platforms such as CUDA.

This book tries to address this need by covering the dominant contemporary tools and techniques, both in isolation and, most importantly, in combination with each other. We strive to provide examples where multiple platforms and programming paradigms (e.g., message passing and threads) are effectively combined. “Hybrid” computation, as it is usually called, is a new trend in high-performance computing, one that could possibly allow software to scale to the “millions of threads” required for exascale performance.

All chapters are accompanied by extensive examples and practice problems, with an emphasis on putting the techniques to work while comparing alternative design scenarios. All the little details that can make the difference between productive software development and a stressful exercise in futility are presented in an orderly fashion.

The book covers the latest advances in tools that have been inherited from the 1990s (e.g., the OpenMP and MPI standards), but also more cutting-edge platforms, such as the Qt library with its sophisticated thread management and the Thrust template library with its capability to deploy the same software over diverse multicore architectures, including both CPUs and Graphical Processing Units (GPUs).

We could never accomplish the feat of covering all the tools available for multicore development today. Even some of the industry-standard ones, like POSIX threads, are omitted.

Our goal is to sample the dominant paradigms (ranging from OpenMP’s semi-automatic parallelization of sequential code to the explicit communication “plumbing” that underpins MPI), while at the same time explaining the rationale and the how-to behind efficient multicore program development.

What is in this Book


This book can be separated into the following logical units, although no such distinction is made in the text:

• Introduction, designing multicore software: Chapter 1 introduces multicore hardware and examines influential instances of this architectural paradigm. Chapter 1 also introduces speedup and efficiency, which are essential metrics used in the evaluation of multicore and parallel software. Amdahl’s law and Gustafson-Barsis’s rebuttal cap the chapter, providing estimates of what can be expected from the exciting new developments in multicore and many-core hardware (the corresponding formulas are summarized right after this item).
Chapter 2 is all about the methodology and the design patterns that can be employed in the development of parallel and multicore software. Both work decomposition patterns and program structure patterns are examined.
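For quick reference (these are the standard textbook definitions, not excerpts from the book), with T_1 the sequential execution time, T_N the execution time on N processors, f the serial fraction of the sequential run, and s the serial fraction of the parallel run, the quantities mentioned above are:

```latex
S_N = \frac{T_1}{T_N}, \qquad E_N = \frac{S_N}{N}
% Amdahl's law (fixed problem size):
S_N \le \frac{1}{f + \frac{1-f}{N}} \;\xrightarrow{\,N\to\infty\,}\; \frac{1}{f}
% Gustafson-Barsis (scaled speedup, fixed run time):
S_N = s + (1-s)\,N
```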

• Shared-memory programming: Two different approaches for shared-memory parallel programming are examined: explicit and implicit parallelization. On the explicit side, Chapter 3 covers threads and two of the most commonly used synchronization mechanisms, semaphores and monitors. Frequently encountered design patterns, such as producers-consumers and readers-writers, are explained thoroughly and applied in a range of examples. On the implicit side, Chapter 4 covers the OpenMP standard that has been specifically designed for parallelizing existing sequential code with minimum effort. Development time is significantly reduced as a result. There are still complications, such as loop-carried dependencies, which are also addressed.
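To give a flavor of the explicit, semaphore-based style covered in Chapter 3, here is a minimal bounded-buffer producers-consumers sketch. It is a generic illustration rather than code from the book (the book's examples use Qt threads); it relies on C++20 standard-library semaphores, and the buffer size and item count are arbitrary choices:

```cpp
#include <semaphore>
#include <thread>
#include <mutex>
#include <queue>
#include <cstdio>

constexpr int BUFFER_SIZE = 8;    // arbitrary bounded-buffer capacity
constexpr int NUM_ITEMS   = 32;   // arbitrary number of items to exchange

std::queue<int> buffer;
std::mutex bufLock;
std::counting_semaphore<BUFFER_SIZE> slotsFree(BUFFER_SIZE); // empty slots
std::counting_semaphore<BUFFER_SIZE> itemsAvail(0);          // filled slots

void producer() {
    for (int i = 0; i < NUM_ITEMS; ++i) {
        slotsFree.acquire();                                 // wait for a free slot
        { std::lock_guard<std::mutex> g(bufLock); buffer.push(i); }
        itemsAvail.release();                                // signal one item available
    }
}

void consumer() {
    for (int i = 0; i < NUM_ITEMS; ++i) {
        itemsAvail.acquire();                                // wait for an item
        int v;
        { std::lock_guard<std::mutex> g(bufLock); v = buffer.front(); buffer.pop(); }
        slotsFree.release();                                 // free the slot
        std::printf("consumed %d\n", v);
    }
}

int main() {
    std::thread p(producer), c(consumer);
    p.join();
    c.join();
    return 0;
}
```

A sketch like this would be built with, e.g., g++ -std=c++20 -pthread; the semaphores count slots, while the mutex protects the queue itself.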

• Distributed memory programming: Chapter 5 introduces the de facto standard for distributed memory parallel programming, i.e., the Message Passing Interface (MPI). MPI is relevant to multicore programming as it is designed to scale from a shared-memory multicore machine to a million-node supercomputer. As such, MPI provides the foundation for utilizing multiple disjoint multicore machines as a single virtual platform.
The features that are covered include both point-to-point and collective communication, as well as one-sided communication. A section is dedicated to the Boost.MPI library, as it considerably simplifies the use of MPI, although it is not yet feature-complete.
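As a taste of what the explicit message-passing style looks like (a generic two-process sketch, not code taken from the book), the following program sends a single integer from rank 0 to rank 1 with a blocking point-to-point pair; it would typically be built with mpicxx and launched with something like mpirun -np 2 ./a.out:

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);                       // start the MPI runtime

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);         // this process's ID
    MPI_Comm_size(MPI_COMM_WORLD, &size);         // total number of processes

    if (rank == 0 && size > 1) {
        int payload = 42;                         // arbitrary test value
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   // to rank 1, tag 0
    } else if (rank == 1) {
        int payload = 0;
        MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank 1 received %d\n", payload);
    }

    MPI_Finalize();                               // shut the runtime down
    return 0;
}
```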

• GPU programming: GPUs are one of the primary reasons why this book was put together. In a similar fashion to shared-memory programming, we examine the problem of developing GPU-specific software from two perspectives: on one hand we have the “nuts-and-bolts” approach of Nvidia’s CUDA, where memory transfers, data placement, and thread execution configuration have to be carefully planned. CUDA is examined in Chapter 6.
On the other hand, we have the high-level, algorithmic approach of the Thrust template library, which is covered in Chapter 7. The STL-like approach to program design affords Thrust the ability to target both CPU and GPU platforms, a unique feature among the tools we cover.
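To make the contrast between the two approaches concrete, here is a minimal CUDA C++ sketch (not code from the book) of element-wise vector addition, showing the explicit memory transfers and the execution configuration that Chapter 6 dwells on; the array size and block size are arbitrary choices:

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) c[i] = a[i] + b[i];                   // guard against overshoot
}

int main() {
    const int N = 1 << 20;                           // arbitrary problem size
    std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, N * sizeof(float));
    cudaMalloc(&dB, N * sizeof(float));
    cudaMalloc(&dC, N * sizeof(float));
    cudaMemcpy(dA, a.data(), N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), N * sizeof(float), cudaMemcpyHostToDevice);

    int block = 256;                                 // threads per block (arbitrary)
    int grid  = (N + block - 1) / block;             // enough blocks to cover N elements
    vecAdd<<<grid, block>>>(dA, dB, dC, N);

    cudaMemcpy(c.data(), dC, N * sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("c[0] = %f\n", c[0]);                // expect 3.0

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

In Thrust, the same computation collapses to a single thrust::transform call over two device_vectors, which is precisely the kind of abstraction Chapter 7 explores.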

• Load balancing: Chapter 8 is dedicated to an often underestimated aspect of multicore development. In general, load balancing has to be seriously considered once heterogeneous computing resources come into play. For example, a CPU and a GPU constitute such a set of resources, so we should not think only of clusters of dissimilar machines as fitting this requirement. Chapter 8 briefly discusses the Linda coordination language, which can be considered a high-level abstraction of dynamic load balancing.
The main focus is on static load balancing and the mathematical models that can be used to drive load partitioning and data communication sequences.
A well-established methodology known as Divisible Load Theory (DLT) is explained and applied in a number of scenarios. A simple C++ library that implements parts of the DLT research results, which have been published over the past two decades, is also presented.
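As a flavor of the kind of model involved (a deliberately simplified illustration, not the book's derivation), consider a divisible load of size W shared among N workers, where worker i needs w_i time units per unit of load and communication costs are neglected. DLT's optimality principle, that all workers should finish simultaneously, then fixes the load fractions alpha_i:

```latex
\alpha_1 w_1 W = \alpha_2 w_2 W = \dots = \alpha_N w_N W,
\qquad \sum_{i=1}^{N} \alpha_i = 1
\quad\Longrightarrow\quad
\alpha_i = \frac{1/w_i}{\sum_{j=1}^{N} 1/w_j}
```

Chapter 8 develops the full versions of such models, where communication costs, the network configuration (one-port versus N-port), and multiple installments enter the equations.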

Using this Book as a Textbook


The material covered in this book is appropriate for senior undergraduate or postgraduate coursework. The required student background includes programming in C and C++ (both languages are used throughout this book), basic operating system concepts, and at least elementary knowledge of computer architecture.

Depending on the desired focus, an instructor may choose to follow one of the suggested paths listed below. The first two chapters lay the foundations for the other chapters, so they are included in all sequences:

• Emphasis on parallel programming (undergraduate):
  • Chapter 1: Flynn’s taxonomy, contemporary multicore machines, performance metrics. Sections 1.1–1.5.
  • Chapter 2: Design, PCAM methodology, decomposition patterns, program structure patterns. Sections 2.1–2.5.
  • Chapter 3: Threads, semaphores, monitors. Sections 3.1–3.7.
  • Chapter 4: OpenMP basics, work-sharing constructs. Sections 4.1–4.8.
  • Chapter 5: MPI, point-to-point communications, collective operations, object/structure communications, debugging and profiling. Sections 5.1–5.12, 5.15–5.18, 5.20.
  • Chapter 6: CUDA programming model, memory hierarchy, GPU-specific optimizations. Sections 6.1–6.2, 6.7.1, 6.7.3, 6.7.6, 6.9–6.11, 6.12.1.
  • Chapter 7: Thrust basics. Sections 7.1–7.4.
  • Chapter 8: Load balancing. Sections 8.1–8.3.

• Emphasis on multicore programming (undergraduate):
  • Chapter 1: Flynn’s taxonomy, contemporary multicore machines, performance metrics. Sections 1.1–1.5.
  • Chapter 2: Design, PCAM methodology, decomposition patterns, program structure patterns. Sections 2.1–2.5.
  • Chapter 3: Threads, semaphores, monitors. Sections 3.1–3.7.
  • Chapter 4: OpenMP basics, work-sharing constructs, correctness and performance issues. Sections 4.1–4.4.
  • Chapter 5: MPI, point-to-point communications, collective operations, debugging and profiling. Sections 5.1–5.12, 5.16–5.18, 5.21.
  • Chapter 6: CUDA programming model, memory hierarchy, GPU-specific optimizations. Sections 6.1–6.10, ...

Publication date (per publisher): 16.12.2014
Language: English
Subject area: Mathematics / Computer Science > Computer Science > Programming Languages / Tools
Subject area: Mathematics / Computer Science > Computer Science > Theory / Studies
ISBN-10: 0-12-417140-0 / 0124171400
ISBN-13: 978-0-12-417140-4 / 9780124171404
PDF (Adobe DRM)
Size: 25.4 MB

Copy protection: Adobe DRM
Adobe DRM is a copy-protection scheme intended to prevent misuse of the eBook. The eBook is tied to your personal Adobe ID at the time of download and can then be read only on devices that are also registered to that Adobe ID.

File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly well suited to technical books with columns, tables, and figures. A PDF can be displayed on almost any device, but it is of limited use on small screens (smartphone, eReader).

System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You need an Adobe ID and the free Adobe Digital Editions software. We advise against using the OverDrive Media Console, as it is known to cause frequent problems with Adobe DRM.
eReader: This eBook can be read on (almost) all eBook readers, but it is not compatible with the Amazon Kindle.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You need an Adobe ID and a free app.

Buying eBooks from abroad
For tax law reasons we can only sell eBooks within Germany and Switzerland. Regrettably, we cannot fulfill eBook orders from other countries.

EPUB (Adobe DRM)
Size: 33.2 MB
Copy protection: Adobe DRM (as described above)

File format: EPUB (Electronic Publication)
EPUB is an open standard for eBooks and is particularly well suited to fiction and general non-fiction. The reflowable text adapts dynamically to the display and font size, which also makes EPUB a good fit for mobile reading devices.

The system requirements and the restriction to orders from Germany and Switzerland are the same as for the PDF edition above.
