David B. Kirk is well recognized for his contributions to graphics hardware and algorithm research. By the time he began his studies at Caltech, he had already earned B.S. and M.S. degrees in mechanical engineering from MIT and worked as an engineer for Raster Technologies and Hewlett-Packard's Apollo Systems Division. After receiving his doctorate, he joined Crystal Dynamics, a video-game manufacturing company, as chief scientist and head of technology. In 1997, he took the position of Chief Scientist at NVIDIA, a leader in visual computing technologies, and he is currently an NVIDIA Fellow. At NVIDIA, Kirk led graphics-technology development for some of today's most popular consumer-entertainment platforms, playing a key role in providing mass-market graphics capabilities previously available only on workstations costing hundreds of thousands of dollars. For his role in bringing high-performance graphics to personal computers, Kirk received the 2002 Computer Graphics Achievement Award from the Association for Computing Machinery's Special Interest Group on Computer Graphics and Interactive Techniques (ACM SIGGRAPH) and, in 2006, was elected to the National Academy of Engineering, one of the highest professional distinctions for engineers. Kirk holds 50 patents and patent applications relating to graphics design, has published more than 50 articles on graphics technology, has won several best-paper awards, and edited the book Graphics Gems III. A technological 'evangelist' who cares deeply about education, he has supported new curriculum initiatives at Caltech and has been a frequent university lecturer and conference keynote speaker worldwide.
Programming Massively Parallel Processors: A Hands-on Approach, Second Edition, teaches students how to program massively parallel processors. It offers a detailed discussion of various techniques for constructing parallel programs. Case studies are used to demonstrate the development process, which begins with computational thinking and ends with effective and efficient parallel programs. This guide shows students and professionals alike the basic concepts of parallel programming and GPU architecture. Topics of performance, floating-point format, parallel patterns, and dynamic parallelism are covered in depth. This revised edition contains more parallel programming examples, commonly used libraries such as Thrust, and explanations of the latest tools. It also provides new coverage of CUDA 5.0, improved performance, enhanced development tools, increased hardware support, and more; increased coverage of the related technology OpenCL, along with new material on algorithm patterns, GPU clusters, host programming, and data parallelism; and two new case studies (on MRI reconstruction and molecular visualization) that explore the latest applications of CUDA and GPUs for scientific research and high-performance computing. This book should be a valuable resource for advanced students, software engineers, programmers, and hardware engineers.
Front Cover
Programming Massively Parallel Processors
Copyright Page
Contents
Preface
Target Audience
How to Use the Book
A Three-Phased Approach
Tying It All Together: The Final Project
Project Workshop
Design Document
Project Report
Online Supplements
Acknowledgements
Dedication
1 Introduction
1.1 Heterogeneous Parallel Computing
1.2 Architecture of a Modern GPU
1.3 Why More Speed or Parallelism?
1.4 Speeding Up Real Applications
1.5 Parallel Programming Languages and Models
1.6 Overarching Goals
1.7 Organization of the Book
References
2 History of GPU Computing
2.1 Evolution of Graphics Pipelines
The Era of Fixed-Function Graphics Pipelines
Evolution of Programmable Real-Time Graphics
Unified Graphics and Computing Processors
2.2 GPGPU: An Intermediate Step
2.3 GPU Computing
Scalable GPUs
Recent Developments
Future Trends
References and Further Reading
3 Introduction to Data Parallelism and CUDA C
3.1 Data Parallelism
3.2 CUDA Program Structure
3.3 A Vector Addition Kernel
3.4 Device Global Memory and Data Transfer
3.5 Kernel Functions and Threading
3.6 Summary
Function Declarations
Kernel Launch
Predefined Variables
Runtime API
3.7 Exercises
References
4 Data-Parallel Execution Model
4.1 CUDA Thread Organization
4.2 Mapping Threads to Multidimensional Data
4.3 Matrix-Matrix Multiplication: A More Complex Kernel
4.4 Synchronization and Transparent Scalability
4.5 Assigning Resources to Blocks
4.6 Querying Device Properties
4.7 Thread Scheduling and Latency Tolerance
4.8 Summary
4.9 Exercises
5 CUDA Memories
5.1 Importance of Memory Access Efficiency
5.2 CUDA Device Memory Types
5.3 A Strategy for Reducing Global Memory Traffic
5.4 A Tiled Matrix–Matrix Multiplication Kernel
5.5 Memory as a Limiting Factor to Parallelism
5.6 Summary
5.7 Exercises
6 Performance Considerations
6.1 Warps and Thread Execution
6.2 Global Memory Bandwidth
6.3 Dynamic Partitioning of Execution Resources
6.4 Instruction Mix and Thread Granularity
6.5 Summary
6.6 Exercises
References
7 Floating-Point Considerations
7.1 Floating-Point Format
Normalized Representation of M
Excess Encoding of E
7.2 Representable Numbers
7.3 Special Bit Patterns and Precision in IEEE Format
7.4 Arithmetic Accuracy and Rounding
7.5 Algorithm Considerations
7.6 Numerical Stability
7.7 Summary
7.8 Exercises
References
8 Parallel Patterns: Convolution
8.1 Background
8.2 1D Parallel Convolution: A Basic Algorithm
8.3 Constant Memory and Caching
8.4 Tiled 1D Convolution with Halo Elements
8.5 A Simpler Tiled 1D Convolution: General Caching
8.6 Summary
8.7 Exercises
9 Parallel Patterns: Prefix Sum
9.1 Background
9.2 A Simple Parallel Scan
9.3 Work Efficiency Considerations
9.4 A Work-Efficient Parallel Scan
9.5 Parallel Scan for Arbitrary-Length Inputs
9.6 Summary
9.7 Exercises
Reference
10 Parallel Patterns: Sparse Matrix–Vector Multiplication
10.1 Background
10.2 Parallel SpMV Using CSR
10.3 Padding and Transposition
10.4 Using Hybrid to Control Padding
10.5 Sorting and Partitioning for Regularization
10.6 Summary
10.7 Exercises
References
11 Application Case Study: Advanced MRI Reconstruction
11.1 Application Background
11.2 Iterative Reconstruction
11.3 Computing FHD
Step 1: Determine the Kernel Parallelism Structure
Step 2: Getting Around the Memory Bandwidth Limitation
Step 3: Using Hardware Trigonometry Functions
Step 4: Experimental Performance Tuning
11.4 Final Evaluation
11.5 Exercises
References
12 Application Case Study: Molecular Visualization and Analysis
12.1 Application Background
12.2 A Simple Kernel Implementation
12.3 Thread Granularity Adjustment
12.4 Memory Coalescing
12.5 Summary
12.6 Exercises
References
13 Parallel Programming and Computational Thinking
13.1 Goals of Parallel Computing
13.2 Problem Decomposition
13.3 Algorithm Selection
13.4 Computational Thinking
13.5 Summary
13.6 Exercises
References
14 An Introduction to OpenCL™
14.1 Background
14.2 Data Parallelism Model
14.3 Device Architecture
14.4 Kernel Functions
14.5 Device Management and Kernel Launch
14.6 Electrostatic Potential Map in OpenCL
14.7 Summary
14.8 Exercises
References
15 Parallel Programming with OpenACC
15.1 OpenACC Versus CUDA C
15.2 Execution Model
15.3 Memory Model
15.4 Basic OpenACC Programs
Parallel Construct
Parallel Region, Gangs, and Workers
Loop Construct
Gang Loop
Worker Loop
OpenACC Versus CUDA
Vector Loop
Kernels Construct
Prescriptive Versus Descriptive
Ways to Help an OpenACC Compiler
Data Management
Data Clauses
Data Construct
Asynchronous Computation and Data Transfer
15.5 Future Directions of OpenACC
15.6 Exercises
16 Thrust: A Productivity-Oriented Library for CUDA
16.1 Background
16.2 Motivation
16.3 Basic Thrust Features
Iterators and Memory Space
Interoperability
16.4 Generic Programming
16.5 Benefits of Abstraction
16.6 Programmer Productivity
Robustness
Real-World Performance
16.7 Best Practices
Fusion
Structure of Arrays
Implicit Ranges
16.8 Exercises
References
17 CUDA FORTRAN
17.1 CUDA FORTRAN and CUDA C Differences
17.2 A First CUDA FORTRAN Program
17.3 Multidimensional Array in CUDA FORTRAN
17.4 Overloading Host/Device Routines With Generic Interfaces
17.5 Calling CUDA C Via iso_c_binding
17.6 Kernel Loop Directives and Reduction Operations
17.7 Dynamic Shared Memory
17.8 Asynchronous Data Transfers
17.9 Compilation and Profiling
17.10 Calling Thrust from CUDA FORTRAN
17.11 Exercises
18 An Introduction to C++ AMP
18.1 Core C++ AMP Features
18.2 Details of the C++ AMP Execution Model
Explicit and Implicit Data Copies
Asynchronous Operation
Section Summary
18.3 Managing Accelerators
18.4 Tiled Execution
18.5 C++ AMP Graphics Features
18.6 Summary
18.7 Exercises
19 Programming a Heterogeneous Computing Cluster
19.1 Background
19.2 A Running Example
19.3 MPI Basics
19.4 MPI Point-to-Point Communication Types
19.5 Overlapping Computation and Communication
19.6 MPI Collective Communication
19.7 Summary
19.8 Exercises
Reference
20 CUDA Dynamic Parallelism
20.1 Background
20.2 Dynamic Parallelism Overview
20.3 Important Details
Launch Environment Configuration
API Errors and Launch Failures
Events
Streams
Synchronization Scope
20.4 Memory Visibility
Global Memory
Zero-Copy Memory
Constant Memory
Local Memory
Shared Memory
Texture Memory
20.5 A Simple Example
20.6 Runtime Limitations
Memory Footprint
Nesting Depth
Memory Allocation and Lifetime
ECC Errors
Streams
Events
Launch Pool
20.7 A More Complex Example
Linear Bezier Curves
Quadratic Bezier Curves
Bezier Curve Calculation (Predynamic Parallelism)
Bezier Curve Calculation (with Dynamic Parallelism)
20.8 Summary
Reference
21 Conclusion and Future Outlook
21.1 Goals Revisited
21.2 Memory Model Evolution
21.3 Kernel Execution Control Evolution
21.4 Core Performance
21.5 Programming Environment
21.6 Future Outlook
References
Appendix A: Matrix Multiplication Host-Only Version Source Code
A.1 matrixmul.cu
A.2 matrixmul_gold.cpp
A.3 matrixmul.h
A.4 assist.h
A.5 Expected Output
Appendix B: GPU Compute Capabilities
B.1 GPU Compute Capability Tables
B.2 Memory Coalescing Variations
Index
Chapter 1
Introduction
Chapter Outline
1.1 Heterogeneous Parallel Computing
1.2 Architecture of a Modern GPU
1.3 Why More Speed or Parallelism?
1.4 Speeding Up Real Applications
1.5 Parallel Programming Languages and Models
1.6 Overarching Goals
1.7 Organization of the Book
Microprocessors based on a single central processing unit (CPU), such as those in the Intel Pentium family and the AMD Opteron family, drove rapid performance increases and cost reductions in computer applications for more than two decades. These microprocessors brought GFLOPS, or giga (10^9) floating-point operations per second, to the desktop and TFLOPS, or tera (10^12) floating-point operations per second, to cluster servers. This relentless drive for performance improvement has allowed application software to provide more functionality, have better user interfaces, and generate more useful results. Users, in turn, demand even more improvements once they become accustomed to the existing ones, creating a positive (virtuous) cycle for the computer industry.
This drive, however, has slowed since 2003 due to energy consumption and heat dissipation issues that limited the increase of the clock frequency and the level of productive activities that can be performed in each clock period within a single CPU. Since then, virtually all microprocessor vendors have switched to models where multiple processing units, referred to as processor cores, are used in each chip to increase the processing power. This switch has exerted a tremendous impact on the software developer community [Sutter2005].
Traditionally, the vast majority of software applications have been written as sequential programs, as described by von Neumann in his seminal report in 1945 [vonNeumann1945]. The execution of these programs can be understood by a human sequentially stepping through the code. Historically, most software developers have relied on advances in hardware to increase the speed of their sequential applications under the hood; the same software simply runs faster as each new generation of processors is introduced. Computer users have also come to expect that these programs will run faster with each new generation of microprocessors. This expectation is no longer valid. A sequential program will run on only one of the processor cores, which will not become significantly faster than those in use today. Without performance improvement, application developers will no longer be able to introduce new features and capabilities into their software as new microprocessors are introduced, reducing the growth opportunities of the entire computer industry.
Rather, the applications software that will continue to enjoy performance improvement with each new generation of microprocessors will be parallel programs, in which multiple threads of execution cooperate to complete the work faster. This new, dramatically escalated incentive for parallel program development has been referred to as the concurrency revolution [Sutter2005]. The practice of parallel programming is by no means new. The high-performance computing community has been developing parallel programs for decades. These programs run on large-scale, expensive computers. Only a few elite applications can justify the use of these expensive computers, thus limiting the practice of parallel programming to a small number of application developers. Now that all new microprocessors are parallel computers, the number of applications that need to be developed as parallel programs has increased dramatically. There is now a great need for software developers to learn about parallel programming, which is the focus of this book.
1.1 Heterogeneous Parallel Computing
Since 2003, the semiconductor industry has settled on two main trajectories for designing microprocessors [Hwu2008]. The multicore trajectory seeks to maintain the execution speed of sequential programs while moving into multiple cores. Multicore processors began as two-core designs, and the number of cores has increased with each semiconductor process generation. A current exemplar is the recent Intel Core i7™ microprocessor with four processor cores, each of which is an out-of-order, multiple-instruction-issue processor implementing the full x86 instruction set, supporting hyperthreading with two hardware threads, and designed to maximize the execution speed of sequential programs. In contrast, the many-thread trajectory focuses more on the execution throughput of parallel applications. Many-thread processors began with a large number of threads, and once again, the number of threads increases with each generation. A current exemplar is the NVIDIA GTX680 graphics processing unit (GPU) with 16,384 threads, executing in a large number of simple, in-order pipelines.
Many-thread processors, especially GPUs, have led the race in floating-point performance since 2003. As of 2012, the ratio of peak floating-point calculation throughput between many-thread GPUs and multicore CPUs is about 10 to 1. These are not necessarily application speeds; they are merely the raw speeds that the execution resources in these chips can potentially support: 1.5 teraflops versus 150 gigaflops in double precision in 2012.
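The ratio quoted here follows directly from the two peak figures. As a quick check, a minimal Python sketch (the variable names are illustrative and not from the book):

```python
# Peak double-precision throughput figures quoted in the text for 2012.
gpu_peak_flops = 1.5e12   # 1.5 teraflops (many-thread GPU)
cpu_peak_flops = 150e9    # 150 gigaflops (multicore CPU)

# Ratio of raw peak throughput; this says nothing about real application speed.
ratio = gpu_peak_flops / cpu_peak_flops
print(f"GPU/CPU peak throughput ratio: {ratio:.0f}")
```

As the text stresses, this compares raw execution resources only; an actual application may see a very different speedup.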
Such a large performance gap between parallel and sequential execution has amounted to a significant “electrical potential” build-up, and at some point, something will have to give. We have reached that point now. To date, this large performance gap has already motivated many application developers to move the computationally intensive parts of their software to GPUs for execution. Not surprisingly, these computationally intensive parts are also the prime target of parallel programming—when there is more work to do, there is more opportunity to divide the work among cooperating parallel workers.
One might ask why there is such a large peak-performance gap between many-thread GPUs and general-purpose multicore CPUs. The answer lies in the differences in the fundamental design philosophies between the two types of processors, as illustrated in Figure 1.1. The design of a CPU is optimized for sequential code performance. It makes use of sophisticated control logic to allow instructions from a single thread to execute in parallel or even out of their sequential order while maintaining the appearance of sequential execution. More importantly, large cache memories are provided to reduce the instruction and data access latencies of large complex applications. Neither control logic nor cache memories contribute to the peak calculation speed. As of 2012, high-end general-purpose multicore microprocessors typically have six to eight large processor cores and multiple megabytes of on-chip cache memories designed to deliver strong sequential code performance.
Figure 1.1 CPUs and GPUs have fundamentally different design philosophies.
Memory bandwidth is another important issue. The speed of many applications is limited by the rate at which data can be delivered from the memory system into the processors. Graphics chips have been operating at approximately six times the memory bandwidth of contemporaneously available CPU chips. In late 2006, the GeForce 8800 GTX, or simply G80, was capable of moving data at about 85 gigabytes per second (GB/s) in and out of its main dynamic random-access memory (DRAM) because of graphics frame buffer requirements and the relaxed memory model (the way various system software, applications, and input/output (I/O) devices expect their memory accesses to work). The more recent GTX680 chip supports about 200 GB/s. In contrast, general-purpose processors have to satisfy requirements from legacy operating systems, applications, and I/O devices that make memory bandwidth more difficult to increase. As a result, CPUs will continue to be at a disadvantage in terms of memory bandwidth for some time.
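To make the bandwidth limit concrete, the following Python sketch estimates a lower bound on the time needed just to move a data set through memory at the two bandwidths quoted above. The 4 GB workload and the helper name `min_transfer_time_s` are hypothetical, and the sketch assumes 1 GB = 10^9 bytes and that the peak bandwidth is fully sustained, which real applications rarely achieve.

```python
def min_transfer_time_s(num_bytes: float, bandwidth_gb_per_s: float) -> float:
    """Lower bound on the time to move num_bytes at the given peak bandwidth."""
    return num_bytes / (bandwidth_gb_per_s * 1e9)

# Hypothetical workload: stream 4 GB of data through the processor once.
data_bytes = 4e9

# Bandwidth figures are the approximate values quoted in the text.
for name, bandwidth in [("G80, about 85 GB/s", 85.0),
                        ("GTX680, about 200 GB/s", 200.0)]:
    time_ms = min_transfer_time_s(data_bytes, bandwidth) * 1e3
    print(f"{name}: at least {time_ms:.1f} ms")
```

No amount of arithmetic hardware can make such a bandwidth-bound computation finish faster than this bound, which is why memory bandwidth matters as much as peak floating-point throughput.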
The design philosophy of GPUs is shaped by the fast-growing video game industry, which exerts tremendous economic pressure for the ability to perform a massive number of floating-point calculations per video frame in advanced games. This demand motivates GPU vendors to look for ways to maximize the chip area and power budget dedicated to floating-point calculations. The prevailing solution is to optimize for the execution throughput of massive numbers of threads. The design saves chip area and power by allowing pipelined memory channels and arithmetic operations to have long latency. The reduced area and power of the memory access hardware and arithmetic units allow the designers to have more of them on a chip and thus increase the total execution throughput.
The application software is expected to be written with a large number of parallel threads. The hardware takes advantage of the large number of threads to find work to do when some of them are waiting for long-latency memory accesses or arithmetic operations. Small cache memories are provided to help control the bandwidth requirements of these applications so that multiple threads that access the same memory data do not need to all go to the DRAM. This design style is commonly referred to as throughput-oriented design since it strives to maximize the total execution throughput of a large number of threads while allowing individual threads to take a potentially much longer time to execute.
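The latency-hiding idea can be illustrated with a deliberately simplified toy model, which is an assumption of this sketch and not taken from the book: each thread alternates between a fixed number of compute cycles and a fixed number of stall cycles waiting on memory, and one execution unit can switch to any ready thread at no cost. The function names and the cycle counts (10 compute, 400 stall) are hypothetical.

```python
def utilization(num_threads: int, compute_cycles: int, stall_cycles: int) -> float:
    """Fraction of cycles the execution unit is busy in the toy model."""
    # Each thread needs the unit for compute_cycles out of every period.
    period = compute_cycles + stall_cycles
    return min(1.0, num_threads * compute_cycles / period)

def threads_to_saturate(compute_cycles: int, stall_cycles: int) -> int:
    """Smallest thread count that keeps the unit busy every cycle."""
    period = compute_cycles + stall_cycles
    return -(-period // compute_cycles)  # ceiling division

# Hypothetical numbers: 10 cycles of arithmetic per 400-cycle memory wait.
compute, stall = 10, 400
needed = threads_to_saturate(compute, stall)
print(f"1 thread keeps the unit busy {utilization(1, compute, stall):.1%} of the time")
print(f"{needed} threads are needed to keep it fully busy")
```

In this model a single thread leaves the execution unit idle most of the time, while a few dozen threads keep it fully occupied even though each individual thread still takes just as long to finish. That is the trade-off a throughput-oriented design makes.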
The CPUs, on the other hand, are designed to minimize the execution latency of a single thread....
Publication date (per publisher) | 31 December 2012 |
---|---|
Language | English |
Subject area | Mathematics / Computer Science ► Computer Science ► Programming Languages / Tools |
 | Mathematics / Computer Science ► Computer Science ► Theory / Studies |
 | Technology ► Electrical Engineering / Power Engineering |
ISBN-10 | 0-12-391418-3 / 0123914183 |
ISBN-13 | 978-0-12-391418-7 / 9780123914187 |