Jason D. Bakos is a professor of Computer Science and Engineering at the University of South Carolina. He received a BS in Computer Science from Youngstown State University in 1999 and a PhD in Computer Science from the University of Pittsburgh in 2005. Dr. Bakos's research focuses on mapping data- and compute-intensive codes to high-performance, heterogeneous, reconfigurable, and embedded computer systems. His group works closely with FPGA-based computer manufacturers Convey Computer Corporation, GiDEL, and Annapolis Micro Systems, as well as GPU and DSP manufacturers NVIDIA, Texas Instruments, and Advantech. Dr. Bakos holds two patents, has published over 30 refereed publications in computer architecture and high-performance computing, was a winner of the ACM/DAC student design contest in 2002 and 2004, and received the US National Science Foundation (NSF) CAREER award in 2009. He is currently serving as associate editor for ACM Transactions on Reconfigurable Technology and Systems.
Embedded Systems: ARM Programming and Optimization combines an exploration of the ARM architecture with an examination of the facilities offered by the Linux operating system to explain how various features of program design can influence processor performance. It demonstrates methods by which a programmer can optimize program code in a way that does not impact its behavior but improves its performance. Several applications, including image transformations, fractal generation, image convolution, and computer vision tasks, are used to describe and demonstrate these methods. From this, the reader will gain insight into computer architecture and application design, as well as practical knowledge of embedded software design for modern embedded systems.

- Covers two ARM instruction set architectures, ARMv6 and ARMv7-A, as well as three ARM cores: the ARM11 on the Raspberry Pi, the Cortex-A9 on the Xilinx Zynq 7020, and the Cortex-A15 on the NVIDIA Tegra K1
- Describes how to fully leverage the facilities offered by the Linux operating system, including the Linux GCC compiler toolchain and debug tools, performance monitoring support, the OpenMP multicore runtime environment, the video frame buffer, and video capture capabilities
- Designed to accompany and work with most of the low-cost Linux/ARM embedded development boards currently available
Multicore and data-level optimization
OpenMP and SIMD
Abstract
Embedded processors have much in common with desktop and server processors. Like desktop and server processors, mobile embedded systems contain multiple processor cores, but code must be explicitly written to utilize all available processors. Also like desktop and server processors, each core provides a single-instruction, multiple-data (SIMD) facility that allows one instruction to process multiple elements of data, but this too generally requires specific code features to use. Unlike desktop and server processors, embedded processors cannot automatically execute instructions in parallel unless the instructions appear in a favorable order in the software. Together, these aspects of program design can have a substantial performance impact for computationally expensive applications.
This chapter introduces how various program structures affect the degree to which a program can utilize critical system resources such as functional units and memory bandwidth. For each of these, the chapter describes how code optimizations incorporated into the program code can recover lost performance. Understanding how to write and evaluate these types of optimizations is becoming increasingly important for embedded software. Traditional multimedia algorithms such as video decoding are based on well-refined standards and rarely change, but in modern times users have come to expect increasingly advanced algorithms for advanced image processing, such as panoramic image stitching, augmented reality, facial recognition, and object classification. These algorithms are computationally demanding, and often their practicality depends on how efficiently they can be implemented.
Keywords
Multicore
OpenMP
SIMD
Data level parallelism
Instruction level parallelism
Instruction scheduling
ARM NEON
ARM VFP
Floating point
Chapter Outline
2.1 Optimization Techniques Covered by this Book
2.2 Amdahl's Law
2.3 Test Kernel: Polynomial Evaluation
2.4 Using Multiple Cores: OpenMP
2.4.1 OpenMP Directives
2.4.2 Scope
2.4.3 Other OpenMP Directives
2.4.4 OpenMP Synchronization
2.4.4.1 Critical Sections
2.4.4.2 Locks
2.4.4.3 Barriers
2.4.4.4 Atomic Sections
2.4.5 Debugging OpenMP Code
2.4.6 The OpenMP Parallel for Pragma
2.4.7 OpenMP with Performance Counters
2.4.8 OpenMP Support for the Horner Kernel
2.5 Performance Bounds
2.6 Performance Analysis
2.7 Inline Assembly Language in GCC
2.8 Optimization #1: Reducing Instructions per Flop
2.9 Optimization #2: Reducing CPI
2.9.1 Software Pipelining
2.9.2 Software Pipelining Horner's Method
2.10 Optimization #3: Multiple Flops per Instruction with Single Instruction, Multiple Data
2.10.1 ARM11 VFP Short Vector Instructions
2.10.2 ARM Cortex NEON Instructions
2.10.3 NEON Intrinsics
2.11 Chapter Wrap-Up
Exercises
Desktop and server processors contain many features designed to achieve maximum exploitation of instruction-level parallelism and memory locality at runtime, often without regard to the cost of these features in terms of chip area or power consumption. Their design places the highest emphasis on superscalar out-of-order speculative execution, accurate branch prediction, and extremely sophisticated multilevel caches. This allows them to perform well even when executing code that was not written with performance in mind.
On the other hand, embedded processor design emphasizes energy efficiency over performance, so designers generally forego these features in exchange for on-chip peripherals and specialized coprocessors for specific tasks. Because of this, embedded processor performance is more sensitive to code optimizations than that of desktop and server processors. Code optimizations include any features of the program code that are specifically designed to improve performance. They can be added by the compiler, a tool that automatically transforms code, or by the programmer, and can be processor agnostic, such as eliminating redundant code, or processor specific, such as substituting a single complex instruction for a sequence of simple instructions.
Conceptually, the process of optimizing code often starts with a naïve implementation: a serial but functionally correct implementation of the program. The programmer must then identify its kernels, the portions of code in which the most execution time is spent. After this, the programmer must identify the performance bottleneck of each kernel and transform the kernel code in a way that improves performance without changing its underlying computation. These changes generally involve removing code redundancy, exploiting parallelism, taking advantage of hardware features, or sacrificing numerical accuracy, precision, or dynamic range in favor of performance.
2.1 Optimization Techniques Covered by this Book
This chapter will cover two programmer-driven optimization techniques:
1. Using Assembly Language to Improve Instruction Level Parallelism and Reduce Compiler Overheads
In some situations hand-written assembly code offers performance advantages over compiler-generated assembly code. You should not expect hand-written assembly to always outperform compiler-generated code. In fact, the automatic optimizers built into modern compilers are usually very effective for integer code, but hand-written assembly is often more effective for intensive floating-point kernels.
2. Multicore Parallelism
Even server processors cannot automatically distribute the workload of a program onto multiple concurrent processor cores. The programmer must add explicit support for multicore into the program code, and is also responsible for verifying that the code is free from concurrency errors such as data races and data sharing errors. Even then, achieving high multicore efficiency is difficult, but is becoming increasingly important in embedded system programming.
The following chapters cover additional topics in program optimization, including:
1. Fixed-Point Arithmetic
Floating-point instructions are usually more costly than integer instructions, but are often unnecessary for multimedia and sensing applications. A fixed-point representation allows integers to represent fractional numbers at the cost of reduced dynamic range compared to floating point. Most high-level languages, including C/C++ and Java, lack native support for fixed point, so the programmer must include explicit support for fixed-point operations.
2. Loop Transformations
Cache performance is associated with a program's memory access locality, but in some cases the locality can be improved without changing the functionality of the program. This usually involves transforming the structure of loops, such as in loop tiling, where the programmer adds additional levels of loops to change the order in which program data is accessed.
3. Heterogeneous Computing
Many embedded systems, and even systems-on-a-chip, include integrated coprocessors such as Graphical Processor Units, Digital Signal Processors, or Field Programmable Gate Arrays that can perform specific types of computations faster than the general purpose...
Publication date (per publisher) | 3.9.2015
---|---
Language | English
ISBN-10 | 0-12-800412-6 / 0128004126
ISBN-13 | 978-0-12-800412-8 / 9780128004128