A parallel implementation on modern hardware for geo-electrical tomographical software - Pdf 10

ĐẠI HỌC QUỐC GIA HÀ NỘI
TRƯỜNG ĐẠI HỌC CÔNG NGHỆ
Nguyễn Hoàng Vũ
A PARALLEL IMPLEMENTATION ON
MODERN HARDWARE FOR GEO-ELECTRICAL
TOMOGRAPHICAL SOFTWARE
KHOÁ LUẬN TỐT NGHIỆP ĐẠI HỌC HỆ CHÍNH QUY
Ngành: Công nghệ thông tin
HÀ NỘI – 2010
ĐẠI HỌC QUỐC GIA HÀ NỘI
TRƯỜNG ĐẠI HỌC CÔNG NGHỆ
Nguyễn Hoàng Vũ
A PARALLEL IMPLEMENTATION ON
MODERN HARDWARE FOR GEO-ELECTRICAL
TOMOGRAPHICAL SOFTWARE
KHOÁ LUẬN TỐT NGHIỆP ĐẠI HỌC HỆ CHÍNH QUY
Ngành: Công nghệ thông tin
Cán bộ hướng dẫn: PGS. TSKH. Phạm Huy Điển
Cán bộ đồng hướng dẫn: TS. Đoàn Văn Tuyến
HÀ NỘI – 2010
ABSTRACT
Geo-electrical tomographical software plays a crucial role in geophysical research.
However, imported software is expensive and does not provide much customizability,
which is essential for more advanced geophysical study. Besides, these programs are
unable to exploit the full potential of modern hardware, so the running time is
inadequate for large-scale geophysical surveys. It is therefore an essential task to
develop domestic software for overcoming all these problems. The development of this
software is based on our research in using parallel programming on modern multi-core
processors and stream processors for high performance computing. While this project
with its inter-disciplinary aspect poses many challenges, it has also enabled us to gain
valuable insights in making scientific software and especially the new field of personal

REFERENCES 50
List of Acronyms
CPU Central Processing Unit
CUDA Compute Unified Device Architecture
GPU Graphical Processing Unit
OpenMP Open Multi Processing
OpenCL Open Computing Language
TBB Intel Threading Building Blocks

INTRODUCTION
Geophysical methods are based on studying the propagation of the different
physical fields within the earth’s interior. One of the most widely used fields in
geophysics is the electromagnetic field generated by natural or artificial (controlled)
sources. Electromagnetic methods comprise one of the three principle technologies in
applied geophysics (the other two being seismic methods and potential field methods).
There are many geo-electromagnetic methods currently used in the world. Of these
electromagnetic methods, resistivity tomography is the most widely used and it is of
major interest in our work.
Resistivity tomography [17] or resistivity imaging is a method used in exploration
geophysics [18] to measure underground physical properties in mineral, hydrocarbon,
ground water or even archaeological exploration. It is closely related to the medical
imaging technique called electrical impedance tomography (EIT), and mathematically is
the same inverse problem. In contrast to medical EIT however, resistivity tomography is
essentially a direct current method. This method is relatively new compared to other
geophysical methods. Since the 1970s, extensive research has been done on the
inversion theory for this method and it is still an active research field today. A detailed
historical description can be seen in [27].

Resistivity tomography surveys searching for oil and gas (left) or
water (right)

only in large institutions, getting access to clusters is also inconvenient. Clusters are not
suitable for field trip as well because of difficulties in transportation and power supply.
Exploiting the parallel capabilities of modern hardware is therefore a must to enable
2
cost-effective scientific computing on desktop systems for such problems. This can help
reduce hardware cost, power consumption and increase user convenience and software
development productivity. These benefits are especially valuable to scientific software
customers in Vietnam where cluster deployment is costly in both money and human
resources.

3
Chapter 1 High Performance Computing on Modern
Hardware
1.1 An overview of modern parallel architectures
Computer speed is crucial in most software, especially scientific applications. As
a result, computer designers have always looked for mechanisms to improve hardware
performance. Processor speed and packaging densities have been enhanced greatly over
the past decades. However, due to the physical limitations of electronic components,
other mechanisms have been introduced to improve hardware performance.
According to [1], the objectives of architectural acceleration mechanisms are to
• decrease latency, the time from start to completion of an operation;
• increase bandwidth, the width and rate of operations.
Direct hardware implementations of expensive operations help reduce execution
latency. Memory latency has been improved with larger register files, multiple register
sets and caches, which exploit the spatial and temporal locality of reference in the
program.
With the bandwidth problem, the solutions can be classified into two forms of
parallelism: pipelining and replication.
Pipelining [22] divides an operation into different stages to enable the concurrent
execution of these stages for a stream of operations. If all of the stages of the pipeline

The second kind of instruction-level parallel architecture is VLIW (very long
instruction word) architectures. A very long instruction word usually controls 5 to 30
replicated execution units. An example of VLIW architecture is the Intel Itanium
processor [23]. As of 2009, Itanium processors can execute up to six instructions per
cycle. For ordinary architectures, superscalar execution and out-of-order execution is
5
used to speed up computing. This increases hardware complexity. The processor must
decide at runtime whether instruction parts are independent so that they can be executed
simultaneously. In VLIW architectures, this is decided at compile time. This shifts the
hardware complexity to software complexity. All operations in one instruction must be
independent so efficient code generation is a hard task for compilers. The problem of
writing compilers and porting legacy software to the new architectures make the Itanium
architecture unpopular.
1.1.2 Process-Level Parallel Architectures
Process-level parallel architectures are architectures that exploit coarse-grained
control parallelism in loops, functions or complete programs. They replicate complete
asynchronously executing processors to increase execution bandwidth and, hence, fit the
multiple-instruction-multiple-data (MIMD) paradigm.
Until a few years ago, these architectures comprised of multiprocessors and
multicomputers.
A multiprocessor uses a shared memory address space for all processors. There
are two kinds of multiprocessors:
• Symmetric Multiprocessor or SMP computers: the cost of accessing an
address in memory is the same for each processor. Furthermore, the
processors are all equal in the eyes of the operation system.
• Non-uniform Memory Architecture or NUMA computers: the cost of
accessing a given address in memory varies from one processor to
another.
In a multicomputer, each processor has its own local memory. Access to remote
memory requires explicit message passing over the interconnection network. They are

generations may have 8, 16 or even 32 cores in the next few years. These new
architectures, especially in multi-processor node, can provide the level of parallelism
that has been only available to cluster systems.
Figure 4 Intel Gulftown CPU .
1.1.3 Data parallel architectures
Data parallel architectures appeared very soon on the history of computing. They
utilize data parallelism to increase execution bandwidth. Data parallelism is common in
many scientific and engineering tasks where a single operation is applied to a whole data
set, usually a vector or a matrix. This allows applications to exhibit a large amount of
independent parallel workloads. Both pipelining and replication have been applied to
hardware to utilize data parallelism.
Pipelined vector processors such as the Cray 1 [15], operates on vectors rather
than scalar. After the instruction is decoded, vectors of data stream directly from
8
memory into the pipelined functional units. Separate pipelines can be chained together
to get higher performance. The translation of sequential code into vector instructions is
called vectorization. A vectorizing compiler played a crucial role in programming for
vector processors. This has significantly pushed the maturity of compilers in generating
efficient parallel code.
Through replication, processor arrays can utilize data parallelism as a single
control unit can order a large number of simple processing elements to operate the same
instruction on different data elements. These massively parallel supercomputers fit into
the single-instruction-multiple-data (SIMD) paradigm.
Although both of the kinds of supercomputers mentioned above have virtually
disappeared from common use, they are precursors for current data parallel
architectures, most notably the CPU SIMD processing and GPUs.
The CPU SIMD extension instruction set for Intel CPUs include MMX, SSE,
SSE2, SSE3, SSE4 and AVX. They allow the CPU to use a single operation to operate
on several data elements simultaneously. AVX, the latest extension instruction set is
expected to be implemented on both Intel and AMD products in 2010 and 2011. With

The Radeon 5870 processor has 20 SIMD engines, each of which has 16 thread
processors inside of it. Each of those thread processors has five arithmetic logic units, or
ALUs. With a total of 1600 stream processors and a clock speed of 850 MHz, Radeon
5870 has the single-precision computing power of 2.72 Tflops while top of the line CPU
still has processing power counted in Gflops. Double-precision computing is done at one
fifth of the rate for single-precision, at 544 Gflops. This card supports both OpenCL and
DirectCompute. The double version, the Radeon 5970 (Hemlock) dual graphics
processor has a single-precision computing power of 4.7 Tflops in a graphics card at a
thermal envelope of less than 300 W. Custom over clocked versions made by graphics
card manufacturer can even offer much more computing power than the original version.

Figure 6 ATI Radeon 5870 (Cypress) graphics processor
The Nvidia GF100 processor has 3 billion transistors with 15 SM (Shader
Multiprocessor) units, each has 32 shader cores or CUDA processor compared to 8 of
previous Nvidia GPUs. Each CUDA processor has a fully pipelined integer arithmetic
logic unit and floating point unit with better standard conformance and fused multiply-
add instruction for both single and double precision. The integer precision was raised
11
from 24 bit to 32 bit so multi-instruction emulation is no longer required. Special
function units in each SM can execute transcendental instructions such as sin, cosine,
reciprocal and square root.
Figure 7 Nvidia GF100 (Fermi) processor with parallel kernel execution
Single-precision performance of GF100 is about 1.7 Tflops but double-precision
performance is only half at 800 Gflops, significantly better than the Radeon 5870.
Previous architectures required that all SMs in the chip worked on the same kernel
12
(function/program/loop) at the same time. In this generation the GigaThread scheduler
can execute threads from multiple kernels in parallel. This chip is specifically designed
to provide better support for GPGPU with memory error correction, native support for
C++ (including virtual functions, function pointers, dynamic memory management using

13
by aiming to produce an x86-compatible GPU that would later be integrated into Intel
CPUs. These would all lead to a new kind of processor called Accelerated Processing
Unit (APU).
The third trend is the evolution of multicore CPUs into many-core processors in
which individual cores form a cluster system. In December 2009, Intel unveiled the
newest product of its Terascale Computing Research program, a 48-core x86 processor.

Figure 8 The Intel 48-core processor. To the right is a dual-core tile. The
processor has 24 such tiles in a 6 by 4 layout.
It represents the sequel to Intel's 2007 Polaris 80-core prototype that was based
on simple floating point units. This device is called a "Single-chip Cloud Computer"
(SCC). The structure of the chip resembles that of a cluster with cores connected
through a message-passing network with 256 GB/s bandwidth. Shared-memory is
simulated on software. Cache coherence and power management is also software-based.
Each core can run its own OS and software, which resembles a cloud computing center.
Each tile (2 cores) can have its own frequency, and groupings of four tiles (8 cores) can
each run at their own voltage. The SCC can run all 48 cores at one time over a range of
25W to 125W and selectively vary the voltage and frequency of the mesh network as
well as sets of cores. This 48 core device consists of 1.3 billion transistors produced
using 45nm high-k metal gate. Intel are currently handing out these processors to its
partners in both industry and academy to enhance further research in parallel computing.
Tilera corporation is also producing processors with one hundred cores. Each
core can run a Linux OS independently. The processor also has Dynamic Distributed
14
Cache technology which provides a fully coherent shared cache system across an
arbitrary sized array of tiles. Programming can be done normally on a Linux derivative
with full support for C and C++ and Tilera parallel libraries. The processor utilizes
VLIW (Very Long Instruction Word) with RISC instructions for each core. The primary
focus of this processor is for networking, multimedia and clouding computing with a

need and many tasks still need to be done manually. Algorithms also need to be adapted
to the limitations of current hardware and software tools.
In the following parts, we will present some programming tools for desktop
systems with multi-core CPUs and multi GPUs that we that we consider useful for
exploiting parallelism in scientific computing. The grouping is just for easy comparison
between similar tools as some tools provide more than one kind of parallelization.
1.2.1 CPU Thread-based Tools: OpenMP, Intel Threading Building Blocks, and
Cilk++
Windows and Linux (and other Unixes) provide API’s for creating and
manipulating operating system threads using WinAPI threads and POSIX threads
(Pthreads), respectively. These threading approaches may be convenient when there's a
natural way to functionally decompose an application - for example, into a user interface
thread, a compute thread or a render thread.
However, in the case of more complicated parallel algorithms, the manual
creating and scheduling thread can lead to more complex code, longer development time
and not optimal execution.
The alternative is to program atop a concurrency platform — an abstraction layer
of software that coordinates, schedules, and manages the multicore resources.
Using thread pools is a parallel pattern that can provide some improvements. A
thread pool is a strategy for minimizing the overhead associated with creating and
destroying threads and is possibly the simplest concurrency platform. The basic idea of a
thread pool is to create a set of threads once and for all at the beginning of the program.
When a task is created, it executes on a thread in the pool, and returns the thread to the
pool when finished. A problem is when the task arrives and the pool has no thread
available. The pool then suspends the task and wakes it up when a new thread is
available. This requires synchronization such as locks to ensure atomicity and avoid
concurrency bugs. Thread pools are common for the server-client model but for other
tasks, scalability and deadlocks still pose problems.
This calls for concurrency platforms with higher levels of abstraction that
provide more scalability, productivity and maintainability. Some examples are OpenMP,

17
Beside scheduling clauses, OpenMP also has clauses for data sharing attribute,
synchronization, IF control, initialization, data copying, reduction and other concurrent
operations.
A typical OpenMP parallelized loop may look like:
#pragma omp for schedule(dynamic, CHUNKSIZE)
for(int i = 2; i <= N-1; i++)
for(int j = 2; j <= i; j++)
for(int k = 1; k <= M; k++)
b[i][j]+=a[i-1][j]/k+a[i+1][j]/k;
Figure 9 OpenMP fork-join model.
Intel’s Threading Building Blocks (TBB) [4] is an open source C++ template
library developed by Intel for writing task based multithreaded applications with ideas
and models inherited from many previous languages and libraries. While OpenMP uses
the pragma approach for parallelization, TBB uses the library approach. The first
version came out in August 2006 and since then TBB has seen widespread use in many
applications, especially game engines such as Unreal. TBB is available with both a
commercial license and an open source license. The latest version 2.2 was introduced in
August 2009. The library has also received Jolt Productivity award and InfoWorld OSS
award.
It is a library based on generic programming, requires no special compiler support,
and is processor and OS independent. This makes TBB ideal for parallelizing legacy
applications. TBB has support for Windows, Linux, OS X, Solaris, PowerPC, Xbox,
18
QNX, FreeBSD and can be compiled using Visual C++, Intel C++, gcc and other
popular compilers.

TBB is not a thread-replacement library but provides a higher level of
abstraction. Developers do not work directly with threads but tasks, which are mapped
to threads by the library runtime. The number of threads are automatically managed by


Nhờ tải bản gốc

Tài liệu, ebook tham khảo khác

Music ♫

Copyright: Tài liệu đại học © DMCA.com Protection Status