
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

Nguyễn Hoàng Vũ

A PARALLEL IMPLEMENTATION ON
MODERN HARDWARE FOR GEO-ELECTRICAL
TOMOGRAPHICAL SOFTWARE

UNDERGRADUATE THESIS (FULL-TIME PROGRAM)
Major: Information Technology

HANOI, 2010

Supervisor: Assoc. Prof. Dr.Sc. Phạm Huy Điển
Co-supervisor: Dr. Đoàn Văn Tuyến

1.1.2 Process-Level Parallel Architectures
1.1.3 Data parallel architectures
1.1.4 Future trends in hardware
1.2 Programming tools for scientific computing on personal desktop systems
1.2.1 CPU Thread-based Tools: OpenMP, Intel Threading Building Blocks, and Cilk++
1.2.2 GPU programming with CUDA
1.2.3 Heterogeneous programming and OpenCL
CHAPTER 2. THE FORWARD PROBLEM IN RESISTIVITY TOMOGRAPHY

OpenMP: Open Multi-Processing
OpenCL: Open Computing Language
TBB: Intel Threading Building Blocks

INTRODUCTION
Geophysical methods are based on studying the propagation of the different
physical fields within the earth’s interior. One of the most widely used fields in
geophysics is the electromagnetic field generated by natural or artificial (controlled)
sources. Electromagnetic methods comprise one of the three principal technologies in
applied geophysics (the other two being seismic methods and potential field methods).
There are many geo-electromagnetic methods currently used in the world. Of these
electromagnetic methods, resistivity tomography is the most widely used and it is of
major interest in our work.
Resistivity tomography [17] or resistivity imaging is a method used in
exploration geophysics [18] to measure underground physical properties in mineral,
hydrocarbon, groundwater or even archaeological exploration. It is closely related to
the medical imaging technique called electrical impedance tomography (EIT), and
is mathematically the same inverse problem. In contrast to medical EIT, however,
resistivity tomography is essentially a direct current method. This method is relatively
new compared to other geophysical methods. Since the 1970s, extensive research has

Our resistivity tomographical software is an example of applying high
performance computing on modern hardware to computational geoscience. For 2-D
surveys with small datasets, sequential programs still provide results in acceptable
time. Parallelizing in these situations gives faster response times and therefore
increases research productivity, but it is not a critical feature. For 3-D surveys,
however, datasets are much larger and computationally far more expensive. One
solution is to use clusters. Clusters, however, are not a feasible option for many
scientific institutions in Vietnam. They are expensive and consume a lot of power.
Because they are available only in large institutions, getting access to them is also
inconvenient. Clusters are not suitable for field trips either, because of
difficulties in transportation and power supply. Exploiting the parallel capabilities of
modern hardware is therefore a must to enable cost-effective scientific computing on
desktop systems for such problems. This can reduce hardware cost and power
consumption while increasing user convenience and software development productivity.
These benefits are especially valuable to scientific software customers in Vietnam,
where cluster deployment is costly in both money and human resources.


Chapter 1 High Performance Computing on Modern Hardware
1.1 An overview of modern parallel architectures
Computer speed is crucial in most software, especially scientific applications.
As a result, computer designers have always looked for mechanisms to improve
hardware performance. Processor speed and packaging densities have been enhanced
greatly over the past decades. However, due to the physical limitations of electronic
components, other mechanisms have been introduced to improve hardware

There are two common kinds of instruction-level parallel architecture.
The first kind is the superscalar pipelined architecture, which subdivides the execution
of each machine instruction into a number of stages. As short stages allow for high
clock frequencies, the recent trend is to use longer pipelines. For example, early
Pentium 4 cores use a 20-stage pipeline and the latest Pentium 4 core contains a
31-stage pipeline.

Figure 1 Generic 4-stage pipeline; the colored boxes represent instructions
independent of each other [21].
A common problem with these pipelines is branching. When a branch occurs,
the processor cannot fetch the next instruction until the branch has been resolved. A
branch prediction unit is put into the CPU to guess which branch will be executed.
However, if branches are predicted poorly, the performance penalty can be high. Some
programming techniques for making branches more predictable for the hardware can
be found in [2]. Programming tools such as Intel VTune Performance Analyzer can be
of great help in profiling programs for mispredicted branches.
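
To make the branching problem concrete, the following is a minimal C++ sketch of one well-known technique: replacing a data-dependent branch with arithmetic. It is not taken from [2], and the function names are hypothetical. Whether the compiler actually emits a conditional jump for either version depends on the compiler and optimization level, so the effect should be confirmed with a profiler.

```cpp
#include <cstdint>
#include <vector>

// Branchy version: the outcome of the `if` depends on the data, so the
// branch predictor guesses poorly when the input values are random.
std::int64_t sum_above_branchy(const std::vector<int>& v, int threshold) {
    std::int64_t sum = 0;
    for (int x : v) {
        if (x >= threshold) {
            sum += x;
        }
    }
    return sum;
}

// Branch-free version: the comparison yields 0 or 1, which is used as a
// multiplier, so the loop body has no data-dependent jump in the source.
std::int64_t sum_above_branchless(const std::vector<int>& v, int threshold) {
    std::int64_t sum = 0;
    for (int x : v) {
        sum += static_cast<std::int64_t>(x) * (x >= threshold);
    }
    return sum;
}
```

Both functions return the same sum for the same input. Another common option, when the algorithm allows it, is to sort the input first so that the branch outcome changes only once.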
The second kind of instruction-level parallel architecture is the VLIW (very long
instruction word) architecture. A very long instruction word usually controls 5 to 30
replicated execution units. An example of a VLIW architecture is the Intel Itanium
processor [23]. As of 2009, Itanium processors can execute up to six instructions per
cycle. In ordinary architectures, superscalar execution and out-of-order execution are
used to speed up computing. This increases hardware complexity, because the processor
must decide at runtime whether instructions are independent so that they can be
executed simultaneously. In VLIW architectures, this is decided at compile time, which
shifts the complexity from hardware to software. All operations in one
instruction word must be independent, so efficient code generation is a hard task for
compilers. The difficulty of writing compilers and porting legacy software to the new
architecture has made Itanium unpopular.

incorporate heterogeneous collections of computers, possibly distributed
geographically. They are, therefore, optimized for workloads containing many
independent packets of work. The two biggest grid computing networks are
Folding@home and SETI@home (BOINC). Both have a computing capability of a
few petaflops, while the most powerful traditional cluster barely exceeds 1
petaflops.

Figure 3 Intel CPU trends [12].


The most notable change to process-level parallel architectures happened in the
last few years. Figure 3 shows that although the number of transistors a CPU contains
still increases according to Moore’s law (commonly quoted as doubling every 18
months), the clock speed has virtually stopped rising due to heat dissipation and
manufacturing problems. CPU manufacturers have now turned to adding more cores to a
single CPU while the clock speed stays the same or decreases. An individual core is a
distinct processing element and is basically the same as the CPU in an older
single-core PC. A multi-core chip can now be considered an SMP MIMD parallel
processor. It can run at a lower clock speed and therefore consume less power while
still delivering more processing power.
The latest Intel Core i7-980X (Gulftown) CPU has 6 cores and 12 MB of cache.
With hyper-threading it supports up to 12 hardware threads. Future multi-core CPU
generations may have 8, 16 or even 32 cores within the next few years. These new
architectures, especially in multi-processor nodes, can provide a level of parallelism
that was previously available only on cluster systems.

Figure 4 Intel Gulftown CPU

AVX, the size of the SIMD vector registers is increased from 128 bits to 256 bits, which
means the CPU can operate on 8 single-precision or 4 double-precision floating point
numbers in one instruction. CPU SIMD processing has been used widely by
programmers in many applications such as multimedia and encryption, and compiler
code generation for these architectures is now quite good. Even now that
multicore CPUs are popular, understanding SIMD extensions is still vital for
optimizing program execution on each CPU core. A good handbook on utilizing
software vectorization is [1].
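
The lane counts quoted above follow from dividing the register width by the element width. The C++ sketch below checks that arithmetic and shows the kind of simple loop a vectorizing compiler can map onto SIMD registers. The names `simd_lanes` and `saxpy` are chosen here for illustration, and the flags mentioned in the comment (GCC's `-O3 -mavx`) are an assumption about the toolchain, not something taken from the thesis.

```cpp
#include <cstddef>

// Number of elements of a given size that fit into one SIMD register.
constexpr std::size_t simd_lanes(std::size_t register_bits,
                                 std::size_t element_bytes) {
    return register_bits / (8 * element_bytes);
}

// 256-bit AVX registers: 8 lanes of 32-bit floats, 4 lanes of 64-bit doubles.
static_assert(simd_lanes(256, 4) == 8, "8 single-precision lanes");
static_assert(simd_lanes(256, 8) == 4, "4 double-precision lanes");
// 128-bit SSE registers hold half as many.
static_assert(simd_lanes(128, 4) == 4, "4 single-precision lanes");

// y[i] = a * x[i] + y[i]. Each iteration is independent of the others, so a
// compiler may process several elements per instruction when vectorization
// is enabled (for example with GCC's -O3 -mavx; assumed toolchain).
void saxpy(std::size_t n, float a, const float* x, float* y) {
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The loop is written in plain C++ on purpose: it stays portable, and the compiler decides how many lanes to use on the target CPU.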
However, graphics processing units (GPUs) are perhaps the hardware with the
most dramatic growth in processing power over the last few years.
Graphics chips started as fixed-function graphics pipelines. Over the years, these
chips became increasingly programmable through newer graphics APIs and
shaders. In the 1999-2000 timeframe, computer scientists, along with
researchers in fields such as medical imaging and electromagnetics, started using GPUs
to run general purpose computational applications. They found that the excellent
floating point performance of GPUs led to a huge performance boost for a range of
scientific applications. This was the advent of the movement called GPGPU, or
General Purpose computing on GPUs.
With the advent of programming platforms such as CUDA and OpenCL, GPUs
are now easier to program. With a processing power of a few Teraflops, GPUs are
now massively parallel processors at a much smaller scale. They are also termed
stream processors, as data is streamed directly from memory into the execution units
without the latency that CPUs incur. As can be seen in Figure 5, GPUs currently
outpace CPUs many times over in both speed and bandwidth.

Figure 5 Comparison between CPU and GPU speed and bandwidth (CUDA
programming Guide) [8].


Single-precision performance of GF100 is about 1.7 Tflops, but double-precision
performance is only about half of that at 800 Gflops, which is still significantly
better than the Radeon 5870.
Previous architectures required that all SMs in the chip work on the same kernel
(function/program/loop) at the same time. In this generation the GigaThread scheduler
can execute threads from multiple kernels in parallel. This chip is specifically designed
to provide better support for GPGPU, with memory error correction, native support for
C++ (including virtual functions, function pointers, dynamic memory management
using new and delete, and exception handling), and compatibility with CUDA, OpenCL
and DirectCompute. A true cache hierarchy with two levels is added, along with more
shared memory than in previous GPU generations. Context switching and atomic
operations are also faster. Fortran compilers are also available from PGI. Specific
versions for scientific computing will have from 3 GB to 6 GB of GDDR5 memory.
1.1.4 Future trends in hardware
Although the current parallel architectures are very powerful, especially for
parallel workloads, they will not stay the same in the future. From the current
situation, we can identify some trends for hardware in the next few years.
The first is the change in the composition of clusters. A cluster node can now
have several multicore processors and some graphics processors. Consequently,
clusters with fewer nodes can still have the same processing power. This also raises
the upper limit of cluster processing capability. Traditional clusters
consisting of only CPU nodes have virtually reached their peak at about 1 Pflops.
Adding more nodes would result in more system overhead with only a marginal increase
in speed. Electricity consumption is also enormous for such systems. Supercomputing
now accounts for 2 percent of the total electricity consumption of the entire United
States. Building supercomputers at the exascale (1000 Pflops) using traditional clusters
is far too costly. Graphics processors or similar architectures provide a good
Gflops/W ratio and are, therefore, vital to building supercomputers with larger

one time over a range of 25 W to 125 W and selectively vary the voltage and frequency
of the mesh network as well as sets of cores. This 48-core device consists of 1.3 billion
transistors produced using a 45 nm high-k metal gate process. Intel is currently
handing out these processors to its partners in both industry and academia to
encourage further research in parallel computing.
Tilera Corporation is also producing processors with one hundred cores. Each
core can run a Linux OS independently. The processor also has Dynamic Distributed
Cache technology, which provides a fully coherent shared cache system across an
arbitrarily sized array of tiles. Programming can be done normally on a Linux derivative
with full support for C and C++ and the Tilera parallel libraries. Each core uses a
VLIW (Very Long Instruction Word) design with RISC instructions. The
primary focus of this processor is networking, multimedia and cloud computing,
with a strong emphasis on integer computation to complement the GPU’s floating point
computation.
From all these trends, it is reasonable to assume that in the near future
we will see new architectures that combine features of all current architectures, such
as many-core processors in which each core consists of a CPU core with stream
processors as coprocessors. Such systems would provide tremendous computing power
per processor and would cause major changes in the field of computing.

1.2 Programming tools for scientific computing on personal desktop
systems
Traditionally, most scientific computing tasks have been done on clusters.
However, with the advent of modern hardware that provides a high level of parallelism,
many small to medium-sized tasks can now be run on a single high-end desktop
computer in reasonable time. Such systems are called “personal supercomputers”.
Although their configurations vary, most today employ multicore CPUs with
multiple GPUs. An example is the Fastra II desktop supercomputer [3] at University of

natural way to functionally decompose an application - for example, into a user
interface thread, a compute thread or a render thread.
However, for more complicated parallel algorithms, manually creating and
scheduling threads can lead to more complex code, longer development
time and suboptimal execution.
The alternative is to program atop a concurrency platform, an abstraction
layer of software that coordinates, schedules, and manages the multicore resources.
Using thread pools is a parallel pattern that can provide some improvements. A
thread pool is a strategy for minimizing the overhead associated with creating and
destroying threads and is possibly the simplest concurrency platform. The basic idea of
a thread pool is to create a set of threads once and for all at the beginning of the
program. When a task is created, it executes on a thread in the pool and returns the
thread to the pool when finished. A problem arises when a task arrives and the pool has
no thread available. The pool then suspends the task and wakes it up when a
thread becomes available. This requires synchronization such as locks to ensure
atomicity and avoid concurrency bugs. Thread pools are common in the client-server
model, but for other tasks, scalability and deadlocks still pose problems.
This calls for concurrency platforms with higher levels of abstraction that
provide more scalability, productivity and maintainability. Some examples are
OpenMP, Intel Threading Building Blocks, and Cilk++.
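
As a small illustration of what a higher level of abstraction looks like in practice, the C++ sketch below parallelizes a dot product with a single OpenMP directive. The function name `dot` is hypothetical, the two vectors are assumed to have the same length, and the code assumes a compiler with OpenMP enabled (for example GCC's `-fopenmp`). Without that option the pragma is ignored and the loop simply runs sequentially with the same result.

```cpp
#include <cstddef>
#include <vector>

// Dot product of two vectors of equal length (assumed by this sketch).
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    const long n = static_cast<long>(a.size());

    // The runtime splits the iterations among threads. The reduction clause
    // gives each thread a private partial sum and combines them at the end,
    // so no explicit lock is needed.
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < n; ++i) {
        sum += a[static_cast<std::size_t>(i)] * b[static_cast<std::size_t>(i)];
    }
    return sum;
}
```

Thread creation, work distribution and the final combination of partial results are all handled by the platform, in contrast to the manual bookkeeping a thread pool requires.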

OpenMP (Open Multi-Processing) [25] is an open concurrency platform with
support for multithreading through compiler pragmas in C, C++ and Fortran. It is an
API specification and compilers can provide different implementations. OpenMP is
governed by the OpenMP Architecture Review Board (ARB). The first OpenMP
specification came out in 1997 with support for Fortran, followed by C/C++ support in
1998. Version 2.0 was released in 2000 for Fortran and 2002 for C/C++. Version 3.0

for(int i = 2; i
