Lecture Notes in Computer Science 4192
Commenced Publication in 1973
Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
University of Dortmund, Germany
Madhu Sudan
Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Computer Science Department
1122 Volunteer Blvd, Knoxville, TN 37996-3450, USA
E-mail: [email protected]
Library of Congress Control Number: 2006931769
CR Subject Classification (1998): D.1.3, D.3.2, F.1.2, G.1.0, B.2.1, C.1.2
LNCS Sublibrary: SL 2 – Programming and Software Engineering
ISSN 0302-9743
ISBN-10 3-540-39110-X Springer Berlin Heidelberg New York
ISBN-13 978-3-540-39110-4 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
Springer is a part of Springer Science+Business Media
springer.com
© Springer-Verlag Berlin Heidelberg 2006
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper SPIN: 11846802 06/3142 543210
Preface
Since its inception in 1994 as a European PVM user’s group meeting,
EuroPVM/MPI has evolved into the foremost international conference dedicated to
the latest developments concerning MPI (Message Passing Interface) and PVM
(Parallel Virtual Machine). These include fundamental aspects of these message
passing standards, implementation, new algorithms and techniques, performance
and benchmarking, support tools, and applications using message passing.
Despite its focus, EuroPVM/MPI is accommodating to new message-passing and
other parallel and distributed programming paradigms beyond MPI and PVM.
– “Scalable Parallel Suffix Array Construction” by Fabian Kulla and Peter
Sanders (page 22)
– “Formal Verification of Programs That Use MPI One-Sided Communication”
by Salman Pervez, Ganesh Gopalakrishnan, Robert M. Kirby, Rajeev
Thakur and William Gropp (page 30)
“Late and breaking results”, which were submitted in August as brief
abstracts and therefore not included in these proceedings, were presented in the
eponymous session. Like the “Outstanding Papers” session, this was a premiere
at EuroPVM/MPI 2006.
Complementing the emphasis in the call for papers on new message-passing
paradigms and programming models, the invited talks by Richard Graham,
William Gropp and Al Geist addressed possible shortcomings of MPI for
emerging, large-scale systems, covering issues of fault tolerance and heterogeneity,
productivity and scalability, while the invited talk of Katherine Yelick dealt
with advantages of higher-level, partitioned global address space languages. The
invited talk of Vaidy Sunderam discussed challenges to message-passing
programming in dynamic metacomputing environments. Finally, with the invited
talk of Ryutaro Himeno, the audience gained insight into the role and design of
the projected Japanese peta-scale supercomputer.
An important part of EuroPVM/MPI is the technically oriented vendor
session. At EuroPVM/MPI 2006 eight significant vendors of hardware and software for
high-performance computing (Etnus, IBM, Intel, NEC, Dolphin Interconnect
Solutions, Hewlett-Packard, Microsoft, and Sun) presented their latest products
and developments.
Prior to the conference proper, four tutorials on various aspects of message
passing programming (“Using MPI-2: A Problem-Based Approach”,
“Performance Tools for Parallel Programming”, “High-Performance Parallel I/O”, and
“Hybrid MPI and OpenMP Parallel Programming”) were given by experts in
the respective fields.
Joachim Worringen C&C Research Labs, NEC Europe, Germany
Program Committee
George Almasi IBM, USA
Ranieri Baraglia CNUCE Institute, Italy
Richard Barrett ORNL, USA
Gil Bloch Mellanox, Israel
Arndt Bode Technical University of Munich, Germany
Marian Bubak AGH Cracow, Poland
Hakon Bugge Scali, Norway
Franck Cappello Université de Paris-Sud, France
Barbara Chapman University of Houston, USA
Brian Coghlan Trinity College Dublin, Ireland
Yiannis Cotronis University of Athens, Greece
Jose Cunha New University of Lisbon, Portugal
Marco Danelutto University of Pisa, Italy
Frank Dehne Carleton University, Canada
Luiz DeRose Cray, USA
Frederic Desprez INRIA, France
Erik D’Hollander University of Ghent, Belgium
Beniamino Di Martino Second University of Naples, Italy
Jack Dongarra University of Tennessee, USA
Graham Fagg University of Tennessee, USA
Edgar Gabriel University of Houston, USA
Al Geist Oak Ridge National Laboratory, USA
Patrick Geoffray Myricom, USA
Michael Gerndt TU München, Germany
Andrzej Goscinski Deakin University, Australia
Richard L. Graham LANL, USA
William D. Gropp Argonne National Laboratory, USA
Erez Haba Microsoft, USA
Carsten Trinitis TU München, Germany
Jerzy Wasniewski Danish Technical University, Denmark
Roland Wismueller University of Siegen, Germany
Felix Wolf Forschungszentrum Jülich, Germany
Joachim Worringen C&C Research Labs, NEC Europe, Germany
Laurence T. Yang St. Francis Xavier University, Canada
External Referees
(excluding members of the Program Committee)
Dorian Arnold
Christian Bell
Boris Bierbaum
Ron Brightwell
Michael Brim
Carsten Clauss
Rafael Corchuelo
Karen Devine
Frank Dopatka
Gábor Dózsa
Renato Ferrini
Rainer Finocchiaro
Igor Grobman
Yuri Gurevich
Torsten Höfler
Andreas Hoffmann
Ralf Hoffmann
Sascha Hunold
Mauro Iacono
Adrian Kacso
Matthew Legendre
Conference Organization
Bernd Mohr
Jesper Larsson Träff
Joachim Worringen
Sponsors
The conference would have been substantially more expensive and much less
pleasant to organize without the generous support of a good many industrial
sponsors. Platinum and Gold level sponsors also gave talks at the vendor
session on their latest products in parallel systems and message passing software.
EuroPVM/MPI 2006 gratefully acknowledges the contributions of the sponsors
to a successful conference.
Platinum Level Sponsors
Etnus, IBM, Intel, and NEC.
Gold Level Sponsors
Dolphin Interconnect Solutions, Hewlett-Packard, Microsoft, and Sun.
Standard Level Sponsor
QLogic.
Table of Contents
Invited Talks
Too Big for MPI? 1
Al Geist
Approaches for Parallel Applications Fault Tolerance 2
Richard L. Graham
Where Does MPI Need to Grow? 3
William D. Gropp
Peta-Scale Supercomputer Project in Japan and Challenges to Life and
Human Simulation in Japan 4
Ryutaro Himeno
Resource and Application Adaptivity in Message Passing Systems 5
Jesper Larsson Träff
Efficient Shared Memory and RDMA Based Design for MPI_Allgather
over InfiniBand 66
Amith Ranjith Mamidala, Abhinav Vishnu, Dhabaleswar K. Panda
Communication Protocols
High Performance RDMA Protocols in HPC 76
Tim S. Woodall, Galen Mark Shipman, George Bosilca,
Richard L. Graham, Arthur B. Maccabe
Implementation and Shared-Memory Evaluation of MPICH2 over the
Nemesis Communication Subsystem 86
Darius Buntinas, Guillaume Mercier, William Gropp
MPI/CTP: A Reconfigurable MPI for HPC Applications 96
Manjunath Gorentla Venkata, Patrick G. Bridges
Debugging and Verification
Correctness Checking of MPI One-Sided Communication Using
Marmot 105
Bettina Krammer, Michael M. Resch
An Interface to Support the Identification of Dynamic MPI 2 Processes
for Scalable Parallel Debugging 115
Christopher Gottbrath, Brian Barrett, Bill Gropp,
Ewing “Rusty” Lusk, Jeff Squyres
Modeling and Verification of MPI Based Distributed Software 123
Igor Grudenic, Nikola Bogunovic
Fault Tolerance
FT-MPI, Fault-Tolerant Metacomputing and Generic Name Services:
A Case Study 133
David Dewolfs, Jan Broeckhove, Vaidy Sunderam,
Graham E. Fagg
Scalable Fault Tolerant Protocol for Parallel Runtime
Filesystem MEMFS 222
Jan Seidel, Rudolf Berrendorf, Marcel Birkner,
Marc-André Hermanns
Effective Seamless Remote MPI-I/O Operations with Derived Data
Types Using PVFS2 230
Yuichi Tsujita
Implementation Issues
Automatic Memory Optimizations for Improving MPI Derived
Datatype Performance 238
Surendra Byna, Xian-He Sun, Rajeev Thakur, William Gropp
Improving the Dynamic Creation of Processes in MPI-2 247
Márcia C. Cera, Guilherme P. Pezzi, Elton N. Mathias,
Nicolas Maillard, Philippe O.A. Navaux
Object-Oriented Message Passing
Non-blocking Java Communications Support on Clusters 256
Guillermo L. Taboada, Juan Touriño, Ramón Doallo
Modernizing the C++ Interface to MPI 266
Prabhanjan Kambadur, Douglas Gregor, Andrew Lumsdaine,
Amey Dharurkar
Limitations and Extensions
Can MPI Be Used for Persistent Parallel Services? 275
Robert Latham, Robert Ross, Rajeev Thakur
Observations on MPI-2 Support for Hybrid Master/Slave Applications
in Dynamic and Heterogeneous Environments 285
Claudia Leopold, Michael Süß
What MPI Could (and Cannot) Do for Mesh-Partitioning on
Non-homogeneous Networks 293
Guntram Berti, Jesper Larsson Träff
Performance
Parallel DSMC Gasflow Simulation of an In-Line Coater for Reactive
Sputtering 383
A. Pflug, M. Siemers, B. Szyszka
Parallel Simulation of T-M Processes in Underground Repository of
Spent Nuclear Fuel 391
Jiří Starý, Radim Blaheta, Ondřej Jakl, Roman Kohut
Poster Abstracts
On the Usability of High-Level Parallel IO in Unstructured Grid
Simulations 400
Dries Kimpe, Stefan Vandewalle, Stefaan Poedts
Automated Performance Comparison 402
Joachim Worringen
Improved GROMACS Scaling on Ethernet Switched Clusters 404
Carsten Kutzner, David van der Spoel, Martin Fechner,
Erik Lindahl, Udo W. Schmitt, Bert L. de Groot,
Helmut Grubmüller
Asynchronity in Collective Operation Implementation 406
Alexandr Konovalov, Alexandr Kurylev, Anton Pegushin,
Sergey Scharf
PARUS: A Parallel Programming Framework for Heterogeneous
Multiprocessor Systems 408
Alexey N. Salnikov
Application of PVM to Protein Homology Search 410
Mitsuo Murata
Author Index 413
Too Big for MPI?
Al Geist
Oak Ridge National Laboratory
Oak Ridge, Tennessee, USA
[email protected]
ity, system design and implementation, and system size. These errors may lead
to catastrophic application failure (termination of an application run with a CPU
failure), silent application errors (such as network data corruption), or application
hangs (such as when a network interface card (NIC) malfunctions), all wasting
valuable computer time. For certain classes of computer systems, dealing with these
failures is a requirement to provide a simulation environment reliable enough to
meet end-user needs. Also, the more automated these solutions are, requiring
minimal or no end-user intervention, the more likely they are to be used to achieve the
required application stability. Dealing with failure, or fault tolerance, while
minimizing application performance degradation is an active research area, with no
consensus as to what the optimal solution strategies are, or even what failures need
to be considered. Errors include transient data transmission errors
(dropped or corrupt packets), transient and permanent network failures (NIC),
and process failure, to list a few. The current MPI standard addresses a limited
number of failure scenarios, with application termination being the default
response to failure. While the standard provides a mechanism for users to override
this default response, it does not define error codes that provide information on
system-level failures, whether hardware or software. Nonetheless, these need to be
addressed to provide end-users with systems that meet their computing needs.
Building on experience gained in the LA-MPI, FT-MPI, and LAM/MPI projects, the
Open MPI collaboration has implemented, and is continuing to implement,
optional solutions that deal with a number of failure scenarios, to reduce the
application failure rate to acceptable levels. The types of errors currently
being dealt with include transient network data transmission errors, transient and
permanent NIC failures, and process failure. The talk will discuss fault detection,
fault recovery methods, and the degree to which applications need to be modified
to benefit from these, if any. In addition, the performance impact of these solutions
on several applications will be discussed.
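As an illustration of the first error class named above (dropped or corrupt
packets), the following is a minimal sketch of checksum-based fault detection
with retransmission as the recovery step. It is not taken from Open MPI or any
of the projects mentioned; the names FlakyLink, make_packet, check_packet and
reliable_send are hypothetical, and CRC-32 is an assumed choice of checksum.

```python
import random
import zlib


def make_packet(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC-32 checksum to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_packet(packet: bytes):
    """Return the payload if the checksum matches, otherwise None."""
    if len(packet) < 4:
        return None
    payload = packet[:-4]
    received_crc = int.from_bytes(packet[-4:], "big")
    return payload if zlib.crc32(payload) == received_crc else None


class FlakyLink:
    """Hypothetical channel that drops or corrupts the first `faults` packets."""

    def __init__(self, faults: int, seed: int = 0):
        self.faults = faults
        self.rng = random.Random(seed)

    def transmit(self, packet: bytes):
        if self.faults > 0:
            self.faults -= 1
            if self.rng.random() < 0.5:
                return None  # transient fault: packet dropped
            corrupted = bytearray(packet)
            corrupted[0] ^= 0xFF  # transient fault: first byte corrupted
            return bytes(corrupted)
        return packet


def reliable_send(link, payload: bytes, max_retries: int = 5):
    """Retransmit until a packet with a valid checksum arrives.

    Returns (payload, attempts). Raises RuntimeError when the retries are
    exhausted, which this sketch treats as a permanent failure.
    """
    packet = make_packet(payload)
    for attempt in range(1, max_retries + 1):
        received = link.transmit(packet)
        if received is not None:
            data = check_packet(received)
            if data is not None:
                return data, attempt
    raise RuntimeError("link failure: retries exhausted")
```

Calling reliable_send(FlakyLink(faults=2), b"data") succeeds on the third
attempt, while a link whose fault count exceeds max_retries raises
RuntimeError; a real implementation would at that point fail over to another
network path rather than give up.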
B. Mohr et al. (Eds.): PVM/MPI 2006, LNCS 4192, p. 2, 2006.
© Springer-Verlag Berlin Heidelberg 2006
will be completed in March, 2012. The project will end in March, 2013.
This project includes two important items in software development: grid
middleware and application software in Nano Science and Life Science. The
development of grid middleware is planned because the supercomputer center that
will operate the peta-scale supercomputer is planned to provide services not for a
specific institute or application area, like the Earth Simulator, but for general use
as a national infrastructure. Nano and Life sciences are the major application
areas we are going to emphasize, along with industrial applications.
We are starting to select the target applications to build a benchmark suite
covering various scientific and industrial applications. We are also discussing the concept
design and will finalize it in summer 2006. I will introduce the project plan and
application area, especially in Life science, in detail at the conference.
Resource and Application Adaptivity in Message
Passing Systems
Vaidy Sunderam
Department of Mathematics and Computer Science
Emory University
Atlanta, Georgia, USA
[email protected]
Clusters and MPPs are traditional platforms for message passing applications,
but there is growing interest in more dynamic metacomputing environments.
The latter are characterized by dynamicity in availability and available capacity
– of both nodes and interconnects. This talk will discuss fundamental challenges
in executing message passing programs in such environments, and analyze the
issue of adaptivity from the resource and application points of view. Pragmatic
solutions to some of these challenges will then be described, along with new
approaches to dealing with the aggregation of multidomain computing platforms
of two different application frameworks: an immersed boundary method package
and an elliptic solver using adaptive mesh refinement.
Using MPI-2: A Problem-Based Approach
William D. Gropp and Ewing Lusk
Mathematics and Computer Science Division
Argonne National Laboratory
Argonne, Illinois, USA
{gropp, lusk}@mcs.anl.gov
MPI-2 introduced many new capabilities, including dynamic process
management, one-sided communication, and parallel I/O. Implementations of these
features are becoming widespread. This tutorial shows how to use these features by
walking through all of the steps involved in designing, coding, and tuning solutions to
specific problems. The problems are chosen for their practical use in applications
as well as for their ability to illustrate specific MPI-2 topics. Complete examples
will be discussed and full source code will be made available to the attendees.
Performance Tools for Parallel Programming
Bernd Mohr and Felix Wolf
Research Centre Jülich
Jülich, Germany
{b.mohr, f.wolf}@fz-juelich.de
Extended Abstract. Application developers are facing new and more
complicated performance tuning and optimization problems as architectures become more
complex. In order to achieve reasonable performance on these systems, HPC users
need help from performance analysis tools. In this tutorial we will introduce the
of such dynamic instrumentation systems is the DynInst project from the
University of Maryland and University of Wisconsin, which provides an infrastructure to
help tool developers build performance tools. We will compare and contrast these
instrumentation approaches.
Regardless of the instrumentation mechanism, there are two dimensions that
need to be considered for performance data collection: when the performance
collection is triggered and how the performance data is recorded. The triggering
mechanism can be activated by an external agent, such as a timer or a hardware
counter overflow, or internally, by code inserted through instrumentation. The
former is also known as sampling or asynchronous triggering, while the latter is sometimes
referred to as synchronous. Performance data can be summarized during runtime
and stored in the form of a profile, or can be stored in the form of traces. We
will present these approaches and discuss how each one reflects a different
balance among data volume, potential instrumentation perturbation, accuracy, and
implementation complexity. Performance data should be stored in a format that
allows the generality and extensibility necessary to represent a diverse set of
performance metrics and measurement points, independent of language and
architecture idiosyncrasies. We will describe common trace file formats (Vampir,
CLOG, SLOG, EPILOG), as well as profile data formats based on the eXtensible
Markup Language (XML), which is becoming a standard for describing
performance data representation.
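The difference between the two recording styles can be made concrete with a
small sketch. It uses synchronous triggering (code inserted around a function
by a decorator) and records the same events twice: once summarized into a
profile, once appended to a trace. The Recorder class and its instrument
method are hypothetical names chosen for illustration and are not the
interface of any tool discussed in the tutorial.

```python
import functools
import time
from collections import defaultdict


class Recorder:
    """Collects one profile entry per function and one trace event per call."""

    def __init__(self):
        # Profile: fixed-size summary per function, updated at runtime.
        self.profile = defaultdict(lambda: {"calls": 0, "seconds": 0.0})
        # Trace: one timestamped record per event, grows with run length.
        self.trace = []

    def instrument(self, func):
        """Wrap func so that every entry and exit triggers a measurement."""
        name = func.__name__

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            self.trace.append(("enter", name, start))
            try:
                return func(*args, **kwargs)
            finally:
                end = time.perf_counter()
                self.trace.append(("exit", name, end))
                entry = self.profile[name]
                entry["calls"] += 1
                entry["seconds"] += end - start

        return wrapper
```

After three calls to a function decorated with a Recorder's instrument method,
the profile holds a single entry with a call count of three, while the trace
holds six events. This is the data volume trade-off described above: the
profile stays constant in size but loses the ordering and timing of individual
events, which the trace preserves.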
Hardware performance counters have become an essential asset for application
performance tuning. We will discuss in detail how users can access hardware
performance counters using application programming interfaces such as PAPI
and PCL, in order to correlate the behavior of the application to one or more of
the components of the hardware.
[email protected]
2
Dolphin Interconnect Solutions R&D Germany
Wachtberg, Germany
[email protected]
Effectively using I/O resources on HPC machines is a black art. The purpose
of this tutorial is to shed light on the state-of-the-art in parallel I/O and to
provide the knowledge necessary for attendees to best leverage the I/O resources
available to them.
In the first half of the tutorial we discuss the software involved in parallel
I/O. We cover the entire I/O software stack from parallel file systems at the
lowest layer, to intermediate layers (such as MPI-IO), and finally high-level I/O
libraries (such as HDF-5). The emphasis is not just on how to use these layers,
but ways to use them that result in high performance. As part of this discussion
we will present benchmark results from current systems.
The second half of the tutorial will be hands-on, with the participants solving
typical problems of parallel I/O using different approaches. The performance of
these approaches will be evaluated on different machines at remote sites, using
various types of file systems. The results are then compared to get a full picture
of the performance differences and characteristics of the chosen approaches on
the different platforms.
Basic knowledge of parallel (MPI) programming in C and/or Fortran is
assumed. For the second half, each participant should bring a notebook
computer, running either Windows XP or Linux (x86). A limited number of
loan notebook computers are available on request.