Welcome and Introduction
Yves Robert (INRIA)
Talk Improving the Trust in Results of Numerical Simulations and Scientific Data Analytics [slides]
Franck Cappello (ANL)
Since the introduction of fault tolerance in HPC, fault tolerance
papers have been dealing mainly with process crashes, network
dysfunctions, and radiation-induced bit flips. This orientation
comes directly from the generic fault tolerance problems and
solutions seen in distributed systems. Thus, the main objective
has been to protect the execution, and mostly in an
application-agnostic way.
Many real-life examples, however, demonstrate that application
results can also be disrupted by hardware and software bugs.
The risk of cyber attacks is also taken seriously. Unfortunately,
the fault tolerance and resilience techniques currently used in HPC
were not designed for these types of disruptions, and in practice
most fail to provide useful solutions.
What matters is not only to protect HPC executions; it is also to
protect the correctness of the results such executions produce.
The scientific problem behind this statement is the trustworthiness
of HPC application results. And to improve the trustworthiness, we
must start from the results of the execution, as opposed to how the
execution is implemented with processes and communications. The main
objective is to avoid data corruptions that influence the results.
This talk will review the notion of trust, the different types of
disruptions leading to corruption of results, the ways that users
build trust in application results, and the limitations of current
techniques (fault tolerance/resilience, validation and verification,
uncertainty quantification). We will present examples of results
corruptions, some leading to catastrophic consequences, as well as
an approach to improve result trustworthiness.
Talk Expected and Unexpected Challenges to Extreme Scale Reliability [slides]
Bill Kramer (UIUC, NCSA)
Extreme scale systems of today and tomorrow have on the order of one
million to ten million processing elements, tens of millions of memory
components, and kilometers of interconnection cables. Moreover, these
extreme scale systems, such as Blue Waters, are executing potentially
billions of lines of software at any given instant.
Studies of reliability traditionally focus on the many hardware
components, their failure rates and the steps an application might take to
mitigate such failures. While hardware failures are important to address,
it is increasingly obvious that many, some argue most, system failures
are software based. Equally concerning, the mean time to repair (MTTR)
for software errors is longer, which raises the probability that double
and triple faults must be handled simultaneously, making recovery and
resiliency much more challenging.
This talk will examine recent trends in reliability and performance
analysis using most of the data collected over more than two years of
Blue Waters service. It will draw insights as to the failure causes and possible
service. It will draw insights as to the failure causes and possible
solutions to make systems and applications more resilient. It will also
offer comments on how to use today's insights into designing and
implementing better systems and applications in the future.
Talk Checkpointing HPC applications [slides]
Thomas Ropars (INRIA)
The goal of this talk is to give an overview of checkpointing
techniques for HPC applications. It will present the main problems
that need to be solved to ensure the successful execution of
distributed applications despite crash failures. We will start by
reviewing the basics of rollback-recovery techniques for
message-passing applications (happened-before relation, consistent
global state, etc.). We will then describe the main families of
rollback-recovery protocols (checkpointing and message logging) and
discuss their applicability in the HPC context. Finally we will
present the most recent advances (multi-level checkpointing,
hierarchical protocols, etc.) in this area to cope with challenges
raised by very large scale HPC systems.
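The consistent global states mentioned above can be illustrated with a small sketch: a cut (one checkpoint per process) is consistent when no message is recorded as received without its send also being recorded. The data structures and event indices below are illustrative assumptions, not material from the talk.

```python
# Sketch: checking whether a cut of a message-passing execution is
# consistent, i.e. contains no "orphan" message (received but not sent).
# Process names and event indices are hypothetical, for illustration only.

def is_consistent_cut(cut, messages):
    """cut: dict mapping process -> index of its last checkpointed event.
    messages: list of (sender, send_idx, receiver, recv_idx) tuples.
    The cut is consistent if every receive inside the cut has its
    matching send inside the cut as well."""
    for sender, send_idx, receiver, recv_idx in messages:
        if recv_idx <= cut[receiver] and send_idx > cut[sender]:
            return False  # receive recorded without its send: orphan message
    return True

# Two processes: p0 sends at its event 3, p1 receives at its event 2.
msgs = [("p0", 3, "p1", 2)]
print(is_consistent_cut({"p0": 3, "p1": 2}, msgs))  # True: send included
print(is_consistent_cut({"p0": 2, "p1": 2}, msgs))  # False: orphan message
```

Rollback-recovery protocols differ precisely in how they guarantee that the recovered state forms such a consistent cut, either by coordinating checkpoints or by logging messages.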
Talk Optimal checkpointing periods with fail-stop and silent errors [slides]
Anne Benoit (INRIA)
In this talk, we introduce probabilistic models to determine
the optimal checkpointing periods (Young's approximation and Daly's
formula) when the platform is subject to fail-stop errors, both for
the coordinated protocol and for the hierarchical one. Also, we extend
these classical results to in-memory checkpointing, and discuss the
impact of prediction and replication. Finally, we tackle silent errors
by proposing models to deal with both fail-stop and silent errors, and
we derive the optimal checkpointing periods in realistic frameworks.
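As a rough sketch of the classical result mentioned above, Young's first-order approximation of the optimal checkpointing period can be computed directly. The checkpoint cost and MTBF values below are illustrative assumptions, not figures from the talk.

```python
from math import sqrt

# Sketch: Young's first-order approximation of the optimal checkpointing
# period, with C the checkpoint cost and mu the platform MTBF (seconds).

def young_period(C, mu):
    """Period minimizing the first-order waste C/T + T/(2*mu)."""
    return sqrt(2 * C * mu)

def waste(T, C, mu):
    """Checkpoint overhead per period plus expected re-executed work."""
    return C / T + T / (2 * mu)

C, mu = 60.0, 24 * 3600.0      # 1-minute checkpoint, 1-day MTBF (assumed)
T_opt = young_period(C, mu)
print(T_opt)                   # ~3220 s, i.e. a checkpoint every ~54 min
```

Checkpointing more often wastes time on checkpoints; less often wastes time re-executing lost work, which is why the waste function has a single minimum at this period.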
Hands-On Mathematical Exercises on Daly and Extensions [slides]
Aurélien Cavelan (INRIA), Hongyang Sun (INRIA)
In this talk, we will demonstrate the mathematical derivations of the classic Young/Daly formula on the optimal checkpointing interval for fail-stop errors.
Also, we will introduce verification mechanisms (partial or guaranteed) to cope with silent errors, as well as multi-level checkpointing protocols for dealing with both error sources. Finally, we will show how to extend the results when also considering the optimization of energy consumption.
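The first-order derivation behind the Young/Daly formula, which these exercises walk through, can be sketched as follows, with C the checkpoint cost and mu the platform MTBF:

```latex
% Waste per period of length T: the checkpoint overhead C/T plus the
% expected re-executed work T/(2\mu) after a failure.
\[
  \mathrm{Waste}(T) \;\approx\; \frac{C}{T} + \frac{T}{2\mu},
  \qquad
  \frac{d\,\mathrm{Waste}}{dT} \;=\; -\frac{C}{T^{2}} + \frac{1}{2\mu} \;=\; 0
  \;\Longrightarrow\;
  T_{\mathrm{opt}} \;=\; \sqrt{2\,C\,\mu}.
\]
```

Daly's formula refines this first-order result with higher-order terms and a correction for the recovery time, which the hands-on session covers in detail.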
End of Day 1