Optimization of Fault-Tolerance Strategies for Workflow Applications

Research topic and goals

In this project, we aim at finding efficient fault-tolerant scheduling schemes for workflow applications that can be expressed as a directed acyclic graph (DAG) of tasks.

Checkpointing-recovery is the traditional fault-tolerance technique when it comes to resilience for large-scale platforms. Unfortunately, as platform scale increases, checkpoints must become more frequent to accommodate with the increasing Mean Time Between Failure (MTBF). As such, it is expected that checkpoint-recovery will become a major bottleneck for applications running on post-petascale platforms.

We first focus on replication as a way of mitigating the checkpointing-recovery overhead. A task can be checkpointed and/or replicated, so that if a single replica fails, no recovery is needed. Our goal is to decide which task to checkpoint, which task to replicate, and how much resource should be allocated to each task for the execution of general workflow applications. For that, we first need to derive a clear model for replication, as there are many ways to implement it, even for a single task.

Results for 2016/2017

The initial work for this project has been focused towards using replication as a detection and correction mechanism for Silent Data Corruptions (SDC). Although other detection techniques exist for HPC applications, based on algorithms (ABFT), invariant preservation or data analytics, replication remains the most transparent and least intrusive technique.

In this project, replication is combined with checkpointing to enable rollback and recovery when forward recovery is not possible, which occurs when too many replicas are corrupted. The goal is to find the right level of replication (duplication, triplication or more) needed to efficiently detect and correct silent errors at scale. We have provided a detailed analytical study for this framework.

Results for 2017/2018

We have extended these results for platforms subject to both silent and fail-stop errors. Fail-stop errors are immediately detected, unlike silent errors, and replication may also help tolerating such errors.

We have considered two flavors of replication: process replication and group replication. Process replication applies to message-passing applications with communicating processes. Each process is replicated, and the platform is composed of process pairs, or triplets. Group replication applies to black-box applications, whose parallel execution is replicated several times. The platform is partitioned into two halves (or three thirds). In both scenarios, results are compared before each checkpoint, which is taken only when both results (duplication) or two out of three results (triplication) coincide. If not, one or more silent errors have been detected, and the application rolls back to the last checkpoint, as well as when fail-stop errors have struck.

We provide a detailed analytical study for all of these scenarios, with formulas to decide, for each scenario, the optimal parameters as a function of the error rate, checkpoint cost, and platform size. We also report a set of extensive simulation results that nicely corroborates the analytical model.

While the previous work had focused on applications for which we can decide at which frequency we can checkpoint, and our aim has been to find the optimal checkpointing period, we have also initiated the study for linear chains of parallel tasks. The aim is then to decide which tasks to checkpoint and/or replicate. In this case, we have provided an optimal dynamic programming algorithm, and an extensive set of simulations to assess (i) in which scenarios checkpointing performs better than replication, or vice-versa; and (ii) in which scenarios the combination of both methods is useful, and to what extent.

Results for 2018/2019

In high-performance computing environments, input/output (I/O) from various sources often contend for scarce available bandwidth. Adding to the I/O operations inherent to the failure-free execution of an application, I/O from checkpoint/restart (CR) operations (used to ensure progress in the presence of failures) place an additional burden as it increase I/O contention, leading to degraded performance. We have considered a cooperative scheduling policy that optimizes the overall performance of concurrently executing CR-based applications which share valuable I/O resources. First, we provide a theoretical model and then derive a set of necessary constraints needed to minimize the global waste on the platform. Our results demonstrate that the optimal checkpoint interval, as defined by Young/Daly, despite providing a sensible metric for a single application, is not sufficient to optimally address resource contention at the platform scale. We therefore show that combining optimal checkpointing periods with I/O scheduling strategies can provide a significant improvement on the overall application performance, thereby maximizing platform throughput. Overall, these results provide critical analysis and direct guidance on checkpointing large-scale workloads in the presence of competing I/O while minimizing the impact on application performance. This work has received the best paper award at APDCM 2018, a workshop run in conjunction with IPDPS 2018.

Three JLESC partners, Inria, Riken and UTK, have conducted a study to compare the performance of different approaches to tolerate failures using checkpoint/restart when executed on large-scale failure-prone platforms. We study (i) rigid applications, which use a constant number of processors throughout execution; (ii) moldable applications, which can use a different number of processors after each restart following a fail-stop error; and (iii) rid applications, which are moldable applications restricted to use rectangular processor grids (such as many dense linear algebra kernels). For each application type, we compute the optimal number of failures to tolerate before relinquishing the current allocation and waiting until a new resource can be allocated, and we determine the optimal yield that can be achieved. We instantiate our performance model with a realistic applicative scenario and make it publicly available for further usage.

Also, the Inria partner has studied checkpointing strategies for workflows and has shown how to design an efficient strategy that achieves an efficient trade-off between extreme approaches where either all application data is checkpointed, or no application data is checkpointed. Results demonstrate that our algorithm outperforms both the former approach, because of lower checkpointing overhead, and the latter approach, because of better resilience to failures (Han et al. 2018), (Han et al. 2018).

Results for 2019/2020

We have revisited replication coupled with checkpointing for fail-stop errors. This work focuses on divisible-load applications rather than workflows. In this context, replication enables the application to survive many fail-stop errors, thereby allowing for longer checkpointing periods. Previously published works use replication with the NORESTART strategy, which works as follows: (i) compute the application Mean Time To Interruption (MTTI) as a function of the number of processor pairs and the individual processor Mean Time Between Failures (MTBF); (ii) use checkpointing period P à la Young/Daly, replacing the MTBF by the MTTI; and (iii) never restart failed processors until the application crashes. We introduce the RESTART strategy where failed processors are restarted after each checkpoint. We compute the optimal checkpointing period Q for this strategy, which is much larger than P thereby decreasing I/O pressure. We show through simulations that using Q and the RESTART strategy, instead of P and the usual NORESTART strategy, significantly decreases the overhead induced by replication, in terms of both total execution time and energy consumption. This work has appeared in the proceedings of SC’2019.

Results for 2020/2021

We have first focused on the resilient scheduling of parallel jobs on high performance computing (HPC) platforms to minimize the overall completion time, or makespan. We have revisited the problem by assuming that jobs are subject to transient or silent errors, and hence may need to be re-executed each time they fail to complete successfully. This work generalizes the classical framework where jobs are known offline and do not fail: in this classical framework, list scheduling that gives priority to longest jobs is known to be a 3-approximation when imposing to use shelves, and a 2-approximation without this restriction. We show that when jobs can fail, using shelves can be arbitrarily bad, but unrestricted list scheduling remains a 2-approximation. We have designed several heuristics, some list-based and some shelf-based, along with different priority rules and backfilling options. We have assessed and compared their performance through an extensive set of simulations, using both synthetic jobs and log traces from the Mira supercomputer. This work has appeared at APDCM’2020, and an extended version appeared in IJNC.

We have then focused the resilient scheduling of moldable parallel jobs on high-performance computing (HPC) platforms. Moldable jobs allow for choosing a processor allocation before execution, and their execution time obeys various speedup models. The objective is to minimize the overall completion time of the jobs, or the makespan, when jobs can fail due to silent errors and hence may need to be re-executed after each failure until successful completion. Again, this work generalizes the classical scheduling framework for failure-free jobs. To cope with silent errors, we introduce two resilient scheduling algorithms, LPA-List and Batch-List, both of which use the List strategy to schedule the jobs. Without knowing a priori how many times each job will fail, LPA-List relies on a local strategy to allocate processors to the jobs, while Batch-List schedules the jobs in batches and allows only a restricted number of failures per job in each batch. We prove new approximation ratios for the two algorithms under several prominent speedup models (e.g., roofline, communication, Amdahl, power, monotonic, and a mixed model). An extensive set of simulations is conducted to evaluate different variants of the two algorithms, and the results show that they consistently outperform some baseline heuristics. Overall, our best algorithm is within a factor of 1.6 of a lower bound on average over the entire set of experiments, and within a factor of 4.2 in the worst case. Preliminary results with a subset of speedup models have been published in Cluster’2020.

In parallel, we have also been working on the comparison of several fault-tolerance methods for the detection and correction of floating-point errors in matrix-matrix multiplication.
These methods include replication, triplication, Algorithm-Based Fault Tolerance (ABFT) and residual checking (RC). Error correction for ABFT can be achieved either by recovering the corrupted entries from the correct data and the checksums by solving a small-size linear system of equations, or by recomputing corrupted coefficients. We show that both approaches can be used for RC. We provide a synthetic presentation of all methods before discussing their pros and cons. We have implemented all these methods with calls to optimized BLAS routines, and we provide performance data for a wide range of failure rates and matrix sizes. In addition, with respect to the literature, we have considered relatively high error rates.

Results for 2021/2022

At Inria, we have pursued the work on resilient scheduling of moldable parallel jobs on high-performance computing (HPC) platforms, and an extension of the results has been published in Transactions on Computers (TC) in August 2021 (Benoit et al. 2021).

We have also proposed an optimal checkpointing strategy to protect iterative applications from fail-stop errors. We consider a general framework, where the application repeats the same execution pattern by executing consecutive iterations, and where each iteration is composed of several tasks. These tasks have different execution lengths and different checkpoint costs.
Assume that there are $n$ tasks, and that task $a_i$ has execution time $t_{i}$ and checkpoint cost $c_{i}$. A naive strategy would checkpoint after each task. Another naive strategy would checkpoint at the end of each iteration. A strategy inspired by the Young/Daly formula would work for $\sqrt{2 \mu C}$ seconds, where $\mu$ is the application MTBF and C is the average checkpoint time, and checkpoint at the end of the current task (and repeat). Another strategy, also inspired by the Young/Daly formula, would select the task $a_{min}$ with the smallest checkpoint cost $c_{min}$ and would checkpoint after every p-th instance of that task, leading to a checkpointing period $p T$, where $T = \sum_{i=0}^{n-1} a_{i}$ is the time per iteration. One would choose the period so that $p T \approx \sqrt{2 \mu c_{\min}}$ to obey the Young/Daly formula. All these naive and Young/Daly strategies are suboptimal. Our main contribution is to show that the optimal checkpointing strategy is globally periodic, and to design a dynamic programming algorithm that computes the optimal checkpointing pattern. This pattern may well checkpoint many different tasks, and this across many different iterations. We show through simulations, both from synthetic and real-life application scenarios, that the optimal strategy outperforms the naive and Young/Daly strategies. These results have been published in IEEE Transactions on Parallel and Distributed Systems (TPDS) in 2022 (Du et al. 2022).

At UTK, there has been some work on several fronts:

Interruption Scope: Failure detection and subsequent error reporting is the starting point of the entire recovery procedure. In the ULFM fault tolerant MPI, the scope of error reporting has been limited to processes that perform a direct communication operation with a failed process. The main rationale is found in application use cases (e.g., master/worker, stencil, job-stealing, etc.) that operate under the assumption that communication between non-failed processes should proceed without interruption (running through failures except when critical). However, when considering complex applications comprised on multiple software modules, certain modules may require more thorough information about errors. Many recovery techniques are global in nature, and a fault must trigger a recovery action in all modules of the application—including in processes that may not be currently involved in communications for the module. Thus we added controls that can set MPI communicators in a mode that will implicitly interrupt MPI communication on the communicator when any process on the communicator group, or in the entire application, has failed. The benefit of this scoped recovery is to significantly improve on the complexity of the error management code by streamlining multiple error paths, making implicit the propagation of errors when appropriate, and enable cross-module, cross-library response to faults.
Uniform Errors: Collective communication in MPI are a powerful tool for expressing patterns where a group of processes participate in the same communication operation. By abstracting the intent of the communication pattern, the operation can help the user structure the code flow, and let the MPI implementation optimize. A simplistic assumption postulates that such structuring features would translate intact into a regime with failures. A notable aspect is if errors should be raised uniformly, that is the same error code is returned at all rank by the same collective operation. While this is desirable for the structuring aspect this gives to fault-tolerant codes, this also entails a significant overhead in most cases as it adds a supplementary concensus, an extra fault-tolerant synchronization step that is not required by the semantic of the underlying communication. Thus, to protect fault-free performance, users need to be able to control when such consensus are performed. The MPI_COMM_AGREE operation has provided an explicit means for users to synchronize in a fault-tolerant fashion when the need arises. We added a per-communicator property (practically, by setting specific MPI Info keys on the communicator) that lets the user decide whether collective operations on the communicator should operate at maximum speed (non-uniform) or with implicit safety (uniform error reporting at all ranks). This control enables a distinction between pure communication operations (which are often critical for performance) and setup/management operations (like operations that create new communicators, or operations that change the logical epoch in the algorithm).
Asynchronous Recovery: The MPI_COMM_SHRINK operation provides the operational construct that permits recreating a fully functional communication context in which not only select point-to-point communication but also collective communication can be carried. The core of the operation relies on agreeing on the set of failed processes that need to be excluded from the input communicator and then producing a resultant communicator with a well-specified membership of processes. The current definition of the shrink operation is blocking and synchronous, which limits the opportunity for overlapping its cost. We introduce a new non-blocking variant of the shrink operation, MPIX_COMM_ISHRINK(comm, newcomm, req). To support this operation, we designed a non-blocking variant of the consensus-like elimination of failed processes as well as the selection of the internal context identifier (a unique number that is used to match MPI messages to the correct communicator on which the operation was posted). The expected advantages are multiple. Some applications may have to recover multiple communicators after a failure; with the availability of the non-blocking shrink, these multiple shrink operation have an opportunity to overlap one another. The non-blocking shrink also gives an opportunity for overlap between the cost of rebuilding communicators with the cost of restoring the application dataset (e.g., reloading checkpoints, computing checksum on data, etc.).

Results for 2022/2023

First, we have been further investigating the Young/Daly formula. This formula provides an approximation of the optimal checkpoint period for a parallel application executing on a supercomputing platform. It was originally designed for preemptible tightly-coupled applications. We provided some background and survey various application scenarios to assess the usefulness and limitations of the formula in (Benoit et al. 2022).

Also, we have been revisiting distributed termination detection algorithms in the context of High-Performance Computing (HPC) applications. We introduced an efficient variant of the Credit Distribution Algorithm (CDA) and compared it to the original algorithm (HCDA) as well as to its two primary competitors: the Four Counters algorithm (4C) and the Efficient Delay-Optimal Distributed algorithm (EDOD). We analyzed the behavior of each algorithm for some simplified task-based kernels and showed the superiority of CDA in terms of the number of control messages. We then compared the implementation of these algorithms over a task-based runtime system, PaRSEC and showed the advantages and limitations of each approach on a practical implementation (Bosilca et al. 2022).

Some progress on the projects mentionned in 2021-2022 were also done by UT Knoxville, in particular on integrating process, control-flow, and data resiliency layers using a hybrid Fenix/Kokkos approach, demonstrating that designing integrable systems rather than integrated systems allows for user-designed optimization and upgrading of resilience techniques while maintaining the simplicity and performance of all-in-one resilience solutions. More application-specific choice in resilience strategies allows for better long-term flexibility, performance, and - importantly - simplicity.

Finally, we have initiated a study about the impact of I/O interference on application performance. HPC applications execute on dedicated nodes but share the I/O system. As a consequence, interferences surge when several applications perform I/O operations simultaneously, and each I/O operation takes much longer than expected because each application is only allotted a fraction of the I/O bandwidth. Checkpoint operations are periodic and high-volume I/O operations and, as such, are particularly sensitive to interferences.

Results for 2023/2024

This year, as a follow-up to our joint work published last year in (Benoit et al. 2022), we have extended this paper by adding contributions of other JLESC members (Leonardo Bautista-Gomez from BSC, and Sheng Di from ANL). Hence, we have considerably extended the scope of our survey, and we have submitted this contribution, entitled “A Survey on Checkpointing Strategies: Should We Always Checkpoint à la Young/Daly?”, to the special issue of FGCS scheduled for 2024 and which will focus on JLESC collaboration results. We are covering several new topics such as multi-level checkpointing, checkpointing preemptible applications in practice, checkpoints taking variable times, silent error detectors, imperfect verifications, cases where the order of the optimal checkpointing period changes, and the combination of checkpointing with replication.

We have also considered applications executing for a fixed duration, namely the length of the reservation that it has been granted. The checkpoint duration is a stochastic random variable that obeys some well-known probability distribution law. The question is when to take a checkpoint towards the end of the execution, so that the expectation of the work done is maximized. We addressed two scenarios. In the first scenario, a checkpoint can be taken at any time; despite its simplicity, this natural problem has not been considered yet (to the best of our knowledge). We provided the optimal solution for a variety of probability distribution laws modeling checkpoint duration. The second scenario was more involved: the application is a linear workflow consisting of a chain of tasks with IID stochastic execution times, and a checkpoint can be taken only at the end of a task. First, we introduced a static strategy where we computed the optimal number of tasks before the application checkpoints at the beginning of the execution. Then, we designed a dynamic strategy that decides whether to checkpoint or to continue executing at the end of each task. We instantiated this second scenario with several examples of probability distribution laws for task durations. This work has been published in FTXS’2023, a workshop co-located with SC’2023 (Barbut et al. 2023).

Results for 2024/2025

This year, we have published a survey on the FGCS special issue focusing on JLESC collaboration results, see (Bautista-Gomez et al. 2024). The Young/Daly formula provides an approximation of the optimal checkpointing period for a parallel application executing on a supercomputing platform. It was originally designed to handle fail-stop errors for preemptible tightly-coupled applications, but has been extended to other application and resilience frameworks. We provide some background and survey various scenarios to assess the usefulness and limitations of the formula, both for preemptible applications and workflow applications represented as a graph of tasks. We also discuss scenarios with uncertainties, and extend the study to silent errors. We exhibit cases where the optimal period is of a different order than that dictated by the Young/Daly formula, and finally we explain how checkpointing can be further combined with replication.

As a follow-up of our work published the previous year at FTXS’2023, a workshop co-located with SC’2023 (Barbut et al. 2023), we have considered checkpointing strategies for a parallel application executing on a large-scale platform whose nodes are subject to failures. The application executes for a fixed duration, namely the length of the reservation that it has been granted. We have shown the difficulty of the problem with small examples: it turns out that the optimal checkpointing strategy neither always uses periodic checkpoints nor always takes its last checkpoint exactly at the end of the reservation. Then, we have introduced a dynamic heuristic that is periodic and decides for the checkpointing frequency based upon thresholds for the time left; we have determined threshold times Tn such that it is best to plan for exactly n checkpoints if the time left (or initially the length of the reservation) is between Tn and Tn+1. Next, we have used time discretization and designed a (complicated) dynamic programming algorithm that computes the optimal solution, without any restriction on the checkpointing strategy. Finally, we have reported the results of an extensive simulation campaign that shows that the optimal solution is far more efficient than the Young/Daly periodic approach for short or mid-size reservations. These results have been published at FTXS’2024, see (Benoit et al. 2024).

Results for 2025/2026

This year, we have investigated how to protect numerical iterative algorithms from all types of errors that can strike at scale: fail-stop errors (a.k.a. failures) and silent errors, striking both as computation errors and memory bit-flips. We have combined various techniques: detectors for computation errors, checksums for memory errors, and checkpoint/restart for failures. The objective is to minimize the expected time per iteration of the algorithm. We designed a hierarchical pattern that combines and interleaves all these fault-tolerance mechanisms, and we determined the optimal periodic pattern that achieves this objective. We instantiated these results for the performance analysis of the Preconditioned Conjugate Gradient (PCG) algorithm, reporting several scenarios where the optimal pattern dramatically decreases the overhead due to error mitigation. These results have been published in IJHPCA, see (Tremodeux et al. 2025).

While the previous work considered perfect detectors that detect all errors, we have also investigated what happens when the verification mechanism is not perfect, i.e., it fails to detect some errors. In this setting, we consider that silent errors strike at each iteration with some probability. An error striking at iteration I will be detected only after iteration (I-1)+X, where X is a random variable obeying a probability distribution with bounded support [1, D] such as a truncated geometric distribution. Intuitively, the error silently amplifies during some iterations before it can be detected at distance X or higher. As a consequence, when taking a verification before a checkpoint, there is the risk of missing an error that has struck recently but cannot be detected yet. The simplest solution is to keep two checkpoints in memory, and to use two segments of D-1 iterations, each followed by a verification and a checkpoint: in steady state, perform D-1 iterations after the last checkpoint and take a verification; (i) if successful, we can safely erase the oldest checkpoint, because the most recent checkpoint is necessarily verified. We take a new checkpoint to replace the old one, which is not verified yet; (ii) otherwise, we need to rollback to the oldest checkpoint, not to the last one which may not be verified. Can this simple scheme perform better than replication? What is the optimal number of segments (hence of checkpoints) to keep in memory, and what is the length of these segments? We have answered these questions, both theoretically and through Monte Carlo simulations, and reported these results in a publication at EuroPar’25, see (Benoit et al. 2025).

Visits and meetings

Aurélien Cavelan (INRIA) visited Franck Cappello (ANL) in Chicago for three months (March, April, and May 2016) to initiate the project. Furthermore, we have been meeting regularly in the previous years. In particular, we have been attending the SC conference (November 2016 and November 2017), where we had extensive discussions to make progress. We represented the JLESC at the Inria booth during these conferences.

When not meeting in person, we have stayed in close collaboration through regular Skype meetings, which allowed us to make progress on the project.

Yves Robert (INRIA) made several visits in 2018/2019 to Univ. Tenn. Knoxville, for a total of approximately two months, and a total of two months and a half for 2019/2020.

Valentin Le Fèvre (INRIA) has visited Univ. Tenn. Knoxville for 10 days in February 2019, and for 10 days in January 2020.

Due to the Covid-19 sanitary situation, we have not had any visits for two years (March 2020 - February 2022), but we had numerous virtual interactions.

Yves Robert (INRIA) made three visits to Univ. Tenn. Knoxville in 2022, for a total of approximately one month.

Yves Robert (INRIA) made four visits to Univ. Tenn. Knoxville in 2023, for a total of approximately one month and a half.

Yves Robert (INRIA) made one visit to Univ. Tenn. Knoxville in 2024 for 9 days. The partners have attended SC’24 in Atlanta, USA, and had project discussions during the conference.

The partners met at the 17th JLESC workshop at Argonne National Lab, USA, in May 2025.

Impact and publications

Two papers have been published in FTXS’17 (Benoit et al. 2017), (Benoit et al. 2017).

The work combining fail-stop and silent errors has been published in JPDC (Benoit et al. 2018).

A work on executing workflows on high-bandwidth memory architectures was published in ICPP’18 (Benoit et al. 2018).

The work on optimal cooperative checkpointing for shared high-performance computing platforms was the best paper at APDCM’18 (Hérault et al. 2018).

The work studying whether moldable applications perform better on failure-prone HPC platforms was published in Resilience’18 (Fèvre et al. 2018).

The work on replication with checkpointing was published at SC’19 (Benoit et al. 2019).

The work on the comparison of several fault-tolerance methods for the detection and correction of floating-point errors in matrix-matrix multiplication was published at Resilience’20 (Fèvre et al. 2020).

The work on resilient scheduling of moldable parallel jobs on high-performance computing (HPC) platforms was published in IEEE TC in 2021 (Benoit et al. 2021).

In 2022, two joint publications were published from the project, the first one to assess the usefulness and limitations of the Young/Daly formula for checkpointing, in the IC3 conference (Benoit et al. 2022), and the other one to compare distributed termination detection algorithms for modern HPC platform, in the IJNC journal (Bosilca et al. 2022).

In 2023, we have published one joint publication (Barbut et al. 2023) on when to checkpoint at the end of a fixed-length reservation.

In 2024, we have published a survey on the FGCS special issue focusing on JLESC collaboration results (Bautista-Gomez et al. 2024), and a publication on checkpointing strategies for a fixed-length execution (Benoit et al. 2024).

In 2025, we have published two results on resilience for iterative applications: a journal (Tremodeux et al. 2025) where several types of errors are considered, and a conference (Benoit et al. 2025) focusing on silent errors with non-perfect detectors.

Tremodeux, Alix, Anne Benoit, Emmanuel Agullo, Thomas Herault, Luc Giraud, and Yves Robert. 2025. “Fault-Tolerant Numerical Iterative Algorithms at Scale.” International Journal of High Performance Computing Applications. https://inria.hal.science/hal-05234063.

@article{TremodeuxEtAl2025,
  title = {{Fault-tolerant numerical iterative algorithms at scale}},
  author = {Tremodeux, Alix and Benoit, Anne and Agullo, Emmanuel and Herault, Thomas and Giraud, Luc and Robert, Yves},
  url = {https://inria.hal.science/hal-05234063},
  journal = {{International Journal of High Performance Computing Applications}},
  publisher = {{SAGE Publications}},
  year = {2025},
  keywords = {Numerical iterative algorithms ; computation errors ; memory bit-flips ; fail-stop errors ; Preconditionated Conjugate Gradient},
  pdf = {https://inria.hal.science/hal-05234063v1/file/ijhpca-hal.pdf},
  hal_id = {hal-05234063},
  hal_version = {v1}
}

Benoit, Anne, Thomas Herault, Yves Robert, and Alix Tremodeux. 2025. “Partial Detectors Versus Replication To Cope With Silent Errors.” In Euro-Par 2025 - 31st International European Conference on Parallel and Distributed Computing. Dresden, Germany. https://inria.hal.science/hal-05231852.

@inproceedings{BenoitEtAl2025,
  title = {{Partial Detectors Versus Replication To Cope With Silent Errors}},
  author = {Benoit, Anne and Herault, Thomas and Robert, Yves and Tremodeux, Alix},
  url = {https://inria.hal.science/hal-05231852},
  booktitle = {{Euro-Par 2025 - 31st International European Conference on Parallel and Distributed Computing}},
  address = {Dresden, Germany},
  year = {2025},
  month = aug,
  pdf = {https://inria.hal.science/hal-05231852v1/file/europar.pdf},
  hal_id = {hal-05231852},
  hal_version = {v1}
}

Bautista-Gomez, Leonardo, Anne Benoit, Sheng Di, Thomas Herault, Yves Robert, and Hongyang Sun. 2024. “A Survey on Checkpointing Strategies: Should We Always Checkpoint à La Young/Daly?” Future Generation Computer Systems 161: 315–28. https://doi.org/https://doi.org/10.1016/j.future.2024.07.022.

@article{BautistaEtAl2024,
  title = {{A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly?}},
  journal = {Future Generation Computer Systems},
  volume = {161},
  pages = {315-328},
  year = {2024},
  issn = {0167-739X},
  doi = {https://doi.org/10.1016/j.future.2024.07.022},
  url = {https://www.sciencedirect.com/science/article/pii/S0167739X24003777},
  author = {Bautista-Gomez, Leonardo and Benoit, Anne and Di, Sheng and Herault, Thomas and Robert, Yves and Sun, Hongyang},
  keywords = {Checkpointing, Optimal period, Young/Daly formula, Resilience},
  hal_id = {hal-04767137},
  hal_version = {v1}
}

Benoit, Anne, Lucas Perotin, Yves Robert, and Frédéric Vivien. 2024. “Checkpointing Strategies for a Fixed-Length Execution.” In 14th Workshop on Fault-Tolerance for HPC at EXtreme Scale (FTXS 2024). Atlanta, United States. https://inria.hal.science/hal-04714372.

@inproceedings{BenoitEtAl2024,
  title = {{Checkpointing strategies for a fixed-length execution}},
  author = {Benoit, Anne and Perotin, Lucas and Robert, Yves and Vivien, Fr{\'e}d{\'e}ric},
  url = {https://inria.hal.science/hal-04714372},
  booktitle = {{14th Workshop on Fault-Tolerance for HPC at eXtreme Scale (FTXS 2024)}},
  address = {Atlanta, United States},
  year = {2024},
  month = nov,
  keywords = {Checkpointing ; failures ; fail-stop errors ; fixed-length execution ; fixed-size reservation},
  pdf = {https://inria.hal.science/hal-04714372v1/file/ckpt-resa.pdf},
  hal_id = {hal-04714372},
  hal_version = {v1}
}

Barbut, Quentin, Anne Benoit, Thomas Herault, Yves Robert, and Frédéric Vivien. 2023. “When to Checkpoint at the End of a Fixed-Length Reservation?” In Proceedings of Fault Tolerance for HPC at EXtreme Scales (FTXS) Workshop. https://inria.hal.science/hal-04215554.

@inproceedings{BarbutEtAl2023,
  author = {Barbut, Quentin and Benoit, Anne and Herault, Thomas and Robert, Yves and Vivien, Frédéric},
  title = {When to checkpoint at the end of a fixed-length reservation?},
  booktitle = {Proceedings of Fault Tolerance for HPC at eXtreme Scales (FTXS) Workshop},
  url = {https://inria.hal.science/hal-04215554},
  location = {Denver, United States},
  date = {2023-11-12},
  year = {2023}
}

Benoit, Anne, Yishu Du, Thomas Herault, Loris Marchal, Guillaume Pallez, Lucas Perotin, Yves Robert, Hongyang Sun, and Frederic Vivien. 2022. “Checkpointing à La Young/Daly: An Overview.” In Proceedings of the 2022 Fourteenth International Conference on Contemporary Computing, 701–10. IC3-2022. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3549206.3549328.

@inproceedings{BenoitEtAl2022,
  author = {Benoit, Anne and Du, Yishu and Herault, Thomas and Marchal, Loris and Pallez, Guillaume and Perotin, Lucas and Robert, Yves and Sun, Hongyang and Vivien, Frederic},
  title = {Checkpointing \`{a} La Young/Daly: An Overview},
  year = {2022},
  isbn = {9781450396752},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3549206.3549328},
  doi = {10.1145/3549206.3549328},
  booktitle = {Proceedings of the 2022 Fourteenth International Conference on Contemporary Computing},
  pages = {701–710},
  numpages = {10},
  location = {Noida, India},
  series = {IC3-2022}
}

Bosilca, George, Aurélien Bouteiller, Thomas Herault, Valentin Le Fèvre, Yves Robert, and Jack Dongarra. 2022. “Comparing Distributed Termination Detection Algorithms for Modern HPC Platforms.” Int. J. of Networking and Computing 12 (1): 26–46.

@article{BosilcaEtAl2022,
  title = {{Comparing distributed termination detection algorithms for modern HPC platforms}},
  author = {Bosilca, George and Bouteiller, Aurélien and Herault, Thomas and Fèvre, Valentin Le and Robert, Yves and Dongarra, Jack},
  journal = {Int. J. of Networking and Computing},
  year = {2022},
  volume = {12},
  number = {1},
  pages = {26-46}
}

Du, Yishu, Guillaume Pallez, Loris Marchal, and Yves Robert. 2022. “Optimal Checkpointing Strategies for Iterative Applications.” IEEE Trans. Parallel Distributed Systems 33 (3): 507–22.

@article{DuEtAl2022,
  author = {Du, Yishu and Pallez, Guillaume and Marchal, Loris and Robert, Yves},
  journal = {IEEE Trans. Parallel Distributed Systems},
  volume = {33},
  pages = {507-522},
  title = {Optimal checkpointing strategies for iterative applications},
  number = {3},
  year = {2022}
}

Benoit, Anne, Valentin Le Fèvre, Lucas Perotin, Padma Raghavan, Yves Robert, and Hongyang Sun. 2021. “Resilient Scheduling of Moldable Parallel Jobs to Cope with Silent Errors.” IEEE Transactions on Computers.

@article{BenoitEtAl2021,
  author = {Benoit, Anne and Fèvre, Valentin Le and Perotin, Lucas and Raghavan, Padma and Robert, Yves and Sun, Hongyang},
  journal = {IEEE Transactions on Computers},
  title = {Resilient scheduling of moldable parallel jobs to cope with silent errors},
  year = {2021}
}

Fèvre, Valentin Le, Thomas Herault, Julien Langou, and Yves Robert. 2020. “A Comparison of Several Fault-Tolerance Methods for the Detection and Correction of Floating-Point Errors in Matrix-Matrix Multiplication.” In Resilience: 13th Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids, Jointly Published with Euro-Par 2020. LNCS. Springer Verlag.

@inproceedings{LeFevreEtAl2020,
  author = {Fèvre, Valentin Le and Herault, Thomas and Langou, Julien and Robert, Yves},
  booktitle = {{Resilience}: 13th Workshop on Resiliency in High Performance Computing
    in Clusters, Clouds, and Grids,
                  jointly published with {Euro-Par 2020}},
  title = {A comparison of several fault-tolerance methods for the detection and correction of floating-point errors
  in matrix-matrix multiplication},
  year = {2020},
  series = {LNCS},
  publisher = {Springer Verlag}
}

Benoit, Anne, Thomas Hérault, Valentin Le Fèvre, and Yves Robert. 2019. “Replication Is More Efficient than You Think.” In SC’2019, the IEEE/ACM Conference on High Performance Computing Networking, Storage and Analysis. ACM Press.

@inproceedings{BenoitEtAl2019,
  author = {Benoit, Anne and Hérault, Thomas and Fèvre, Valentin Le and Robert, Yves},
  booktitle = {{SC'2019}, the IEEE/ACM Conference on High Performance Computing Networking, Storage and Analysis},
  title = {Replication is more efficient than you think},
  publisher = {ACM Press},
  year = {2019}
}

Fèvre, Valentin Le, George Bosilca, Aurelien Bouteiller, Thomas Herault, Atsushi Hori, Yves Robert, and Jack Dongarra. 2018. “Do Moldable Applications Perform Better on Failure-Prone HPC Platforms?” In Resilience: 11th Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids, Jointly Published with Euro-Par 2018. LNCS. Springer Verlag.

@inproceedings{LeFevreEtAl2018,
  author = {Fèvre, Valentin Le and Bosilca, George and Bouteiller, Aurelien and Herault, Thomas and Hori, Atsushi and Robert, Yves and Dongarra, Jack},
  booktitle = {{Resilience}: 11th Workshop on Resiliency in High Performance Computing
    in Clusters, Clouds, and Grids,
                  jointly published with {Euro-Par 2018}},
  title = {{Do moldable applications perform better
  on failure-prone HPC platforms?}},
  year = {2018},
  series = {LNCS},
  publisher = {Springer Verlag}
}

Hérault, Thomas, Yves Robert, Aurélien Bouteiller, Dorian Arnold, Kurt Ferreira, George Bosilca, and Jack Dongarra. 2018. “Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms.” In 20th Workshop on Advances in Parallel and Distributed Computational Models APDCM 2018. IEEE Computer Society Press.

@inproceedings{HeraultEtAl2018,
  author = {Hérault, Thomas and Robert, Yves and Bouteiller, Aurélien and Arnold, Dorian and Ferreira, Kurt and Bosilca, George and Dongarra, Jack},
  booktitle = {20th Workshop on Advances in Parallel and Distributed
                                Computational Models {APDCM 2018}},
  publisher = {IEEE Computer Society Press},
  title = {Optimal cooperative checkpointing for shared high-performance computing platforms},
  year = {2018}
}

Benoit, Anne, Swann Perarnau, Loïc Pottier, and Yves Robert. 2018. “A Performance Model to Execute Workflows on High-Bandwidth Memory Architectures.” In ICPP’2018, the 47th Int. Conf. on Parallel Processing. IEEE Computer Society Press.

@inproceedings{BenoitEtAl2018b,
  author = {Benoit, Anne and Perarnau, Swann and Pottier, Loïc and Robert, Yves},
  booktitle = {{ICPP'2018}, the 47th Int. Conf. on Parallel Processing},
  title = {A performance model to execute workflows on high-bandwidth memory architectures},
  publisher = {{IEEE} Computer Society Press},
  year = {2018}
}

Benoit, Anne, Aurélien Cavelan, Franck Cappello, Padma Raghavan, Yves Robert, and Hongyang Sun. 2018. “Coping with Silent and Fail-Stop Errors at Scale by Combining Replication and Checkpointing.” J. Parallel and Distributed Computing.

@article{BenoitEtAl2018,
  author = {Benoit, Anne and Cavelan, Aurélien and Cappello, Franck and Raghavan, Padma and Robert, Yves and Sun, Hongyang},
  journal = {J. Parallel and Distributed Computing},
  title = {Coping with silent and fail-stop errors at scale by combining replication and checkpointing},
  year = {2018},
  badvolume = {98},
  badpages = {8-24}
}

———. 2017. “Identifying the Right Replication Level to Detect and Correct Silent Errors at Scale.” In Proceedings of the 7th Workshop on Fault Tolerance for HPC at EXtreme Scale (FTXS).

@inproceedings{benoitEtAl2017identifying,
  title = {Identifying the right replication level to detect and correct silent errors at scale},
  author = {Benoit, Anne and Cavelan, Aur{\'e}lien and Cappello, Franck and Raghavan, Padma and Robert, Yves and Sun, Hongyang},
  year = {2017},
  booktitle = {Proceedings of the 7th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS)},
  keywords = {mine,Workshop}
}

Benoit, Anne, Aurélien Cavelan, Valentin Le Fèvre, and Yves Robert. 2017. “Optimal Checkpointing Period with Replicated Execution on Heterogeneous Platforms.” In Proceedings of the 7th Workshop on Fault Tolerance for HPC at EXtreme Scale (FTXS).

@inproceedings{benoitEtAl2017optimal,
  title = {Optimal checkpointing period with replicated execution on heterogeneous platforms},
  author = {Benoit, Anne and Cavelan, Aur{\'e}lien and Le F{\`e}vre, Valentin and Robert, Yves},
  booktitle = {Proceedings of the 7th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS)},
  year = {2017},
  keywords = {mine,Workshop}
}

Future plans

There remains a lot to explore for workflow applications, consisting of tasks. We have so far focused only on duplication in this case, but one may want to consider different replication levels (duplication, triplication or more) to different tasks, depending upon their criticality in terms of longest paths, number of successors, etc. This may be even more important when considering a general directed acyclic graph of tasks, rather than restricting to linear chains of tasks. This topic is called partial replication, and even though it has been empirically studied by some previous work, designing an optimal strategy that combines partial redundancy and checkpointing and analyzing its efficacy remain to be done.

Finally, our initial goal was to target pipelined workflow applications, where data continuously enters the workflow, and where the objective is to maximize the throughput that can be achieved. This causes several new challenges that we hope to address in the future.

Former members

Aurélien Cavelan (INRIA), Valentin Le Fèvre (INRIA), Li Han (INRIA), Yishu Du (INRIA).

References

Han, Li, Louis-Claude Canon, Henri Casanova, Yves Robert, and Frédéric Vivien. 2018. “Checkpointing Workflows for Fail-Stop Errors.” IEEE Trans. Computers 67 (8): 1105–20.

@article{HanEtAl2018b,
  author = {Han, Li and Canon, Louis-Claude and Casanova, Henri and Robert, Yves and Vivien, Frédéric},
  journal = {IEEE Trans. Computers},
  volume = {67},
  pages = {1105-1120},
  title = {Checkpointing workflows for fail-stop errors},
  number = {8},
  year = {2018}
}

Han, Li, Valentin Le Fèvre, Louis-Claude Canon, Yves Robert, and Frédéric Vivien. 2018. “A Generic Approach to Scheduling and Checkpointing Workflows.” In ICPP2018, The 47th Int. Conf. on Parallel Processing. IEEE Computer Society Press.

@inproceedings{HanEtAl2018,
  author = {Han, Li and Fèvre, Valentin Le and Canon, Louis-Claude and Robert, Yves and Vivien, Frédéric},
  booktitle = {ICPP2018, the 47th Int. Conf. on Parallel Processing},
  title = {A Generic Approach to  Scheduling and Checkpointing Workflows},
  publisher = {{IEEE} Computer Society Press},
  year = {2018}
}

Casanova, Henri, Yves Robert, Frédéric Vivien, and Dounia Zaidouni. 2015. “On the Impact of Process Replication on Executions of Large-Scale Parallel Applications with Coordinated Checkpointing.” Future Generation Comp. Syst. 51: 7–19. https://doi.org/10.1016/j.future.2015.04.003.

@article{CasanovaEtAl2015,
  author = {Casanova, Henri and Robert, Yves and Vivien, Fr{\'{e}}d{\'{e}}ric and Zaidouni, Dounia},
  title = {On the impact of process replication on executions of large-scale
                 parallel applications with coordinated checkpointing},
  journal = {Future Generation Comp. Syst.},
  volume = {51},
  pages = {7--19},
  year = {2015},
  url = {http://dx.doi.org/10.1016/j.future.2015.04.003},
  doi = {10.1016/j.future.2015.04.003},
  timestamp = {Thu, 31 Mar 2016 15:45:29 +0200},
  biburl = {http://dblp.uni-trier.de/rec/bib/journals/fgcs/CasanovaRVZ15},
  bibsource = {dblp computer science bibliography, http://dblp.org}
}

Herault, Thomas, and Yves Robert. 2015. Fault-Tolerance Techniques for High-Performance Computing. 1st ed. Springer Publishing Company, Incorporated.

@book{HeraultEtAl2015,
  author = {Herault, Thomas and Robert, Yves},
  title = {Fault-Tolerance Techniques for High-Performance Computing},
  year = {2015},
  isbn = {3319209426, 9783319209425},
  edition = {1st},
  publisher = {Springer Publishing Company, Incorporated}
}

Bougeret, Marin, Henri Casanova, Yves Robert, Frédéric Vivien, and Dounia Zaidouni. 2014. “Using Group Replication for Resilience on Exascale Systems.” IJHPCA 28 (2): 210–24. https://doi.org/10.1177/1094342013505348.

@article{BougeretEtAl2014,
  author = {Bougeret, Marin and Casanova, Henri and Robert, Yves and Vivien, Fr{\'{e}}d{\'{e}}ric and Zaidouni, Dounia},
  title = {Using group replication for resilience on exascale systems},
  journal = {{IJHPCA}},
  volume = {28},
  number = {2},
  pages = {210--224},
  year = {2014},
  url = {http://dx.doi.org/10.1177/1094342013505348},
  doi = {10.1177/1094342013505348},
  timestamp = {Mon, 02 Jun 2014 09:36:01 +0200},
  biburl = {http://dblp.uni-trier.de/rec/bib/journals/ijhpca/BougeretCRVZ14},
  bibsource = {dblp computer science bibliography, http://dblp.org}
}

Ferreira, Kurt B., Jon Stearley, James H. Laros III, Ron Oldfield, Kevin T. Pedretti, Ron Brightwell, Rolf Riesen, Patrick G. Bridges, and Dorian C. Arnold. 2011. “Evaluating the Viability of Process Replication Reliability for Exascale Systems.” In Conference on High Performance Computing Networking, Storage and Analysis, SC 2011, Seattle, WA, USA, November 12-18, 2011, 44:1–44:12. https://doi.org/10.1145/2063384.2063443.

@inproceedings{FerreiraEtAl2011,
  author = {Ferreira, Kurt B. and Stearley, Jon and III, James H. Laros and Oldfield, Ron and Pedretti, Kevin T. and Brightwell, Ron and Riesen, Rolf and Bridges, Patrick G. and Arnold, Dorian C.},
  title = {Evaluating the viability of process replication reliability for exascale
                 systems},
  booktitle = {Conference on High Performance Computing Networking, Storage and Analysis,
                 {SC} 2011, Seattle, WA, USA, November 12-18, 2011},
  pages = {44:1--44:12},
  year = {2011},
  crossref = {DBLP:conf/sc/2011},
  url = {http://doi.acm.org/10.1145/2063384.2063443},
  doi = {10.1145/2063384.2063443},
  timestamp = {Tue, 30 Jun 2015 16:34:04 +0200},
  biburl = {http://dblp.uni-trier.de/rec/bib/conf/sc/FerreiraSLOPBRBA11},
  bibsource = {dblp computer science bibliography, http://dblp.org}
}