Hybrid resilience for MPI + tasks codes

Research topic and goals

This project aims to design a fault tolerance protocol for applications that adopt the hybrid OmpSs+MPI programming model.

Results for 2015/2016

The contributors introduced an extended version of NanoCheckpoints (Subasi et al. 2015) that provides a resilience solution for OmpSs+MPI applications. It handles a fault by rolling back and restarting the task in which the fault occurred. Thanks to message logging, it also transparently recovers tasks that contain MPI calls.
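The rollback-and-replay idea can be illustrated with a minimal Python sketch. It is not the NanoCheckpoints implementation or its API: the names `MessageLog`, `TaskFault` and `run_task` are hypothetical, a plain list stands in for an MPI receive, and logging on the receiving side is an assumption made only to keep the example short. A real protocol would also have to avoid re-sending messages during re-execution, which this sketch does not model.

```python
import copy


class TaskFault(Exception):
    """Hypothetical signal that a fault was detected inside a task."""


class MessageLog:
    """Records payloads a task has received so a re-execution can replay them."""

    def __init__(self):
        self.entries = []
        self.cursor = 0

    def recv(self, channel):
        if self.cursor < len(self.entries):
            # Re-execution: serve the payload from the log, not the network.
            payload = self.entries[self.cursor]
        else:
            # First execution: real receive (a list pop stands in for MPI_Recv).
            payload = channel.pop(0)
            self.entries.append(payload)
        self.cursor += 1
        return payload

    def rewind(self):
        self.cursor = 0


def run_task(body, inputs, channel, max_retries=3):
    """Run body on a copy of inputs; on a fault, roll back and re-execute."""
    snapshot = copy.deepcopy(inputs)  # checkpoint of the task inputs
    log = MessageLog()
    for _ in range(max_retries + 1):
        work = copy.deepcopy(snapshot)  # restore inputs from the checkpoint
        try:
            return body(work, log, channel)
        except TaskFault:
            log.rewind()  # replay already-received messages on the next attempt
    raise RuntimeError("task did not complete within the retry budget")


def make_faulty_body():
    """Build a task body that faults once, after its first receive."""
    state = {"calls": 0}

    def body(data, log, channel):
        data["x"] += log.recv(channel)
        state["calls"] += 1
        if state["calls"] == 1:
            raise TaskFault("injected fault")
        data["x"] += log.recv(channel)
        return data["x"]

    return body


channel = [10, 20]
result = run_task(make_faulty_body(), {"x": 1}, channel)
print(result)  # 31: the first message is replayed from the log, the second is received once
```

Because the inputs are restored from the snapshot and the first message is served from the log, the re-executed task sees exactly the same data as the failed attempt, and the rest of the program does not need to take part in the recovery.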

Evaluation of executions in the presence of faults showed that task granularity and coupling play a very important role in hiding task recovery: the more tasks that can execute independently while another task is recovering from a fault, the smaller the impact of faults on total execution time. However, if the program is poorly taskified, recovery of even a single task may slow it down significantly.
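The granularity effect can be made concrete with a toy model. The sketch below is an illustrative assumption, not the evaluation setup of the paper: it schedules fully independent, equally sized tasks greedily on identical workers and models a fault as one task running twice (the worst case, where the fault is detected at the end of the task). Coupling between tasks is not modelled, and the function names are hypothetical.

```python
import heapq


def makespan(task_times, workers):
    """Greedy list scheduling of independent tasks onto identical workers."""
    free_at = [0.0] * workers
    heapq.heapify(free_at)
    for t in task_times:
        start = heapq.heappop(free_at)  # earliest available worker
        heapq.heappush(free_at, start + t)
    return max(free_at)


def with_one_fault(total_work, n_tasks, workers, faulty_index=0):
    """Makespan when one task is rolled back and fully re-executed."""
    t = total_work / n_tasks
    times = [t] * n_tasks
    times[faulty_index] += t  # the faulty task costs twice its normal time
    return makespan(times, workers)


fault_free = makespan([16.0] * 4, 4)   # 16.0 time units on 4 workers
coarse = with_one_fault(64, 4, 4)      # 4 tasks of 16 units each
fine = with_one_fault(64, 64, 4)       # 64 tasks of 1 unit each
print(fault_free, coarse, fine)        # 16.0 32.0 17.0
```

With the same total work of 64 units on 4 workers, a single fault doubles the execution time in the coarse-grained case (32 versus 16 units), while in the fine-grained case the other workers absorb most of the re-execution and the run takes only 17 units.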

In summary, the contributions were:

  • A scalable fault tolerance protocol for hybrid task-parallel message passing applications that has a reasonable fault-free overhead.
  • An extended evaluation of both fault-free execution and execution with faults, with a discussion of the factors that may strongly affect execution in each case.

Visits and meetings

Tatiana V. Martsinkevich (INRIA) visited BSC for three months in summer 2015.

Impact and publications

A paper (Martsinkevich et al. 2015) was published in the proceedings of Cluster 2015 as part of the FTS 2015 workshop.

  1. Martsinkevich, Tatiana V., Omer Subasi, Osman S. Unsal, Franck Cappello, and Jesús Labarta. 2015. “Fault-Tolerant Protocol For Hybrid Task-Parallel Message-Passing Applications.” In 2015 IEEE International Conference On Cluster Computing, CLUSTER 2015, Chicago, IL, USA, September 8-11, 2015, 563–70. doi:10.1109/CLUSTER.2015.104.
    @inproceedings{MartsinkevichEtAl2015,
      author = {Martsinkevich, Tatiana V. and Subasi, Omer and Unsal, Osman S. and Cappello, Franck and Labarta, Jes{\'{u}}s},
      title = {Fault-Tolerant Protocol for Hybrid Task-Parallel Message-Passing Applications},
      booktitle = {2015 {IEEE} International Conference on Cluster Computing, {CLUSTER}
                     2015, Chicago, IL, USA, September 8-11, 2015},
      pages = {563--570},
      year = {2015},
      url = {http://dx.doi.org/10.1109/CLUSTER.2015.104},
      doi = {10.1109/CLUSTER.2015.104}
    }
    
  2. Subasi, O., J. Arias, O. Unsal, J. Labarta, and A. Cristal. 2015. “NanoCheckpoints: A Task-Based Asynchronous Dataflow Framework For Efficient and Scalable Checkpoint/Restart.” In 2015 23rd Euromicro International Conference On Parallel, Distributed and Network-Based Processing (PDP), 99–102. doi:10.1109/PDP.2015.17.
    @inproceedings{SubasiEtAl2015,
      author = {Subasi, O. and Arias, J. and Unsal, O. and Labarta, J. and Cristal, A.},
      booktitle = {2015 23rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)},
      title = {NanoCheckpoints: A Task-Based Asynchronous Dataflow Framework for Efficient and Scalable Checkpoint/Restart},
      year = {2015},
      pages = {99--102},
      doi = {10.1109/PDP.2015.17},
      issn = {1066-6192},
      month = mar
    }
    

Future plans

A journal submission based on this work is planned.

References