New Techniques to Design Silent Data Corruption Detectors

Research topic and goals

With the growing number of components and the power constraints of next-generation supercomputers, reliability is one of the major concerns that must be addressed to reach exascale. In addition to fail-stop failures, which cause application crashes, silent data corruption (SDC) is considered one of the most dangerous types of error, because it can make applications produce wrong results without any notice. SDCs are still not well understood: their frequency and occurrence patterns remain mostly unknown. Detecting SDCs is challenging because of their silent nature. Our objective is to develop novel software-level strategies that can detect most SDCs occurring in future supercomputers.

Results for 2015/2016

During 2015/2016 we tackled several challenges related to silent data corruption (Bautista-Gomez and Cappello 2015) and tested preliminary approaches (Di, Berrocal, and Cappello 2015). We developed a lightweight, adaptive, impact-driven detector for silent data corruption (Di and Cappello 2016). (1) We carefully characterized 18 HPC applications/benchmarks and discussed their runtime data features, as well as the impact of SDCs on their execution results. (2) We proposed an impact-driven detection model that does not blindly improve the prediction accuracy, but instead detects only influential SDCs to guarantee user-acceptable execution results. (3) Our solution adapts to dynamic prediction errors based on local runtime data and automatically tunes detection ranges to keep the false alarm rate low.

Another question we answered concerned the use of multiple SDC detectors with different characteristics. We designed a strategy that lets users decide which SDC detector to use and at which frequency, choosing among several SDC detectors with different performance, recall, and precision. This research was published as a full paper at HiPC 2015 (Bautista-Gomez et al. 2015), and an extended version with additional details and new results was subsequently published (Bautista-Gomez et al. 2016).

In addition, we investigated how to build SDC detectors based on support vector machines (SVM) and compared them with state-of-the-art SDC detectors. Our evaluation with multiple large-scale HPC applications shows that SVM is a good technique that can learn the behavior of the datasets and detect the vast majority of anomalies while imposing negligible overhead. Our technique performs better than existing ones in most cases. We published these results at CCGrid 2016 (Subasi et al. 2016).
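
As an illustration of the adaptive detection range mentioned in item (3) above, the following is a minimal sketch and not the published detector. It assumes a single data point tracked over time steps, a linear extrapolation from the two previous accepted values as the predictor, and a detection radius scaled from recent prediction errors. The class name, the `window` and `scale` parameters, and the `impact_bound` floor (standing in for a user-acceptable error) are all hypothetical.

```python
from collections import deque


class AdaptiveRangeDetector:
    """Hypothetical sketch: flag values that fall outside a predicted range.

    The prediction is a linear extrapolation from the last two accepted
    values; the detection radius adapts to recent prediction errors.
    """

    def __init__(self, window=8, scale=3.0, impact_bound=1e-6):
        self.history = deque(maxlen=2)      # last two accepted values
        self.errors = deque(maxlen=window)  # recent absolute prediction errors
        self.scale = scale                  # safety factor on the largest recent error
        self.impact_bound = impact_bound    # assumed user-acceptable deviation

    def check(self, value):
        """Return True if `value` looks corrupted, False otherwise."""
        if len(self.history) < 2:
            # Warm-up: not enough history to predict, so accept the value.
            self.history.append(value)
            return False

        prev, last = self.history
        predicted = last + (last - prev)
        error = abs(value - predicted)

        if self.errors:
            # Deviations below the impact bound are never reported.
            radius = max(self.impact_bound, self.scale * max(self.errors))
            if error > radius:
                # Suspicious value: do not let it pollute the history.
                return True

        self.errors.append(error)
        self.history.append(value)
        return False


if __name__ == "__main__":
    detector = AdaptiveRangeDetector()
    series = [0.0, 1.0, 2.1, 3.0, 4.2, 5.1, 600.0, 6.0]
    print([detector.check(v) for v in series])
```

In this sketch only the value 600.0 is flagged. Taking the largest recent error rather than the mean widens the range in noisy phases, which trades some recall for fewer false alarms; a real detector would tune this trade-off per application.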

Results for 2016/2017

During 2016/2017 we implemented a new SDC detection algorithm that dynamically leverages multiple regression and detection mechanisms in order to better adapt to the conditions of the execution. Indeed, as the data entropy changes over time and space (for instance, when turbulent regions move within the domain), dynamic SDC detection techniques are a much more robust way to adapt to such changes. Our evaluation shows that the proposed technique achieves a lower false positive rate and similar recall, while imposing a much lower memory overhead than state-of-the-art techniques. This work has been submitted to an international conference and is currently under review. The software prototype, called MACORD, is pending approval for open-source publication. In addition, several separate efforts have been made to test selective replication for dealing with SDC in HPC systems (Subasi et al. 2017), in particular for task-based programming languages (Subasi et al. 2016).
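
The idea of switching between several prediction mechanisms at run time can be sketched as follows. This is an illustrative assumption and not the MACORD implementation: it keeps three simple extrapolation predictors, tracks an exponentially smoothed absolute error for each, and reports whichever currently predicts best. The predictor set, the `decay` parameter, and all names are hypothetical.

```python
def last_value(h):
    """Predict that the next value equals the most recent one."""
    return h[-1]


def linear(h):
    """Linear extrapolation from the last two values."""
    return 2 * h[-1] - h[-2]


def quadratic(h):
    """Quadratic extrapolation from the last three values."""
    return 3 * h[-1] - 3 * h[-2] + h[-3]


PREDICTORS = {"last_value": last_value, "linear": linear, "quadratic": quadratic}


class DynamicPredictorSelector:
    """Hypothetical sketch: pick the predictor with the lowest recent error."""

    def __init__(self, decay=0.5):
        self.history = []                          # last three observed values
        self.score = {name: 0.0 for name in PREDICTORS}
        self.decay = decay                         # weight given to older errors

    def best(self):
        """Name of the predictor with the smallest smoothed error."""
        return min(self.score, key=self.score.get)

    def observe(self, value):
        """Update every predictor's smoothed error with a new observation."""
        if len(self.history) == 3:
            for name, predict in PREDICTORS.items():
                err = abs(value - predict(self.history))
                self.score[name] = (
                    self.decay * self.score[name] + (1 - self.decay) * err
                )
        self.history = (self.history + [value])[-3:]


if __name__ == "__main__":
    selector = DynamicPredictorSelector()
    for t in range(8):
        selector.observe(float(t * t))   # smooth, quadratically growing data
    print(selector.best())
```

A detector built on such a selector would use the currently best predictor to define its detection range, so that the range follows changes in the data behavior instead of relying on one fixed model.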

Visits and meetings

  • August 2nd, 2015 - November 6th, 2015: Omer Subasi's internship at ANL
  • July 4th, 2016 - July 8th, 2016: Franck Cappello's visit to BSC

Impact and publications

  1. Subasi, Omer, Gulay Yalcin, Ferad Zyulkyarov, Osman Unsal, and Jesus Labarta. 2017. “Designing and Modelling Selective Replication for Fault-Tolerant HPC Applications.” In 2017 IEEE International Conference on Cluster Cloud and Grid Computing (CCGrid’17). IEEE.
    @inproceedings{subasi2017rep,
      title = {Designing and Modelling Selective Replication for Fault-tolerant HPC Applications},
      author = {Subasi, Omer and Yalcin, Gulay and Zyulkyarov, Ferad and Unsal, Osman and Labarta, Jesus},
      booktitle = {2017 IEEE International Conference on Cluster Cloud and Grid Computing (CCGrid'17)},
      year = {2017},
      organization = {IEEE}
    }
    
  2. Di, Sheng, and Franck Cappello. 2016. “Adaptive-Impact Driven Detection of Silent Data Corruption for HPC Applications.” IEEE Transactions on Parallel and Distributed Computing. Phoenix, United States.
    @article{ShengEtCappello2016,
      address = {Phoenix, United States},
      author = {Di, Sheng and Cappello, Franck},
      journal = {IEEE Transactions on Parallel and Distributed Computing},
      title = {Adaptive-Impact Driven Detection of Silent Data Corruption for HPC Applications},
      year = {2016}
    }
    
  3. Subasi, Omer, Sheng Di, Leonardo Bautista-Gomez, Prasanna Balaprakash, Osman Unsal, Jesus Labarta, Adrian Cristal, and Franck Cappello. 2016. “Spatial Support Vector Regression to Detect Silent Errors in the Exascale Era.” In Proceedings of the 2016 IEEE/ACM International Symposium on Cluster Cloud and Grid Computing. IEEE.
    @inproceedings{SubasiEtAl2016,
      title = {Spatial Support Vector Regression to Detect Silent Errors in the Exascale Era},
      author = {Subasi, Omer and Di, Sheng and Bautista-Gomez, Leonardo and Balaprakash, Prasanna and Unsal, Osman and Labarta, Jesus and Cristal, Adrian and Cappello, Franck},
      booktitle = {Proceedings of the 2016 IEEE/ACM International Symposium on Cluster Cloud and
            Grid Computing},
      organization = {IEEE},
      year = {2016}
    }
    
  4. Bautista-Gomez, Leonardo, Anne Benoit, Aurélien Cavelan, Saurabh K Raina, Yves Robert, and Hongyang Sun. 2016. “Coping with Recall and Precision of Soft Error Detectors.” Journal of Parallel and Distributed Computing 98. Elsevier: 8–24.
    @article{bautista2016coping,
      title = {Coping with recall and precision of soft error detectors},
      author = {Bautista-Gomez, Leonardo and Benoit, Anne and Cavelan, Aur{\'e}lien and Raina, Saurabh K and Robert, Yves and Sun, Hongyang},
      journal = {Journal of Parallel and Distributed Computing},
      volume = {98},
      pages = {8--24},
      year = {2016},
      publisher = {Elsevier}
    }
    
  5. Subasi, Omer, Gulay Yalcin, Ferad Zyulkyarov, Osman Unsal, and Jesus Labarta. 2016. “A Runtime Heuristic to Selectively Replicate Tasks for Application-Specific Reliability Targets.” In 2016 IEEE International Conference on Cluster Computing (CLUSTER’16), 498–505. IEEE.
    @inproceedings{subasi2016run,
      title = {A runtime heuristic to selectively replicate tasks for application-specific reliability targets},
      author = {Subasi, Omer and Yalcin, Gulay and Zyulkyarov, Ferad and Unsal, Osman and Labarta, Jesus},
      booktitle = {2016 IEEE International Conference on Cluster Computing (CLUSTER'16)},
      pages = {498--505},
      year = {2016},
      organization = {IEEE}
    }
    
  6. Bautista-Gomez, Leonardo, Anne Benoit, Aurélien Cavelan, Saurabh K Raina, Yves Robert, and Hongyang Sun. 2015. “Which Verification for Soft Error Detection?” In Proceedings of the 24th International Conference on High-Performance Computing. IEEE.
    @inproceedings{BautEtAl2015b,
      title = {Which Verification for Soft Error Detection?},
      author = {Bautista-Gomez, Leonardo and Benoit, Anne and Cavelan, Aur{\'e}lien and Raina, Saurabh K and Robert, Yves and Sun, Hongyang},
      year = {2015},
      booktitle = {Proceedings of the 24th International Conference on High-Performance Computing},
      organization = {IEEE}
    }
    
  7. Bautista-Gomez, Leonardo Arturo, and Franck Cappello. 2015. “Detecting and Correcting Data Corruption in Stencil Applications through Multivariate Interpolation.” In Proceedings of the 2015 IEEE International Conference on Cluster Computing, 595–602. IEEE Computer Society.
    @inproceedings{BautEtAl2015,
      title = {Detecting and correcting data corruption in stencil applications through multivariate interpolation},
      author = {Bautista-Gomez, Leonardo Arturo and Cappello, Franck},
      booktitle = {Proceedings of the 2015 IEEE International Conference on Cluster Computing},
      pages = {595--602},
      year = {2015},
      organization = {IEEE Computer Society}
    }
    
  8. Di, Sheng, Eduardo Berrocal, and Franck Cappello. 2015. “An Efficient Silent Data Corruption Detection Method with Error-Feedback Control and Even Sampling for HPC Applications.” In 2015 IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid’15), 271–80. IEEE.
    @inproceedings{di2015detect,
      title = {An efficient silent data corruption detection method with error-feedback control and even sampling for HPC applications},
      author = {Di, Sheng and Berrocal, Eduardo and Cappello, Franck},
      booktitle = {2015 IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid'15)},
      pages = {271--280},
      year = {2015},
      organization = {IEEE}
    }
    

Person-Month efforts in 2016/2017

Leonardo Bautista Gomez (BSC) 6.0 PM
Prasanna Balaprakash (ANL) 0.5 PM
Anne Benoit (INRIA) 3.0 PM
Franck Cappello (ANL) 0.5 PM
Aurélien Cavelan (INRIA) 3.0 PM
Yves Robert (INRIA) 3.0 PM
Omer Subasi (BSC) 12.0 PM
Hongyang Sun (INRIA) 3.0 PM
Osman Unsal (BSC) 6.0 PM
Sheng Di (ANL) 0.5 PM

Future plans

The project will be suspended, as most of the partners have started working on different topics, some related to resilience and others in other domains. Some of these new interests are forming new collaboration projects within the JLESC.
