Mitigating I/O Interference in Concurrent HPC Applications

  • Head
  • Dorier Matthieu (ANL)
  • Members
  • Antoniu Gabriel (INRIA)
  • Ibrahim Shadi (INRIA)
  • Yildiz Orcun (ANL)
  • Ross Rob (ANL)

Research topic and goals

With million-core supercomputers comes the problem of I/O interference between distinct applications that concurrently access a shared file system (Lofstead et al. 2010). Our work in this context is twofold. First, we aim to investigate and quantify this interference effect and to find its root causes. Second, we aim to mitigate I/O interference through novel approaches based on scheduling and on cross-application communication and coordination. In previous work, experiments done during Matthieu Dorier’s internship at ANL (2013) led to a better understanding of the I/O interference phenomenon and to the implementation of a prototype of the CALCioM approach, which currently includes three scheduling strategies. As a result of this work, a paper was published at IEEE IPDPS 2014 (Dorier et al. 2014).

Results for 2014/2015

Sub-goal 1:

Having exemplified the interference phenomenon on synthetic benchmarks, we are now interested in showing how often such interference occurs and the nature of the applications that are involved in this phenomenon. This investigation was done through the analysis of traces produced by the Darshan library on ANL’s Intrepid BlueGene/P system.
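
As a rough illustration of this kind of trace analysis, the sketch below finds pairs of jobs whose I/O activity overlaps in time, which is a necessary condition for cross-application interference. It does not use the Darshan API: the `IOWindow` record, its field names, and the sample values are hypothetical, and it assumes each job has already been reduced to a single I/O time window.

```python
from typing import List, NamedTuple, Tuple


class IOWindow(NamedTuple):
    """Hypothetical per-job I/O summary (not an actual Darshan record type)."""
    job_id: str
    start: float  # seconds since a common epoch
    end: float    # seconds since a common epoch


def overlapping_pairs(windows: List[IOWindow]) -> List[Tuple[str, str, float]]:
    """Return (job_a, job_b, overlap_seconds) for each pair of overlapping windows."""
    ordered = sorted(windows, key=lambda w: w.start)
    active: List[IOWindow] = []
    pairs: List[Tuple[str, str, float]] = []
    for w in ordered:
        # Keep only the windows that are still open when this one starts.
        active = [a for a in active if a.end > w.start]
        for a in active:
            # w starts no earlier than a, so the overlap begins at w.start.
            pairs.append((a.job_id, w.job_id, min(a.end, w.end) - w.start))
        active.append(w)
    return pairs


# Made-up sample values, for illustration only.
windows = [
    IOWindow("job-A", 0.0, 10.0),
    IOWindow("job-B", 5.0, 12.0),
    IOWindow("job-C", 20.0, 25.0),
]
print(overlapping_pairs(windows))
```

Temporal overlap alone does not prove interference; it only identifies candidate job pairs that would then need a closer look (for example, whether they target the same storage resources).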

Results:

We developed Darshan-Ruby and Darshan-Web (http://darshan-ruby.gforge.inria.fr). Darshan-Ruby is a Ruby wrapper to ANL’s Darshan library. Darshan-Web is a Web platform for online analysis of Darshan log files. This platform is based on Ruby on Rails, D3.js, and AJAX technologies. A demo is available here: http://darshan-web.irisa.fr.

Sub-goal 2:

Our second goal was to find a way to improve CALCioM by modeling and predicting I/O patterns. This prediction should be made at run time, with no prior knowledge of the application, and should converge toward an accurate model of the application’s I/O within a few iterations only.

Results:

To this end, we developed Omnisc’IO, an approach that leverages formal grammars to model and predict the I/O behavior of HPC applications. Omnisc’IO was evaluated with four real applications: CM1 (Bryan and Fritsch 2002), Nek5000 (Lottes et al. 2008), LAMMPS (2010), and GTC (2010). Our results led to a paper at SC’14 (Dorier et al. 2014).
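
To give a feel for run-time prediction with no prior knowledge of the application, here is a minimal sketch. It is not the Omnisc’IO algorithm and builds no grammar: it uses a much simpler stand-in, longest-suffix matching over the observed sequence of I/O operations. The function name `predict_next`, the `max_context` parameter, and the sample trace are all assumptions made for illustration.

```python
from typing import Hashable, List, Optional


def predict_next(history: List[Hashable], max_context: int = 8) -> Optional[Hashable]:
    """Predict the next symbol from the longest suffix of `history` seen earlier.

    Tries suffixes from longest to shortest; for the first suffix that also
    occurs earlier in the history, returns the symbol that followed its most
    recent earlier occurrence. Returns None if nothing matches.
    """
    n = len(history)
    for k in range(min(max_context, n - 1), 0, -1):
        suffix = history[n - k:]
        # Scan earlier positions, most recent first.
        for start in range(n - k - 1, -1, -1):
            if history[start:start + k] == suffix:
                return history[start + k]
    return None


# A made-up periodic trace: two full iterations, then the start of a third.
trace = ["open", "write", "write", "close"] * 2 + ["open"]
print(predict_next(trace))
```

With a periodic trace like this one, the predictor locks onto the pattern after a single full iteration, which mirrors the stated goal of converging within a few iterations. A grammar-based model additionally compresses the history into hierarchical rules, which this sketch makes no attempt to do.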

Results for 2015/2016

Sub-goal 1:

We continued maintaining and developing Darshan-Ruby in order to adapt it to the new Darshan 3 format.

Results:

The development of Darshan-Ruby for Darshan 3 was moved to Argonne (https://xgitlab.cels.anl.gov/darshan/darshan-ruby). A new tool called Quarshan was developed to efficiently query a large number of log files and perform operations on Darshan data.

Sub-goal 2:

Research efforts from the literature on mitigating I/O interference focus on a single potential cause of interference (e.g., the network). Yet the root causes of I/O interference can be diverse. In this research direction, we aim to better understand the root causes of I/O interference, and to propose new I/O scheduling techniques to solve the interference issue.

Results:

We conducted an extensive experimental campaign to explore the various root causes of I/O interference in HPC storage systems. We used microbenchmarks on the Grid’5000 testbed to evaluate how the applications’ access pattern, the network components, the file system’s configuration, and the backend storage devices influence I/O interference. The results of this campaign have been published at the IPDPS 2016 conference (Yildiz et al. 2016).
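
One simple way to compare such microbenchmark runs is a slowdown ratio: the time an application spends in I/O under contention divided by the time it spends when running alone. The sketch below assumes that metric; the function name and the timings are hypothetical and are not results from the campaign.

```python
def interference_factor(time_contended: float, time_alone: float) -> float:
    """Slowdown of an I/O phase under contention relative to running alone.

    A value of 1.0 means no measurable interference; 2.0 means the I/O phase
    took twice as long when another application was accessing the storage.
    """
    if time_alone <= 0:
        raise ValueError("time_alone must be positive")
    return time_contended / time_alone


# Made-up timings in seconds, for illustration only.
print(interference_factor(time_contended=30.0, time_alone=12.0))
```

Computing this ratio for each combination of access pattern, network setup, file system configuration, and storage backend makes it possible to see which component contributes most to the slowdown.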

Visits and meetings

  • June 2, 2014 - June 6: Rob Ross visited KerData in Rennes.
  • June 9, 2014 - June 11: 1st workshop of the JLESC held in Nice, France.
  • November 24, 2014 - November 26: Meetings for updates and planning were held during the 2nd JLESC workshop.
  • July - September 2015: Internship of Orcun Yildiz at Argonne National Laboratory.

Impact and publications

  1. Yildiz, Orcun, Matthieu Dorier, Shadi Ibrahim, Rob Ross, and Gabriel Antoniu. 2016. “On the Root Causes of Cross-Application I/O Interference in HPC Storage Systems.” In IPDPS - International Parallel and Distributed Processing Symposium. Chicago, United States. https://hal.inria.fr/hal-01270630.
    @inproceedings{YildizIPDPS2016,
      title = {{On the Root Causes of Cross-Application I/O Interference in HPC Storage Systems}},
      author = {Yildiz, Orcun and Dorier, Matthieu and Ibrahim, Shadi and Ross, Rob and Antoniu, Gabriel},
      url = {https://hal.inria.fr/hal-01270630},
      booktitle = {{IPDPS - International Parallel and Distributed Processing Symposium}},
      address = {Chicago, United States},
      year = {2016},
      month = may,
      keywords = {Exascale I/O ; Parallel File Systems ; Cross-Application Contention ; Interference},
      pdf = {https://hal.inria.fr/hal-01270630/file/IPDPS%2716-CR.pdf},
      hal_id = {hal-01270630},
      hal_version = {v1}
    }
    
  2. Dorier, Matthieu, Shadi Ibrahim, Gabriel Antoniu, and Rob Ross. 2014. “Omnisc’IO: A Grammar-Based Approach to Spatial and Temporal I/O Patterns Prediction.” In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 623–34. SC ’14. Piscataway, NJ, USA: IEEE Press. https://doi.org/10.1109/SC.2014.56.
    @inproceedings{DorierEtAl2014a,
      acmid = {2683662},
      address = {Piscataway, NJ, USA},
      author = {Dorier, Matthieu and Ibrahim, Shadi and Antoniu, Gabriel and Ross, Rob},
      booktitle = {Proceedings of the International Conference for High Performance Computing,
          Networking, Storage and Analysis},
      doi = {10.1109/SC.2014.56},
      isbn = {978-1-4799-5500-8},
      keywords = {HPC, I/O, Omnisc'IO, exascale, grammar, prediction, storage},
      location = {New Orleans, Louisiana},
      numpages = {12},
      pages = {623--634},
      publisher = {IEEE Press},
      series = {SC '14},
      title = {Omnisc'IO: A Grammar-based Approach to Spatial and Temporal I/O Patterns Prediction},
      year = {2014}
    }
    
  3. Dorier, Matthieu, Gabriel Antoniu, Robert Ross, Dries Kimpe, and Shadi Ibrahim. 2014. “CALCioM: Mitigating I/O Interference in HPC Systems through Cross-Application Coordination.” In IPDPS - International Parallel and Distributed Processing Symposium. Phoenix, United States. https://hal.inria.fr/hal-00916091.
    @inproceedings{DorierEtAl2014b,
      address = {Phoenix, United States},
      author = {Dorier, Matthieu and Antoniu, Gabriel and Ross, Robert and Kimpe, Dries and Ibrahim, Shadi},
      booktitle = {IPDPS - International Parallel and Distributed Processing Symposium},
      hal_id = {hal-00916091},
      hal_version = {v1},
      month = may,
      pdf = {https://hal.inria.fr/hal-00916091/file/CALCioM.pdf},
      title = {CALCioM: Mitigating I/O Interference in HPC Systems through Cross-Application
          Coordination},
      url = {https://hal.inria.fr/hal-00916091},
      year = {2014}
    }
    

References

  1. Lofstead, Jay, Fang Zheng, Qing Liu, Scott Klasky, Ron Oldfield, Todd Kordenbrock, Karsten Schwan, and Matthew Wolf. 2010. “Managing Variability in the IO Performance of Petascale Storage Systems.” In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, 1–12. SC ’10. Washington, DC, USA: IEEE Computer Society. https://doi.org/10.1109/SC.2010.32.
    @inproceedings{LofsteadEtAl2010,
      acmid = {1884679},
      address = {Washington, DC, USA},
      author = {Lofstead, Jay and Zheng, Fang and Liu, Qing and Klasky, Scott and Oldfield, Ron and Kordenbrock, Todd and Schwan, Karsten and Wolf, Matthew},
      booktitle = {Proceedings of the 2010 ACM/IEEE International Conference for High Performance
          Computing, Networking, Storage and Analysis},
      doi = {10.1109/SC.2010.32},
      isbn = {978-1-4244-7559-9},
      numpages = {12},
      pages = {1--12},
      publisher = {IEEE Computer Society},
      series = {SC '10},
      title = {Managing Variability in the IO Performance of Petascale Storage Systems},
      year = {2010}
    }
    
  2. 2010. http://phoenix.ps.uci.edu/GTC/.
    @misc{GTC2010,
      url = {http://phoenix.ps.uci.edu/GTC/},
      year = {2010}
    }
    
  3. 2010. http://lammps.sandia.gov/.
    @misc{LAMMPS2010,
      url = {http://lammps.sandia.gov/},
      year = {2010}
    }
    
  4. Lottes, James W., P. F. Fischer, and Stefan G. Kerkemeier. 2008. http://nek5000.mcs.anl.gov.
    @misc{LottesEtAL2008,
      author = {Lottes, James W. and Fischer, P. F. and Kerkemeier, Stefan G.},
      url = {http://nek5000.mcs.anl.gov},
      year = {2008}
    }
    
  5. Bryan, George H., and J. Michael Fritsch. 2002. “A Benchmark Simulation for Moist Nonhydrostatic Numerical Models.” https://doi.org/10.1175/1520-0493(2002)130%3C2917:ABSFMN%3E2.0.CO;2.
    @article{BryanFritsch2002,
      author = {Bryan, George H. and Fritsch, J. Michael},
      doi = {10.1175/1520-0493(2002)130<2917:ABSFMN>2.0.CO;2},
      title = {A Benchmark Simulation for Moist Nonhydrostatic Numerical Models},
      year = {2002}
    }