Towards accurate network utilization forecasting using portable MPI-level monitoring

Research topic and goals

The goal of this project is to study how careful monitoring of MPI communications can help forecast network traffic, so that congestion can be avoided when writing checkpoints. This work builds on the low-level monitoring interface implemented by Inria and UTK in Open MPI (Bosilca et al. 2017). We want to monitor application communication with this feature and, using time-series analysis and other techniques, predict the application's future network usage. With such predictions we will schedule the I/O accesses of VeloC (“Very Low Overhead transparent multilevel Checkpoint/restart”) to avoid interference between checkpoint writes to the storage system and the application's use of the network.
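
The idea of forecasting traffic and then placing checkpoint I/O in a quiet period can be illustrated with a minimal Python sketch. It is not the project's implementation: the seasonal-naive forecast stands in for whatever time-series technique is eventually used, and the function names, the fixed `period`, and the per-interval byte counts are all assumptions made for illustration.

```python
from typing import List, Tuple


def seasonal_naive_forecast(history: List[float], period: int, horizon: int) -> List[float]:
    """Baseline forecast (assumption): repeat the last observed period.

    `history` holds bytes sent per monitoring interval; `period` is the
    assumed length of one application iteration, in intervals.
    """
    if period <= 0 or len(history) < period:
        raise ValueError("need at least one full period of history")
    last_period = history[-period:]
    return [last_period[i % period] for i in range(horizon)]


def quietest_window(forecast: List[float], width: int) -> Tuple[int, float]:
    """Return (start index, predicted traffic) of the least-loaded window.

    A checkpoint flush lasting `width` intervals would be scheduled to
    start at the returned index (hypothetical scheduling policy).
    """
    if not 0 < width <= len(forecast):
        raise ValueError("width must be between 1 and len(forecast)")
    window_sum = sum(forecast[:width])
    best_start, best_sum = 0, window_sum
    # Slide the window one interval at a time, updating the sum incrementally.
    for start in range(1, len(forecast) - width + 1):
        window_sum += forecast[start + width - 1] - forecast[start - 1]
        if window_sum < best_sum:
            best_start, best_sum = start, window_sum
    return best_start, best_sum


# Made-up trace: three identical iterations with a communication lull in the middle.
history = [90, 80, 5, 10, 85, 95] * 3
forecast = seasonal_naive_forecast(history, period=6, horizon=6)
start, load = quietest_window(forecast, width=2)
```

With this made-up trace, the flush would be placed in the two intervals where the application is predicted to communicate least. A real scheduler would also need to account for forecast error and for the checkpoint's own bandwidth demand.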

Contributions:

  • A transparent application monitoring system within VeloC
  • A tool that predicts the network usage of the application
  • Strategies to avoid network interference between the application and VeloC

Results for 2019/2020

We have proposed a portable deep learning predictor that only uses the information available through MPI introspection to construct a recurrent sequence-to-sequence neural network capable of forecasting network utilization (Tseng et al. 2019).
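
Sequence-to-sequence forecasters are commonly trained on sliding windows that pair a stretch of past observations with the stretch that follows it. The sketch below shows only that data-preparation step. The window lengths, the function name, and the use of a single scalar series are assumptions; the actual features and network architecture are described in Tseng et al. (2019), not here.

```python
from typing import List, Sequence, Tuple


def make_windows(
    series: Sequence[float], input_len: int, output_len: int
) -> List[Tuple[List[float], List[float]]]:
    """Split a utilization series into (past, future) training pairs.

    Each pair holds `input_len` consecutive samples and the `output_len`
    samples that immediately follow them (illustrative layout only).
    """
    if input_len <= 0 or output_len <= 0:
        raise ValueError("window lengths must be positive")
    pairs = []
    last_start = len(series) - input_len - output_len
    for start in range(last_start + 1):
        split = start + input_len
        past = list(series[start:split])
        future = list(series[split:split + output_len])
        pairs.append((past, future))
    return pairs


# Made-up series of per-interval byte counts.
pairs = make_windows([1, 2, 3, 4, 5, 6], input_len=3, output_len=2)
```

Each pair would then be fed to the encoder (past) and used as the decoder target (future). A series shorter than `input_len + output_len` simply yields no pairs.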

Visits and meetings

Emmanuel Jeannot visited ANL in June 2019. Emmanuel Jeannot, George Bosilca, and Bogdan Nicolae met at SC19 in Denver.

Impact and publications

  1. Tseng, Shu-Mei, Bogdan Nicolae, George Bosilca, Emmanuel Jeannot, Aparna Chandramowlishwaran, and Franck Cappello. 2019. “Towards Portable Online Prediction of Network Utilization Using MPI-Level Monitoring.” In EuroPar’19: 25th International European Conference on Parallel and Distributed Systems. Goettingen, Germany. https://hal.inria.fr/hal-02184204.
    @inproceedings{tnb19+,
      title = {{Towards Portable Online Prediction of Network Utilization using MPI-level Monitoring}},
      author = {Tseng, Shu-Mei and Nicolae, Bogdan and Bosilca, George and Jeannot, Emmanuel and Chandramowlishwaran, Aparna and Cappello, Franck},
      url = {https://hal.inria.fr/hal-02184204},
      booktitle = {{EuroPar'19: 25th International European Conference on Parallel and Distributed Systems}},
      address = {Goettingen, Germany},
      year = {2019},
      month = aug,
      keywords = {Work stealing ; Prediction of resource utilization ; Timeseries forecasting ; Network monitoring ; Online learning},
      pdf = {https://hal.inria.fr/hal-02184204/file/paper.pdf},
      hal_id = {hal-02184204},
      hal_version = {v1}
    }
    

Future plans

References

  1. Bosilca, George, Clement Foyer, Emmanuel Jeannot, Guillaume Mercier, and Guillaume Papauré. 2017. “Online Dynamic Monitoring of MPI Communications.” In Euro-Par 2017: Parallel Processing - 23rd International Conference on Parallel and Distributed Computing. Springer, Cham.
    @inproceedings{Bosilca17online,
      title = {{Online Dynamic Monitoring of MPI Communications}},
      author = {Bosilca, George and Foyer, Clement and Jeannot, Emmanuel and Mercier, Guillaume and Papauré, Guillaume},
      booktitle = {Euro-Par 2017: Parallel Processing - 23rd International Conference on Parallel and Distributed Computing},
      publisher = {Springer, Cham},
      volume = {10417},
      pages = {49--62},
      year = {2017}
    }