A Comprehensive Software Stack for Dynamic Resources Management.

  • Head
  • Iserte Sergio (BSC)
  • Members
  • Huber Dominik (INRIA)
  • Schreiber Martin (INRIA)
  • Dutot Pierre-François (INRIA)
  • Richard Olivier (INRIA)
  • Peña Antonio J. ()

Research topic and goals

Dynamic Resource Management (DRM) allows for dynamic changes of the resources assigned to a job during its execution. DRM has gained considerable interest over the last years as it could provide many benefits to providers of HPC systems and their users, such as improving energy efficiency and throughput.

However, DRM requires changes and interactions throughout the whole HPC software stack. This project aims to integrate different layers of that stack: the OAR resource and job management system (RJMS), the Dynamic Processes with PSets (DPP) approach, and the Dynamic Management of Resources (DMR) programming interface.

The OAR RJMS introduces the job envelope to support DRM. Whenever resources are requested or freed, a new RJMS job is created and linked to the initial job envelope, replacing the previous job. The result is a sequence of jobs, each with its own set of resources, that together form a virtual global dynamic job.
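The job envelope idea can be illustrated with a small data model. This is only a sketch under stated assumptions: the class names, the `reconfigure` method, and the node labels are hypothetical and are not part of OAR's actual interface. It only shows how a chain of fixed-size jobs linked to one envelope can be read as a single dynamic job.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Job:
    """One RJMS job with a fixed set of resources (hypothetical model)."""
    job_id: int
    nodes: frozenset


@dataclass
class JobEnvelope:
    """Links the sequence of jobs that together form one dynamic job."""
    history: List[Job] = field(default_factory=list)

    @property
    def current(self) -> Optional[Job]:
        # The most recent job replaces all earlier ones.
        return self.history[-1] if self.history else None

    def reconfigure(self, add=(), release=()) -> Job:
        """Create a new job with the updated resources and link it to the envelope."""
        nodes = set(self.current.nodes) if self.current else set()
        nodes |= set(add)
        nodes -= set(release)
        job = Job(job_id=len(self.history) + 1, nodes=frozenset(nodes))
        self.history.append(job)
        return job


envelope = JobEnvelope()
envelope.reconfigure(add={"n1", "n2"})    # initial allocation
envelope.reconfigure(add={"n3", "n4"})    # resources requested: the job grows
envelope.reconfigure(release={"n1"})      # resources freed: the job shrinks

for job in envelope.history:
    print(job.job_id, sorted(job.nodes))
```

Each call to the hypothetical `reconfigure` leaves the earlier jobs untouched and appends a new one, so the envelope's history records how the allocation changed over time.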

Dynamic Processes with PSets (DPP) (Huber et al. 2024) is a set of design paradigms, deduced from prior work, for generic dynamic resource support in parallel programming models. The DPP design paradigms are based on system-application co-design and aim for a flexible, programming-model-agnostic abstraction. As a proof of concept, the DPP paradigms have been realized in a prototype based on Open-MPI, OpenPMIx, and PRRTE.

The Dynamic Management of Resources (DMR) framework (Iserte et al. 2021) is a high-level API that facilitates the adoption of dynamicity in HPC codes. In particular, DMR can abstract different MPI dynamic solutions behind the same syntax. In an iterative code, for instance, DMR provides a series of operations around the main loop that keep the dynamic logic transparent to the user.
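The pattern of placing reconfiguration logic around the main loop of an iterative code can be sketched in plain Python, without MPI. This is an illustrative simulation, not DMR's API: the functions `split`, `reconfigure`, and `run`, and the `schedule` dictionary that stands in for the resource manager's decisions, are all assumptions made for this example.

```python
def split(data, parts):
    """Distribute data into `parts` near-equal contiguous blocks."""
    base, extra = divmod(len(data), parts)
    blocks, start = [], 0
    for rank in range(parts):
        size = base + (1 if rank < extra else 0)
        blocks.append(data[start:start + size])
        start += size
    return blocks


def reconfigure(blocks, new_size):
    """Gather the distributed state and redistribute it over new_size processes."""
    gathered = [x for block in blocks for x in block]
    return split(gathered, new_size)


def run(data, steps, schedule, initial_size):
    """Iterative code with a reconfiguration point at the top of each iteration.

    `schedule` maps an iteration number to a new process count; it stands in
    for the decision a resource manager would make.
    """
    blocks = split(data, initial_size)
    sizes = []
    for step in range(steps):
        new_size = schedule.get(step, len(blocks))
        if new_size != len(blocks):
            blocks = reconfigure(blocks, new_size)
        # "Compute" phase: each simulated process updates its own block.
        blocks = [[x + 1 for x in block] for block in blocks]
        sizes.append(len(blocks))
    result = [x for block in blocks for x in block]
    return result, sizes


result, sizes = run(list(range(8)), steps=4, schedule={1: 4, 3: 2}, initial_size=2)
print(sizes)   # process count used in each iteration
print(result)  # final state, independent of how it was distributed
```

The compute phase never mentions the process count; only the reconfiguration point does. That separation is the property the paragraph above describes, shown here in a single-process simulation.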

Once the DRM framework is in place, several scientific applications and benchmarks will be adapted to the dynamic resources paradigm and evaluated in terms of coding usability and reconfiguration performance.

This project aims to:

  • Integrate DPP and DMR in OAR.
  • Provide a friendly programming layer for the DRM software stack.
  • Create a collection of dynamic applications with a common interface.
  • Evaluate coding usability and performance of the new dynamic resources management approach compared to the current state-of-the-art.

Results for 2024/2025

  • The integration of DMR and DPP has been improved as part of a collaboration between UGA and BSC.
  • We plan to work on the common API during this term.

Visits and meetings

  • May to July 2024: Dominik Huber, research stay at UGA.
  • February 2025: Sergio Iserte, research stay at UGA.

Funding

  • Sergio Iserte (BSC) received BSC’s Severo Ochoa Mobility Grant for his one-month stay at UGA.
  • Martin Schreiber received travel support from Inria to attend JLESC meetings.

Impact and publications

(Tarraf et al. 2024) (Dutot et al. 2024)

Future plans

None yet.

References

  1. Huber, Dominik, Martin Schreiber, Martin Schulz, Howard Pritchard, and Daniel Holmes. 2024. “Design Principles of Dynamic Resource Management for High-Performance Parallel Programming Models.”
    @misc{huber2024design,
      title = {Design Principles of Dynamic Resource Management for High-Performance Parallel Programming Models},
      author = {Huber, Dominik and Schreiber, Martin and Schulz, Martin and Pritchard, Howard and Holmes, Daniel},
      year = {2024},
      eprint = {2403.17107},
      archiveprefix = {arXiv},
      primaryclass = {cs.DC}
    }
    
  2. Iserte, Sergio, Rafael Mayo, Enrique S. Quintana-Ortí, and Antonio J. Peña. 2021. “DMRlib: Easy-Coding and Efficient Resource Management for Job Malleability.” IEEE Transactions on Computers 70 (9): 1443–57. https://doi.org/10.1109/TC.2020.3022933.
    @article{iserte_dmrlib_2021,
      title = {{DMRlib}: {Easy}-{Coding} and {Efficient} {Resource} {Management} for {Job} {Malleability}},
      volume = {70},
      copyright = {All rights reserved},
      issn = {1557-9956},
      shorttitle = {{DMRlib}},
      url = {https://ieeexplore.ieee.org/document/9190024},
      doi = {10.1109/TC.2020.3022933},
      number = {9},
      urldate = {2024-01-23},
      journal = {IEEE Transactions on Computers},
      author = {Iserte, Sergio and Mayo, Rafael and Quintana-Ortí, Enrique S. and Peña, Antonio J.},
      month = sep,
      year = {2021},
      pages = {1443--1457},
    }
    
    Process malleability has proved to have a highly positive impact on the resource utilization and global productivity in data centers compared with the conventional static resource allocation policy. However, the non-negligible additional development effort this solution imposes has constrained its adoption by the scientific programming community. In this work, we present DMRlib, a library designed to offer the global advantages of process malleability while providing a minimalist MPI-like syntax. The library includes a series of predefined communication patterns that greatly ease the development of malleable applications. In addition, we deploy several scenarios to demonstrate the positive impact of process malleability featuring different scalability patterns. Concretely, we study two job submission modes (rigid and moldable) in order to identify the best-case scenarios for malleability using metrics such as resource allocation rate, completed jobs per second, and energy consumption. The experiments prove that our elastic approach may improve global throughput by a factor higher than 3x compared to the traditional workloads of non-malleable jobs.
  3. Tarraf, Ahmad, Martin Schreiber, Alberto Cascajo, Jean-Baptiste Besnard, Marc-André Vef, Dominik Huber, Sonja Happ, et al. 2024. “Malleability in Modern HPC Systems: Current Experiences, Challenges, and Future Opportunities.” IEEE Transactions on Parallel and Distributed Systems, June, 1–14. https://doi.org/10.1109/TPDS.2024.3406764.
    @article{tarraf_malleability_2024,
      title = {Malleability in Modern {HPC} Systems: Current Experiences, Challenges, and Future Opportunities},
      issn = {1558-2183},
      url = {https://ieeexplore.ieee.org/document/10541114},
      doi = {10.1109/TPDS.2024.3406764},
      shorttitle = {Malleability in Modern {HPC} Systems},
      pages = {1--14},
      journaltitle = {{IEEE} Transactions on Parallel and Distributed Systems},
      author = {Tarraf, Ahmad and Schreiber, Martin and Cascajo, Alberto and Besnard, Jean-Baptiste and Vef, Marc-André and Huber, Dominik and Happ, Sonja and Brinkmann, André and Singh, David E. and Hoppe, Hans-Christian and Miranda, Alberto and Peña, Antonio J. and Machado, Rui and Garcia-Gasulla, Marta and Schulz, Martin and Carpenter, Paul and Pickartz, Simon and Rotaru, Tiberiu and Iserte, Sergio and Lopez, Victor and Ejarque, Jorge and Sirwani, Heena and Wolf, Felix},
      urldate = {2024-05-31},
      date = {2024-06},
      keywords = {Resource management, Runtime, Throughput, {HPC}, Monitoring, Dynamic scheduling, Malleability, State-of-the-art, Survey, Systems support, Terminology},
    }
    
  4. Dutot, P., J. Fecht, K. Gaddameedi, D. Huber, S. Iserte, M. Minion, M. Schulz, et al. 2024. “Leveraging Dynamic Resource Management in HPC.” In ISC’24.
    @inproceedings{dutot_leveraging_2024,
      location = {Hamburg, Germany},
      title = {Leveraging Dynamic Resource Management in {HPC}},
      booktitle = {{ISC}'24},
      author = {Dutot, P. and Fecht, J. and Gaddameedi, K. and Huber, D. and Iserte, S. and Minion, M. and Schulz, M. and Schreiber, M. and Schüller, V. and Peña, A. J. and Richard, O.},
      date = {2024-06}
    }