Variable Capacity Scheduling

Research topic and goals

This project focuses on managing and allocating resources in an era of renewable power generation, where variation in weather and solar radiation drives the carbon intensity of power, and efforts to reduce environmental damage in turn create variation in compute capacity. From a computing perspective, the goal is to optimize resource utilization and minimize disruptions by dynamically adjusting job scheduling and resource allocation strategies in response to these capacity variations. Workloads are changing as well, with a growing understanding of how computing can be malleable, delay-flexible, or even acceptably approximate. From a societal perspective, the opportunity is to enhance sustainability efforts and to minimize the environmental impact of High Performance Computing.

Part of the difficulty of scheduling on variable resources comes from the many conflicting optimization objectives or metrics. Platform-oriented objectives include platform throughput or goodput (defined as the fraction of time that resources spend productively executing jobs, as opposed to sitting idle or computing jobs that will be interrupted). User-oriented objectives include minimization of response time, also called flow. For fairness, the flow is often scaled by the base execution time (without interruptions or checkpoints), which yields the metric commonly called stretch. In addition to these traditional metrics of efficiency and responsiveness, one must now take into account environmental issues (also known as non-performance attributes) such as e-waste, data-center water use, power, energy consumption, and carbon emissions, as well as network congestion and system heterogeneity. Capping and reducing brown energy consumption are becoming increasingly important, which involves lowering the power level bought on the fixed annual contract and intermittently increasing the power level using green energy and daily contracts. Furthermore, there is a growing need to shift from “fast” to “green” computing, which emphasizes energy efficiency and sustainability in addition to performance.
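
To make the two families of metrics concrete, the following Python sketch computes goodput for a platform and the scaled flow (commonly called stretch) for individual jobs. It is only an illustration under stated assumptions: the `Job` fields, the piecewise-constant `capacity_profile` representation, and the toy numbers are all hypothetical and do not come from the project itself.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # All field names are hypothetical, chosen for this illustration.
    release: float     # time the job was submitted
    completion: float  # time the job finally finished
    base_time: float   # execution time without interruptions or checkpoints


def flow(job: Job) -> float:
    """Response time: delay between submission and completion."""
    return job.completion - job.release


def stretch(job: Job) -> float:
    """Flow scaled by the base execution time (1.0 means no slowdown)."""
    return flow(job) / job.base_time


def available_node_time(capacity_profile: list[tuple[float, int]]) -> float:
    """Total node-time offered by a piecewise-constant capacity profile.

    Each entry is (duration, number of powered-on nodes).
    """
    return sum(duration * nodes for duration, nodes in capacity_profile)


def goodput(productive_node_time: float,
            capacity_profile: list[tuple[float, int]]) -> float:
    """Fraction of available node-time spent on work that was not lost."""
    total = available_node_time(capacity_profile)
    if total <= 0:
        raise ValueError("capacity profile offers no node-time")
    return productive_node_time / total


# Assumed toy scenario: 4 nodes for 10 time units, then 2 nodes for 10 more.
profile = [(10.0, 4), (10.0, 2)]
jobs = [Job(release=0.0, completion=12.0, base_time=6.0),
        Job(release=5.0, completion=20.0, base_time=10.0)]

print(goodput(45.0, profile))       # 45 productive units out of 60 available
print([stretch(j) for j in jobs])   # per-job scaled flow
```

In this toy scenario the two objectives can pull in different directions: a schedule may keep the platform busy (high goodput) while still delaying short jobs disproportionately (high stretch), which is one reason the metrics have to be considered together.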

This project aims to develop new models and algorithms for variable capacity scheduling.

References

Han, Li, Louis-Claude Canon, Henri Casanova, Yves Robert, and Frédéric Vivien. 2018. “Checkpointing Workflows for Fail-Stop Errors.” IEEE Trans. Computers 67 (8): 1105–20.

@article{HanEtAl2018b,
  author = {Han, Li and Canon, Louis-Claude and Casanova, Henri and Robert, Yves and Vivien, Frédéric},
  journal = {IEEE Trans. Computers},
  volume = {67},
  pages = {1105--1120},
  title = {Checkpointing workflows for fail-stop errors},
  number = {8},
  year = {2018}
}

Han, Li, Valentin Le Fèvre, Louis-Claude Canon, Yves Robert, and Frédéric Vivien. 2018. “A Generic Approach to Scheduling and Checkpointing Workflows.” In ICPP2018, The 47th Int. Conf. on Parallel Processing. IEEE Computer Society Press.

@inproceedings{HanEtAl2018,
  author = {Han, Li and {Le Fèvre}, Valentin and Canon, Louis-Claude and Robert, Yves and Vivien, Frédéric},
  booktitle = {ICPP2018, the 47th Int. Conf. on Parallel Processing},
  title = {A Generic Approach to Scheduling and Checkpointing Workflows},
  publisher = {{IEEE} Computer Society Press},
  year = {2018}
}

Casanova, Henri, Yves Robert, Frédéric Vivien, and Dounia Zaidouni. 2015. “On the Impact of Process Replication on Executions of Large-Scale Parallel Applications with Coordinated Checkpointing.” Future Generation Comp. Syst. 51: 7–19. https://doi.org/10.1016/j.future.2015.04.003.

@article{CasanovaEtAl2015,
  author = {Casanova, Henri and Robert, Yves and Vivien, Fr{\'{e}}d{\'{e}}ric and Zaidouni, Dounia},
  title = {On the impact of process replication on executions of large-scale
                 parallel applications with coordinated checkpointing},
  journal = {Future Generation Comp. Syst.},
  volume = {51},
  pages = {7--19},
  year = {2015},
  url = {http://dx.doi.org/10.1016/j.future.2015.04.003},
  doi = {10.1016/j.future.2015.04.003},
  timestamp = {Thu, 31 Mar 2016 15:45:29 +0200},
  biburl = {http://dblp.uni-trier.de/rec/bib/journals/fgcs/CasanovaRVZ15},
  bibsource = {dblp computer science bibliography, http://dblp.org}
}

Herault, Thomas, and Yves Robert. 2015. Fault-Tolerance Techniques for High-Performance Computing. 1st ed. Springer Publishing Company, Incorporated.

@book{HeraultEtAl2015,
  author = {Herault, Thomas and Robert, Yves},
  title = {Fault-Tolerance Techniques for High-Performance Computing},
  year = {2015},
  isbn = {3319209426, 9783319209425},
  edition = {1st},
  publisher = {Springer Publishing Company, Incorporated}
}

Bougeret, Marin, Henri Casanova, Yves Robert, Frédéric Vivien, and Dounia Zaidouni. 2014. “Using Group Replication for Resilience on Exascale Systems.” IJHPCA 28 (2): 210–24. https://doi.org/10.1177/1094342013505348.

@article{BougeretEtAl2014,
  author = {Bougeret, Marin and Casanova, Henri and Robert, Yves and Vivien, Fr{\'{e}}d{\'{e}}ric and Zaidouni, Dounia},
  title = {Using group replication for resilience on exascale systems},
  journal = {{IJHPCA}},
  volume = {28},
  number = {2},
  pages = {210--224},
  year = {2014},
  url = {http://dx.doi.org/10.1177/1094342013505348},
  doi = {10.1177/1094342013505348},
  timestamp = {Mon, 02 Jun 2014 09:36:01 +0200},
  biburl = {http://dblp.uni-trier.de/rec/bib/journals/ijhpca/BougeretCRVZ14},
  bibsource = {dblp computer science bibliography, http://dblp.org}
}

Ferreira, Kurt B., Jon Stearley, James H. Laros III, Ron Oldfield, Kevin T. Pedretti, Ron Brightwell, Rolf Riesen, Patrick G. Bridges, and Dorian C. Arnold. 2011. “Evaluating the Viability of Process Replication Reliability for Exascale Systems.” In Conference on High Performance Computing Networking, Storage and Analysis, SC 2011, Seattle, WA, USA, November 12-18, 2011, 44:1–44:12. https://doi.org/10.1145/2063384.2063443.

@inproceedings{FerreiraEtAl2011,
  author = {Ferreira, Kurt B. and Stearley, Jon and Laros, III, James H. and Oldfield, Ron and Pedretti, Kevin T. and Brightwell, Ron and Riesen, Rolf and Bridges, Patrick G. and Arnold, Dorian C.},
  title = {Evaluating the viability of process replication reliability for exascale
                 systems},
  booktitle = {Conference on High Performance Computing Networking, Storage and Analysis,
                 {SC} 2011, Seattle, WA, USA, November 12-18, 2011},
  pages = {44:1--44:12},
  year = {2011},
  crossref = {DBLP:conf/sc/2011},
  url = {http://doi.acm.org/10.1145/2063384.2063443},
  doi = {10.1145/2063384.2063443},
  timestamp = {Tue, 30 Jun 2015 16:34:04 +0200},
  biburl = {http://dblp.uni-trier.de/rec/bib/conf/sc/FerreiraSLOPBRBA11},
  bibsource = {dblp computer science bibliography, http://dblp.org}
}