Advancing Chameleon and Grid'5000 testbeds II

Research topic and goals

Distributed digital infrastructures for computation and analytics are evolving towards an interconnected ecosystem that allows complex applications to be executed from IoT Edge devices to the HPC Cloud (aka the Computing Continuum). Understanding end-to-end performance in such a complex continuum is challenging. It boils down to reconciling many, typically contradictory, application requirements and constraints with low-level infrastructure design choices. One important challenge is to accurately reproduce relevant behaviors of a given application workflow and representative settings of the physical infrastructure underlying this complex continuum (Rosendo et al. 2022).

At Inria we introduced a rigorous methodology for such a process and validated it through E2Clab (Rosendo et al. 2020). It is the first platform to support the complete analysis cycle of an application on the Computing Continuum: (i) configuration of the experimental environment; (ii) mapping between the application parts and machines on the Edge-Fog-Cloud; (iii) deployment of the application on the infrastructure; (iv) automated execution; (v) application optimization (Rosendo et al. 2021); and (vi) gathering of experiment metrics.
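
The six stages listed above can be read as an ordered pipeline over a single experiment description. The following is a minimal sketch of that reading, not the E2Clab API: the `Experiment` record, the phase functions, and the per-layer count used as a metric are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

LAYERS = ("edge", "fog", "cloud")


@dataclass
class Experiment:
    """Hypothetical experiment record; not an E2Clab data structure."""
    name: str
    mapping: Dict[str, str]  # application part -> layer
    log: List[str] = field(default_factory=list)
    metrics: Dict[str, int] = field(default_factory=dict)


def configure(exp: Experiment) -> None:
    # (i) A real platform would reserve nodes and networks here.
    exp.log.append("configure")


def map_parts(exp: Experiment) -> None:
    # (ii) Reject application parts assigned to an unknown layer.
    for part, layer in exp.mapping.items():
        if layer not in LAYERS:
            raise ValueError(f"{part!r} mapped to unknown layer {layer!r}")
    exp.log.append("map")


def deploy(exp: Experiment) -> None:
    # (iii) Placeholder for installing the application parts.
    exp.log.append("deploy")


def execute(exp: Experiment) -> None:
    # (iv) Placeholder for the automated run.
    exp.log.append("execute")


def optimize(exp: Experiment) -> None:
    # (v) Placeholder for the application optimization step.
    exp.log.append("optimize")


def gather(exp: Experiment) -> None:
    # (vi) Stand-in metric: number of application parts per layer.
    for layer in LAYERS:
        exp.metrics[layer] = sum(1 for l in exp.mapping.values() if l == layer)
    exp.log.append("gather")


PHASES: List[Callable[[Experiment], None]] = [
    configure, map_parts, deploy, execute, optimize, gather,
]


def run_cycle(exp: Experiment) -> Experiment:
    """Run every phase in order; an error in one phase stops the cycle."""
    for phase in PHASES:
        phase(exp)
    return exp
```

Keeping the phases in one ordered list means the same description can be replayed step by step, which is the property the methodology relies on for repeatability.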

The main research goal of this project is to enable scientists to effectively reproduce and explore experiments run on the Grid'5000 platform by integrating it with the Jupyter environment and the Trovi portal (“Trovi: Practical Open Reproducibility” 2022). The idea behind integrating testbeds such as Chameleon (Keahey et al. 2020) and Grid'5000 (Bolze et al. 2006) with Trovi/Jupyter is to have an open-access repository of research artifacts that are visible and reproducible across testbeds.

The ultimate goal is to lower the barrier to reproducing research by combining the reproducible artifacts and the experimental environment. We will demonstrate how our Jupyter/Trovi approach to reproducibility helps scientists reproduce complex Edge-to-Cloud workflows across Chameleon, CHI@Edge, and Grid'5000.

Results for 2022/2023

This research work was carried out during the summer internship of Daniel Rosendo (INRIA) at ANL (July to September 2022) and was presented at the 14th JLESC Workshop.

We started this project by exploring the following research question: What are the limitations of existing collaborative environments? We investigated the main state-of-the-art environments, such as Google Colab, Kaggle, and Code Ocean. We observed that existing approaches lack support for: (1) access to heterogeneous resources (e.g., IoT/Edge devices); (2) practical reproducibility of experiments (e.g., it is hard to reproduce experiments on the exact same hardware since the resources vary over time); and (3) execution of experiments at large scale (e.g., users have to pay to access multiple machines).

Based on these limitations, we explored a second research question: What would a good collaborative system look like? In our vision, collaborative environments for enabling Computing Continuum research should provide three main features: (1) open access to research artifacts to allow other researchers to reproduce published experiments; (2) an interactive computing environment packaged with code, data, environment configurations, and experiment results; and (3) experiment methodologies exploring large-scale scientific testbeds.

Finally, we proposed and implemented our collaborative environment. It provides the following main features: (1) Trovi sharing portal: allows users to package code, data, environment configurations, and results and archive them in this portal, so that artifacts can be easily shared and found by other users; (2) Grid'5000 Jupyter environment: guides users to systematically define the experiment workflow and execute experiments; (3) our experiment methodology: abstracts the complexity of deploying applications on multiple testbeds (e.g., Grid'5000, Chameleon, FIT IoT lab, and CHI@Edge) and of repeating experiments on the same infrastructure.

We illustrate our collaborative environment with an Edge-to-Cloud experiment workflow deployed on multiple testbeds: Grid'5000 + FIT IoT lab, and Chameleon + CHI@Edge. The use case is a monitoring system in the African savanna in which various edge devices (Raspberry Pi devices available at FIT IoT lab / CHI@Edge) located in different regions take pictures of animals, perform some image preprocessing, and then send the images to the cloud server (available at Grid'5000 / Chameleon). On the cloud, a Deep Learning application identifies the animals. Our evaluations show that the collaborative environment is useful for reproducing experiments on large-scale platforms from the IoT/Edge to the HPC/Cloud Continuum. It helps users to: (1) systematically configure the experimental environment; (2) easily deploy distributed applications on multiple testbeds; (3) repeat experiments on the same testbed configurations; and (4) easily share code, data, environment, and results.

Results for 2023/2024

In 2023, the collaboration between INRIA and ANL continued principally via discussions on reproducibility as well as around an emergent common interest in edge computing.

On the reproducibility front, Daniel Rosendo presented a paper on practical reproducibility in Edge-to-Cloud experiments at the ACM REP conference on June 27, 2023 (Rosendo et al. 2023). The results presented in the paper were largely an outcome of the joint work performed in the prior year.

On the ANL side, we continued our investment in reproducibility topics by engaging a wide community of researchers and educators to advance the mission of popularizing the concept of practical reproducibility in Computer Science. This work was performed as part of the ANL-led REPETO project (“REPETO: Reimagining Experimentation - The Path to Replicable Science” 2024). We held four reproducibility hackathons in collaboration with major CS conferences: FAST 2023, ACM REP 2023, ATC/OSDI 2023, and IC2E 2023. We also organized two hackathons with the Chameleon community (one at the Chameleon User Meeting in May 2023 and a virtual one in December 2023). These events showed attendees how to package experiments for practical reproducibility and share them on Chameleon’s Trovi service so that others could reproduce them. Furthermore, we published a paper (Three Pillars of Practical Reproducibility) at the 2023 IEEE eScience ReWorDS Workshop, outlining the methodology and support needed for practical reproducibility (Keahey et al. 2023).

Over the summer of 2023, members of the REPETO project hosted the first Summer of Reproducibility program. The program offers summer internship opportunities for students and mentors who are interested in reproducibility for computer science research. It is modeled on Google Summer of Code: mentors propose a project and students apply for it. We particularly sponsor projects that package experiments to advance practical reproducibility, i.e., the idea that reproducibility can be a mainstream method of scientific exploration, similar to what reading papers is today. Those experiments can be replayed – and potentially modified and improved to propose and test new ideas – on Chameleon. The program provides funding for both US-based and international students and collaborations. In 2024, the REPETO project will continue to support the Summer of Reproducibility initiative and a call for projects for this summer is already underway.

Discussions on edge computing are still emerging, with INRIA and ANL pursuing separate investigations for the time being. The ANL team is working in the context of the CHI@Edge platform on Chameleon (“CHI@Edge” 2024) and the FLOTO project (Keahey et al. 2023). The INRIA team is focusing on two challenges: (1) efficient provenance data capture at the edge, for reproducibility purposes (Rosendo et al. 2023), and (2) enabling continual learning and federated learning at the edge, in the context of the ENGAGE project (“Engage Project” 2024), where initial results target the efficient deployment of such workloads on the edge-cloud continuum (Prigent et al. 2022) and securing learning in heterogeneous and volatile edge environments (Chelli et al. 2023).

Visits and meetings

We schedule regular meetings between the members of the project.

Daniel Rosendo (INRIA) visited ANL in the context of a Student Appointment during summer 2022 (July to September).

Impact and publications

  1. Rosendo, Daniel, Kate Keahey, Alexandru Costan, Matthieu Simonin, Patrick Valduriez, and Gabriel Antoniu. 2023. “KheOps: Cost-Effective Repeatability, Reproducibility, and Replicability of Edge-to-Cloud Experiments.” In Proceedings of the 2023 ACM Conference on Reproducibility and Replicability, 62–73. ACM REP ’23. New York, NY, USA: Association for Computing Machinery. doi:10.1145/3589806.3600032.
    @inproceedings{RosendoEtAl2023,
      author = {Rosendo, Daniel and Keahey, Kate and Costan, Alexandru and Simonin, Matthieu and Valduriez, Patrick and Antoniu, Gabriel},
      title = {KheOps: Cost-effective Repeatability, Reproducibility, and Replicability of Edge-to-Cloud Experiments},
      year = {2023},
      isbn = {9798400701764},
      publisher = {Association for Computing Machinery},
      address = {New York, NY, USA},
      url = {https://doi.org/10.1145/3589806.3600032},
      doi = {10.1145/3589806.3600032},
      booktitle = {Proceedings of the 2023 ACM Conference on Reproducibility and Replicability},
      pages = {62--73},
      numpages = {12},
      keywords = {Workflows, Reproducibility, Replicability, Repeatability, Edge Computing, Computing Continuum, Cloud Computing},
      location = {Santa Cruz, CA, USA},
      series = {ACM REP '23}
    }
    
  2. Rosendo, Daniel, Alexandru Costan, Patrick Valduriez, and Gabriel Antoniu. 2022. “Distributed Intelligence on the Edge-to-Cloud Continuum: A Systematic Literature Review.” Journal of Parallel and Distributed Computing. Elsevier.
    @article{DanielEtAl2022,
      title = {Distributed intelligence on the Edge-to-Cloud Continuum: A systematic literature review},
      author = {Rosendo, Daniel and Costan, Alexandru and Valduriez, Patrick and Antoniu, Gabriel},
      journal = {Journal of Parallel and Distributed Computing},
      year = {2022},
      publisher = {Elsevier}
    }
    
  3. Rosendo, Daniel, Alexandru Costan, Gabriel Antoniu, Matthieu Simonin, Jean-Christophe Lombardo, Alexis Joly, and Patrick Valduriez. 2021. “Reproducible Performance Optimization of Complex Applications on the Edge-to-Cloud Continuum.” In 2021 IEEE International Conference on Cluster Computing (CLUSTER), 23–34. IEEE.
    @inproceedings{DanielEtAl2021,
      title = {Reproducible performance optimization of complex applications on the edge-to-cloud continuum},
      author = {Rosendo, Daniel and Costan, Alexandru and Antoniu, Gabriel and Simonin, Matthieu and Lombardo, Jean-Christophe and Joly, Alexis and Valduriez, Patrick},
      booktitle = {2021 IEEE International Conference on Cluster Computing (CLUSTER)},
      pages = {23--34},
      year = {2021},
      organization = {IEEE}
    }
    
  4. Rosendo, Daniel, Pedro Silva, Matthieu Simonin, Alexandru Costan, and Gabriel Antoniu. 2020. “E2clab: Exploring the Computing Continuum through Repeatable, Replicable and Reproducible Edge-to-Cloud Experiments.” In 2020 IEEE International Conference on Cluster Computing (CLUSTER), 176–86. IEEE.
    @inproceedings{DanielEtAl2020,
      title = {E2clab: Exploring the computing continuum through repeatable, replicable and reproducible edge-to-cloud experiments},
      author = {Rosendo, Daniel and Silva, Pedro and Simonin, Matthieu and Costan, Alexandru and Antoniu, Gabriel},
      booktitle = {2020 IEEE International Conference on Cluster Computing (CLUSTER)},
      pages = {176--186},
      year = {2020},
      organization = {IEEE}
    }
    

Future plans

A first publication is in the works; we plan to submit the results to the ACM REP conference. As a continuation of our research work, we will explore the benefits of our collaborative environment from the point of view of both the authors and the readers of an article. In this work we target the practical reproducibility of experiments in a cost-effective way, which means reproducing the exact same experiment environment, hardware/software versions, network topology, and processing workflow, as well as the experiment results. The goal is to show that our approach facilitates the reproducibility of complex Edge-to-Cloud workflows on open testbeds in a cost-effective way.

For instance, authors want to configure the experimental infrastructure efficiently, without spending a lot of time satisfying all of the experiment’s requirements. They also want to share their experiment artifacts easily. Readers, in turn, want to perform the experiment, not just read about it. They want not just the “What” (What does the experiment do?), but also the “Why” (Why did the authors set it up that way?) and the “How” (How did the authors connect the machines/devices together?). Finally, they want finding and accessing the experiment to be as simple as finding and reading the article.
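
One way to make "the exact same experiment environment" checkable is to reduce an environment description to a stable fingerprint that authors publish and readers recompute. The sketch below is only an illustration under that assumption: the function name, the dictionary layout, and the example values are hypothetical and are not part of Trovi, E2Clab, or any testbed API.

```python
import hashlib
import json


def environment_fingerprint(env: dict) -> str:
    """Return a SHA-256 digest of a canonical JSON form of `env`.

    Keys are sorted so that two descriptions with the same content
    produce the same digest regardless of insertion order.
    """
    canonical = json.dumps(env, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Illustrative values only; they do not describe a real experiment.
authors_env = {
    "hardware": {"edge": "raspberry-pi", "cloud": "gpu-node"},
    "software": {"python": "3.10", "framework": "1.0"},
    "network": {"edge_to_cloud_delay_ms": 50},
    "workflow": ["capture", "preprocess", "send", "classify"],
}

# Same content written in a different key order by a reader.
readers_env = {
    "workflow": ["capture", "preprocess", "send", "classify"],
    "network": {"edge_to_cloud_delay_ms": 50},
    "software": {"framework": "1.0", "python": "3.10"},
    "hardware": {"cloud": "gpu-node", "edge": "raspberry-pi"},
}

same_environment = (
    environment_fingerprint(authors_env) == environment_fingerprint(readers_env)
)
```

A matching fingerprint only shows that the two descriptions agree; it says nothing about whether the testbed actually delivered what the description asks for.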

We will illustrate this with a real-life Edge-to-Cloud application workflow and study the performance trade-offs of cloud-only vs edge+cloud processing. We will demonstrate how the authors of an article may perform experiments on the Chameleon + CHI@Edge testbeds and then share their artifacts. We will then show how our collaborative environment also helps readers access the authors’ artifacts and reproduce the experiment results on different testbeds, such as Grid'5000 + FIT IoT lab.
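
The cloud-only vs edge+cloud trade-off mentioned above can be reasoned about with a back-of-the-envelope latency model. The sketch below is a deliberately simplified assumption, not a result or a model from this project: it treats per-image latency as transfer time plus processing time, and every parameter name and number is hypothetical.

```python
def transfer_seconds(size_mb: float, bandwidth_mbps: float) -> float:
    """Time to send `size_mb` megabytes over a `bandwidth_mbps` megabit/s link."""
    return size_mb * 8.0 / bandwidth_mbps


def cloud_only_latency(image_mb: float, bandwidth_mbps: float,
                       cloud_pre_s: float, infer_s: float) -> float:
    """Send the raw image, then preprocess and classify on the cloud."""
    return transfer_seconds(image_mb, bandwidth_mbps) + cloud_pre_s + infer_s


def edge_cloud_latency(image_mb: float, reduction: float, bandwidth_mbps: float,
                       edge_pre_s: float, infer_s: float) -> float:
    """Preprocess on the edge device, send the reduced image, classify on the cloud.

    `reduction` is the fraction of the original size left after preprocessing.
    """
    reduced_mb = image_mb * reduction
    return edge_pre_s + transfer_seconds(reduced_mb, bandwidth_mbps) + infer_s


# Illustrative numbers only: a 4 MB image over an 8 Mbit/s uplink.
cloud = cloud_only_latency(4.0, 8.0, cloud_pre_s=0.25, infer_s=0.5)
edge = edge_cloud_latency(4.0, 0.25, 8.0, edge_pre_s=1.0, infer_s=0.5)
```

With these made-up values, edge preprocessing pays off because the slower edge computation is outweighed by the smaller transfer; on a fast link the comparison can go the other way, which is why the trade-off has to be measured on real testbeds.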

References

  1. “REPETO: Reimagining Experimentation - The Path to Replicable Science.” 2024. https://repeto.cs.uchicago.edu/.
    @online{Repeto2024,
      addendum = {(accessed: 01.31.2024)},
      title = {{REPETO: Reimagining Experimentation - The Path to Replicable Science}},
      url = {https://repeto.cs.uchicago.edu/},
      year = {2024}
    }
    
  2. “CHI@Edge.” 2024. https://www.chameleoncloud.org/experiment/chiedge/.
    @online{ChiEdge2024,
      addendum = {(accessed: 01.31.2024)},
      title = {{CHI@Edge}},
      url = {https://www.chameleoncloud.org/experiment/chiedge/},
      year = {2024}
    }
    
  3. “Engage Project.” 2024. https://engage.inria.fr/.
    @online{Engage2024,
      addendum = {(accessed: 01.31.2024)},
      title = {{Engage Project}},
      url = {https://engage.inria.fr/},
      year = {2024}
    }
    
  4. Chelli, Melvin, Cédric Prigent, René Schubotz, Alexandru Costan, Gabriel Antoniu, Loïc Cudennec, and Philipp Slusallek. 2023. “FedGuard: Selective Parameter Aggregation for Poisoning Attack Mitigation in Federated Learning.” In 2023 IEEE International Conference on Cluster Computing (CLUSTER), 72–81. doi:10.1109/CLUSTER52292.2023.00014.
    @inproceedings{ChelliEtAl2023,
      author = {Chelli, Melvin and Prigent, Cédric and Schubotz, René and Costan, Alexandru and Antoniu, Gabriel and Cudennec, Loïc and Slusallek, Philipp},
      booktitle = {2023 IEEE International Conference on Cluster Computing (CLUSTER)},
      doi = {10.1109/CLUSTER52292.2023.00014},
      keywords = {Training;Federated learning;Computational modeling;Image edge detection;Cluster computing;Sensor systems and applications;Data models;federated learning;malicious peer detection;robust federated learning;adversarial attacks;generative models},
      pages = {72--81},
      title = {FedGuard: Selective Parameter Aggregation for Poisoning Attack Mitigation in Federated Learning},
      year = {2023}
    }
    
  5. Keahey, Kate, Jason Anderson, Mark Powers, and Adam Cooper. 2023. “Three Pillars of Practical Reproducibility.” In 2023 IEEE 19th International Conference on e-Science (e-Science), 1–6. doi:10.1109/e-Science58273.2023.10254846.
    @inproceedings{KeaheyEtAl2023,
      author = {Keahey, Kate and Anderson, Jason and Powers, Mark and Cooper, Adam},
      booktitle = {2023 IEEE 19th International Conference on e-Science (e-Science)},
      doi = {10.1109/e-Science58273.2023.10254846},
      keywords = {Computer science;Ecosystems;Buildings;Refining;Packaging;Information age;Reproducibility of results;reproducibility;infrastructure;scientific platforms;resource management},
      pages = {1--6},
      title = {Three Pillars of Practical Reproducibility},
      year = {2023}
    }
    
  6. Keahey, Kate, Nick Feamster, Guilherme Martins, Mark Powers, Marc Richardson, Alexis Schrubbe, and Michael Sherman. 2023. “Discovery Testbed: An Observational Instrument for Broadband Research.” In 2023 IEEE 19th International Conference on e-Science (e-Science), 1–4. doi:10.1109/e-Science58273.2023.10254876.
    @inproceedings{KeaheyEtAl2023b,
      author = {Keahey, Kate and Feamster, Nick and Martins, Guilherme and Powers, Mark and Richardson, Marc and Schrubbe, Alexis and Sherman, Michael},
      booktitle = {2023 IEEE 19th International Conference on e-Science (e-Science)},
      doi = {10.1109/e-Science58273.2023.10254876},
      keywords = {Computers;Instruments;Distributed databases;Data collection;Hardware;Broadband communication;Reliability;infrastructure;instruments;broadband;scientific platforms},
      pages = {1--4},
      title = {Discovery Testbed: An Observational Instrument for Broadband Research},
      year = {2023}
    }
    
  7. Rosendo, D., M. Mattoso, A. Costan, R. Souza, D. Pina, P. Valduriez, and G. Antoniu. 2023. “ProvLight: Efficient Workflow Provenance Capture on the Edge-to-Cloud Continuum.” In 2023 IEEE International Conference on Cluster Computing (CLUSTER), 221–33. Los Alamitos, CA, USA: IEEE Computer Society. doi:10.1109/CLUSTER52292.2023.00026.
    @inproceedings{RosendoEtAl2023b,
      address = {Los Alamitos, CA, USA},
      author = {Rosendo, D. and Mattoso, M. and Costan, A. and Souza, R. and Pina, D. and Valduriez, P. and Antoniu, G.},
      booktitle = {2023 IEEE International Conference on Cluster Computing (CLUSTER)},
      doi = {10.1109/CLUSTER52292.2023.00026},
      keywords = {protocols;memory management;key performance indicator;data compression;cluster computing;data models;performance analysis},
      month = nov,
      pages = {221--233},
      publisher = {IEEE Computer Society},
      title = {ProvLight: Efficient Workflow Provenance Capture on the Edge-to-Cloud Continuum},
      url = {https://doi.ieeecomputersociety.org/10.1109/CLUSTER52292.2023.00026},
      year = {2023}
    }
    
  8. “Trovi: Practical Open Reproducibility.” 2022. https://chameleoncloud.gitbook.io/trovi/.
    @online{ChameleonEtAl2022,
      addendum = {(accessed: 07.14.2022)},
      title = {{Trovi: Practical Open Reproducibility}},
      url = {https://chameleoncloud.gitbook.io/trovi/},
      year = {2022}
    }
    
  9. Prigent, Cédric, Alexandru Costan, Gabriel Antoniu, and Loïc Cudennec. 2022. “Supporting Efficient Workflow Deployment of Federated Learning Systems across the Computing Continuum.” SC 2022 - International Conference for High Performance Computing, Networking, Storage, and Analysis (Posters). https://inria.hal.science/hal-03878254.
    @misc{PrigentEtAl2022,
      author = {Prigent, C{\'e}dric and Costan, Alexandru and Antoniu, Gabriel and Cudennec, Lo{\"i}c},
      booktitle = {{SC 2022 - International Conference for High Performance Computing, Networking, Storage, and Analysis (Posters)}},
      keywords = {Computing Continuum ; Federated Learning ; Workflow ; Hyperparameter optimization},
      month = nov,
      note = {Poster},
      pdf = {https://inria.hal.science/hal-03878254/file/Poster.pdf},
      title = {{Supporting Efficient Workflow Deployment of Federated Learning Systems across the Computing Continuum}},
      url = {https://inria.hal.science/hal-03878254},
      year = {2022}
    }
    
  10. Keahey, Kate, Jason Anderson, Zhuo Zhen, Pierre Riteau, Paul Ruth, Dan Stanzione, Mert Cevik, et al. 2020. “Lessons Learned from the Chameleon Testbed.” In 2020 USENIX Annual Technical Conference (USENIX ATC 20), 219–33.
    @inproceedings{KateEtAl2020,
      author = {Keahey, Kate and Anderson, Jason and Zhen, Zhuo and Riteau, Pierre and Ruth, Paul and Stanzione, Dan and Cevik, Mert and Colleran, Jacob and Gunawi, Haryadi S and Hammock, Cody and others},
      booktitle = {2020 USENIX Annual Technical Conference (USENIX ATC 20)},
      pages = {219--233},
      title = {Lessons learned from the chameleon testbed},
      year = {2020}
    }
    
  11. Bolze, Raphaël, Franck Cappello, Eddy Caron, Michel Dayde, Frédéric Desprez, Emmanuel Jeannot, Yvon Jégou, et al. 2006. “Grid’5000: A Large Scale And Highly Reconfigurable Experimental Grid Testbed.” International Journal of High Performance Computing Applications 20 (4). SAGE Publications: 481–94. doi:10.1177/1094342006070078.
    @article{RaphaEtAl2006,
      author = {Bolze, Rapha{\"e}l and Cappello, Franck and Caron, Eddy and Dayde, Michel and Desprez, Fr{\'e}d{\'e}ric and Jeannot, Emmanuel and J{\'e}gou, Yvon and Lanteri, Stephane and Leduc, Julien and Melab, Nouredine and Mornet, Guillaume and Namyst, Raymond and Primet, Pascale and Qu{\'e}tier, Benjamin and Richard, Olivier and Talbi, El-Ghazali and Touche, Ir{\'e}a},
      doi = {10.1177/1094342006070078},
      hal_id = {hal-00684943},
      hal_version = {v1},
      journal = {{International Journal of High Performance Computing Applications}},
      number = {4},
      pages = {481--494},
      publisher = {{SAGE Publications}},
      title = {{Grid'5000: A Large Scale And Highly Reconfigurable Experimental Grid Testbed}},
      url = {https://hal.inria.fr/hal-00684943},
      volume = {20},
      year = {2006}
    }