Developer tools for porting and tuning parallel applications on extreme-scale parallel systems

Research topic and goals

Application developers targeting extreme-scale HPC systems such as Fugaku and JUPITER, heterogeneous systems such as MareNostrum, and modular supercomputing architectures such as JUWELS Cluster+Booster need effective tools to assist with porting and tuning for these unusual systems. The XcalableMP directive-based language and compilation system (Lee and Sato 2010) (Tsuji et al. 2013), the Scalasca/Score-P execution measurement and analysis tools (Geimer et al. 2010) (Knüpfer et al. 2012), which use SIONlib scalable file I/O (Frings, Wolf, and Petkov 2009), and the Paraver/Extrae/Dimemas measurement and analysis tools (“BSC Tools for Performance Analysis” 2017) are notable examples of tools developed by RIKEN, JSC and BSC for this purpose. This project proposes to extend their support for JLESC HPC systems and exploit their capabilities in an integrated workflow. Existing training material will be adapted to collaborators’ large-scale HPC systems, augmented with newly prepared material, and refined for better uptake based on participant evaluations and feedback. Travel and accommodation expenses of training presenters participating in joint training events (such as VI-HPS Tuning Workshops (“VI-HPS Tuning Workshop Series” 2016)) will be supported. Collaborative work with application developers will assess the effectiveness of the current (and revised) tools, and help direct development of new tool capabilities.

Results for 2014/2015

Initial planning.
XcalableMP training material translated into English, and Scalasca/Score-P training material translated into Japanese.

Results for 2015/2016

XcalableMP tutorial held at JSC (2015-12-01).
Organisation of 20th VI-HPS Tuning Workshop hosted by RIKEN AICS (2016/02/24-26) (“VI-HPS Tuning Workshop Series” 2016) covering tools from BSC (Paraver/Extrae/Dimemas) and JSC (Scalasca/Score-P/CUBE) on K computer and local Fujitsu FX10 (pi).
Performance analysis of RIKEN FIBER (“FIBER Mini-App Suite” 2016) mini-app NTChem on pi/K.
Performance analysis of ABySS-P (Kitayama, Wylie, and Maeda 2015) and NEST neuronal network simulation tool (“NEST Neural Simulation Tool” 2016) on K computer.

Results for 2016/2017

Specification of XMPT generic tool interface for XcalableMP PGAS runtime (based on OMPT).
Initial prototype implementation of XMPT interface in Omni XMP compiler, used by Extrae/Paraver.
Definition of POP standard metrics for MPI and OpenMP applications (“POP Standard Metrics for Parallel Performance Analysis” 2016).
Documentation of how to obtain POP standard metrics in Paraver (“Paraver Efficiencies Guide” 2016).
Calculation of POP standard metrics as derived metrics by CUBE.
Tools training for NERSC, DKRZ, IT4I, EPCC/Southampton and RWTH using local HPC systems.
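
As an illustration of the POP standard metrics mentioned in this list, the following minimal Python sketch computes the three top-level multiplicative efficiencies from per-process useful computation times. The function name pop_metrics, its arguments and the sample timings are hypothetical and are not taken from Paraver or CUBE. The sketch assumes the commonly published definitions: load balance is average over maximum useful time, communication efficiency is maximum useful time over total runtime, and parallel efficiency is their product.

```python
from statistics import mean


def pop_metrics(useful_times, runtime):
    """Compute top-level POP-style efficiencies (illustrative sketch only).

    useful_times: per-process time spent in useful computation (seconds),
                  i.e. outside MPI/OpenMP runtime calls (hypothetical input).
    runtime:      total wall-clock time of the measured region (seconds).
    """
    if not useful_times:
        raise ValueError("need at least one per-process useful time")
    if runtime <= 0:
        raise ValueError("runtime must be positive")

    avg_useful = mean(useful_times)
    max_useful = max(useful_times)
    if max_useful <= 0:
        raise ValueError("useful times must contain a positive value")

    # Load balance: how evenly useful work is spread over processes.
    load_balance = avg_useful / max_useful
    # Communication efficiency: share of runtime the busiest process computes.
    communication_efficiency = max_useful / runtime
    # Parallel efficiency is the product of the two factors above.
    parallel_efficiency = load_balance * communication_efficiency

    return {
        "load_balance": load_balance,
        "communication_efficiency": communication_efficiency,
        "parallel_efficiency": parallel_efficiency,
    }


if __name__ == "__main__":
    # Made-up timings for four processes in a 10-second region.
    metrics = pop_metrics([8.0, 6.0, 7.0, 7.0], runtime=10.0)
    for name, value in metrics.items():
        print(f"{name}: {value:.3f}")
```

Because the factors multiply, a low parallel efficiency can be traced to either imbalance or communication overhead by inspecting the two factors separately. In practice the per-process useful times would come from a Score-P or Extrae measurement rather than being entered by hand.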

Results for 2017/2018

Prototype implementation in Omni XMP compiler of XMPT events for detecting data races of coarrays and for profiling coarray programs.
Large-scale application performance measurements on JUQUEEN, K computer and Blue Waters.
Tools training at BSC (2017-04) and TERATEC (2017-10) using local HPC systems.

Results for 2018/2019

Omni XMP compiler updated with XMPT-defined runtime events, including for local/remote co-array accesses (“XMP Handbook” 2018).
BSC prototyping of the recording and analysis of additional XMPT events (Impact3D benchmark).
Initial Scalasca/Score-P performance measurements of JURECA-MSA (Cluster+Booster).
Prototype integration of POP metrics within CUBE GUI.
Tools training for LRZ, UCLondon and ROMEO/Reims using local HPC systems.
30th VI-HPS Tuning Workshop hosted by BSC (2019/01/21-25) using MN4 and CTE-POWER9 compute nodes.

Results for 2019/2020

Initial Score-P port to AMD Epyc Rome and Fujitsu Arm A64FX test systems (new compilers and MPI libraries).
Test measurements of NEST application on Fujitsu Arm A64FX test system.
Integration of POP metrics within CUBE GUI.
Tools training for GW4/Bristol and ANF/EoCoE using local HPC systems.
31st VI-HPS Tuning Workshop hosted by UTK-ICL (2019/04/09-12) using Stampede2.
33rd VI-HPS Tuning Workshop hosted by JSC (2019/06/24-28) using JURECA.

Results for 2020/2021

Initial Score-P porting to Fugaku.
Implementation of experimental POP metrics for hybrid MPI+OMP within CUBE GUI.
Virtual tools training for EPCC, HLRS, CINECA & CSC/Frankfurt using local HPC systems.

Results for 2021/2022

Continued Score-P porting to Fugaku.
Virtual tools training for NHR/Erlangen, POP & LRZ using local HPC systems.
Virtual tutorial on POP parallel performance analysis methodology given at ISC-HPC’21.

Results for 2022/2023

Continued Score-P porting to Fugaku and Fujitsu compilers for A64FX.
41st VI-HPS Tuning Workshop hosted virtually by JSC (2022/02/07-10) using JUWELS-Booster GPUs.
42nd VI-HPS Tuning Workshop hosted virtually by JSC (2022/05/17-19) using JUSUF.

Results for 2023/2024

Continued Score-P porting to Fugaku and Fujitsu compilers for A64FX with assistance of Jens Domke and Fujitsu.
Provision of Scalasca/Score-P on Fugaku using GCC & LLVM compilers for A64FX.
Hands-on tools tutorial at SC23 (Denver/CO, USA, 2023/11/13) using JUWELS-Booster GPUs.
EU-ASEAN HPC School (W.Java, Indonesia, 2023/12/11-16) Scalasca/Score-P exercises using Fugaku.
43rd VI-HPS Tuning Workshop hosted by CALMIP Mesocentre in Toulouse/France (2024/01/29-02/01) using Turpan (Ampere Altra Q80 ARM 8.2 CPU + Nvidia Ampere A100 GPU nodes).
Article prepared for FGCS on joint parallel application performance analysis/tools training.

Results for 2024/2025

Publication of FGCS article on joint parallel application performance analysis/tools training.
Hands-on performance tools/analysis tutorial at ISC-HPC’24 (Hamburg).
Virtual tools training for RWTH Aachen/TU Dresden, LRZ and EuroCC/IT4I using local HPC systems.
Installation of tools on EuroHPC computer systems, including MareNostrum5.
Extension of POP analysis methodology and metrics for GPU-accelerated application execution measurements.

Results for 2025/2026

Workshop article on Score-P support for Fugaku, JUPITER and other Arm-based systems (Feld, Corbin, and Wylie 2026).
Hands-on performance analysis/tools tutorial at ISC-HPC’25 (Hamburg).
Virtual tools training for U. Duisburg-Essen using their local HPC system.
Installation of tools on Fugaku, JUPITER, Deucalion and other EuroHPC HPC systems.
Improvement of POP analysis methodology and metrics for GPU-accelerated application execution measurements.

Visits and meetings

Face-to-face meetings at 3rd and subsequent JLESC Workshops, at ISC-HPC, SC and ParCo conferences, and events hosted by project partners. Meeting with MYX project (“Project MYX” 2016) members at ISC-HPC to discuss XMPT tools interface commonalities for correctness checking and performance analysis tools.

2015/12/01: RIKEN-AICS instructors visited JSC to deliver training with XcalableMP.
2016/02/24-26: BSC & JSC instructors visited RIKEN-AICS to deliver training as part of VI-HPS Tuning Workshop.
2019/04/09-12: BSC & JSC instructors visited UTK-ICL to deliver training as part of VI-HPS Tuning Workshop.
2022/10/01-2022/11/04: JSC visit to NCSA & UTK-ICL to prepare for training as part of VI-HPS Tuning Workshops.
Visits planned for the next 12 months: none for now.

Impact and publications

Joint development of the Scalasca and Paraver toolsets and associated training is summarised in the JLESC special issue of Future Generation Computer Systems (Wylie et al. 2025). POP standard metrics are applied in POP services performance analyses.

Wylie, Brian J.N., Judit Giménez, Christian Feld, Markus Geimer, Germán Llort, Sandra Mendez, Estanislao Mercadal, Anke Visser, and Marta García-Gasulla. 2025. “15+ Years of Joint Parallel Application Performance Analysis/Tools Training with Scalasca/Score-P and Paraver/Extrae Toolsets.” Future Generation Computer Systems 162: 107472. https://doi.org/10.1016/j.future.2024.07.050.

@article{Wylie2025,
  author = {Wylie, Brian J.N. and Gim{\'{e}}nez, Judit and Feld, Christian and Geimer, Markus and Llort, Germ{\'{a}}n and Mendez, Sandra and Mercadal, Estanislao and Visser, Anke and Garc{\'{i}}a-Gasulla, Marta},
  doi = {10.1016/j.future.2024.07.050},
  issn = {0167739X},
  journal = {Future Generation Computer Systems},
  publisher = {Elsevier},
  pages = {107472},
  title = {15+ years of joint parallel application performance analysis/tools training with {Scalasca/Score-P} and {Paraver/Extrae} toolsets},
  volume = {162},
  year = {2025}
}

Future plans

Use of Scalasca/Score-P and Paraver/Extrae to analyse execution performance of RIKEN applications. Large-scale application performance measurements on Fugaku and other HPC systems. Hackathon at JSC analysing and scaling applications on JUPITER. Workshops and training organised under the auspices of VI-HPS (“Virtual Institute – High Productivity Supercomputing” 2016) or the POP Centre of Excellence (“Performance Optimisation and Productivity: EU Centre of Excellence” 2015).

References

Feld, Christian, Gregor Corbin, and Brian J. N. Wylie. 2026. “Score-P with Arm(s) around the World ...” In Supercomputing Asia and International Conference on High Performance Computing in Asia Pacific Region Workshops (SCA/HPCAsiaWS). ACM. https://doi.org/10.1145/3784828.3785348.

@inproceedings{FeldEtAl2026,
  author = {Feld, Christian and Corbin, Gregor and Wylie, Brian J. N.},
  title = {{Score-P} with {Arm(s)} around the world ...},
  booktitle = {Supercomputing Asia and International Conference on High Performance Computing in Asia Pacific Region Workshops (SCA/HPCAsiaWS)},
  location = {Osaka, Japan},
  publisher = {ACM},
  year = {2026},
  month = jan,
  doi = {10.1145/3784828.3785348}
}

“XMP Handbook.” 2018. http://xcalablemp.org/handbook/.

@misc{XMPhandbook,
  title = {XMP Handbook},
  url = {http://xcalablemp.org/handbook/},
  year = {2018}
}

“BSC Tools for Performance Analysis.” 2017. http://tools.bsc.es/.

@misc{BSCtools,
  title = {BSC Tools for Performance Analysis},
  url = {http://tools.bsc.es/},
  year = {2017}
}

“FIBER Mini-App Suite.” 2016. http://fiber-miniapp.github.io/.

@misc{FIBER,
  title = {FIBER Mini-app Suite},
  url = {http://fiber-miniapp.github.io/},
  year = {2016}
}

“Project MYX.” 2016. http://doc.itc.rwth-aachen.de/display/CCP/Project+MYX.

@misc{MYXproject,
  title = {Project MYX},
  url = {http://doc.itc.rwth-aachen.de/display/CCP/Project+MYX},
  year = {2016}
}

“NEST Neural Simulation Tool.” 2016. http://www.nest-simulator.org/.

@misc{NEST,
  title = {NEST Neural Simulation Tool},
  url = {http://www.nest-simulator.org/},
  year = {2016}
}

“POP Standard Metrics for Parallel Performance Analysis.” 2016. https://pop-coe.eu/node/69.

@misc{POPmetrics2016,
  title = {POP Standard Metrics for Parallel Performance Analysis},
  url = {https://pop-coe.eu/node/69},
  year = {2016}
}

“Paraver Efficiencies Guide.” 2016. https://pop-coe.eu/sites/default/files/pop_files/paraverefficenciesguide.pdf.

@misc{POPmetParaver2016,
  title = {Paraver Efficiencies Guide},
  url = {https://pop-coe.eu/sites/default/files/pop_files/paraverefficenciesguide.pdf},
  year = {2016}
}

“Virtual Institute – High Productivity Supercomputing.” 2016. http://www.vi-hps.org/.

@misc{VIHPS,
  title = {Virtual Institute -- High Productivity Supercomputing},
  url = {http://www.vi-hps.org/},
  year = {2016}
}

“VI-HPS Tuning Workshop Series.” 2016. http://www.vi-hps.org/training/tws/.

@misc{VIHPSTWS,
  title = {VI-HPS Tuning Workshop Series},
  url = {http://www.vi-hps.org/training/tws/},
  year = {2016}
}

Kitayama, Itaru, Brian J. N. Wylie, and Toshiyuki Maeda. 2015. “Execution Performance Analysis of the ABySS Genome Sequence Assembler Using Scalasca on the K Computer.” In Proc. Int’l Conf. on Parallel Computing (ParCo, Edinburgh, Scotland). IOS Press. https://juser.fz-juelich.de/record/279895.

@inproceedings{KitayamaEtAl2015,
  author = {Kitayama, Itaru and Wylie, Brian J. N. and Maeda, Toshiyuki},
  booktitle = {Proc. Int'l Conf. on Parallel Computing (ParCo, Edinburgh, Scotland)},
  month = sep,
  publisher = {IOS Press},
  title = {Execution Performance Analysis of the {ABySS} Genome Sequence Assembler using {Scalasca} on the {K} computer},
  url = {https://juser.fz-juelich.de/record/279895},
  year = {2015}
}

“Performance Optimisation and Productivity: EU Centre of Excellence.” 2015. https://www.pop-coe.eu/.

@misc{POP,
  title = {Performance Optimisation and Productivity: EU Centre of Excellence},
  url = {https://www.pop-coe.eu/},
  year = {2015}
}

“VI-HPS Tools Guide.” 2015. http://www.vi-hps.org/upload/material/general/ToolsGuide.pdf.

@misc{VIHPS2015,
  title = {VI-HPS Tools Guide},
  url = {http://www.vi-hps.org/upload/material/general/ToolsGuide.pdf},
  month = oct,
  year = {2015}
}

Tsuji, Miwako, Mitsuhisa Sato, Maxime R. Hugues, and Serge G. Petiton. 2013. “Multiple-SPMD Programming Environment Based on PGAS and Workflow toward Post-Petascale Computing.” In 42nd International Conference on Parallel Processing, ICPP 2013, Lyon, France, October 1-4, 2013, 480–85. https://doi.org/10.1109/ICPP.2013.58.

@inproceedings{TsujiEtAl2013,
  author = {Tsuji, Miwako and Sato, Mitsuhisa and Hugues, Maxime R. and Petiton, Serge G.},
  bibsource = {dblp computer science bibliography, http://dblp.org},
  biburl = {http://dblp.uni-trier.de/rec/bib/conf/icpp/TsujiSHP13},
  booktitle = {42nd International Conference on Parallel Processing, {ICPP} 2013,
      Lyon, France, October 1-4, 2013},
  crossref = {DBLP:conf/icpp/2013},
  doi = {10.1109/ICPP.2013.58},
  timestamp = {Tue, 02 Dec 2014 17:13:28 +0100},
  title = {Multiple-SPMD Programming Environment Based on {PGAS} and Workflow
      toward Post-petascale Computing},
  pages = {480--485},
  url = {http://dx.doi.org/10.1109/ICPP.2013.58},
  year = {2013}
}

Knüpfer, A., C. Rössel, D. an Mey, S. Biersdorff, K. Diethelm, D. Eschweiler, M. Geimer, et al. 2012. “Score-P: A Joint Performance Measurement Run-Time Infrastructure for Periscope, Scalasca, TAU, and Vampir.” In Tools for High Performance Computing 2011, Proceedings of the 5th International Workshop on Parallel Tools for High Performance Computing (Dresden, September 2011). https://doi.org/10.1007/978-3-642-31476-6_7.

@inproceedings{KnuepferEtAl2012,
  author = {Kn{\"{u}}pfer, A. and R{\"{o}}ssel, C. and an Mey, D. and Biersdorff, S. and Diethelm, K. and Eschweiler, D. and Geimer, M. and Gerndt, M. and Lorenz, D. and Malony, A.D. and Nagel, W.E. and Oleynik, Y. and Philippen, P. and Saviankou, P. and Schmidl, D. and Shende, S.S. and Tsch{\"{u}}ter, R. and Wagner, M. and Wesarg, B. and Wolf, F.},
  booktitle = {Tools for High Performance Computing 2011, Proceedings of the 5th
      International Workshop on Parallel Tools for High Performance Computing (Dresden, September 2011)},
  cin = {JSC},
  cid = {I:(DE-Juel1)JSC-20090406},
  comment = {Tools for High Performance Computing 2011, Proceedings of the 5th International 
      Workshop on Parallel Tools for High Performance Computing, September 2011, Dresden},
  doi = {10.1007/978-3-642-31476-6_7},
  note = {Record converted from VDB: 12.11.2012},
  pid = {G:(DE-Juel1)FUEK411 / G:(DE-HGF)POF2-411},
  pnm = {Scientific Computing / 411 - Computational Science and Mathematical Methods 
      (POF2-411)},
  title = {Score-P: A Joint Performance Measurement Run-Time Infrastructure for Periscope,
      Scalasca, TAU, and Vampir},
  typ = {PUB:(DE-HGF)8 / PUB:(DE-HGF)7},
  url = {http://juser.fz-juelich.de/record/23267},
  year = {2012}
}

Geimer, Markus, Felix Wolf, Brian J. N. Wylie, Erika Ábrahám, Daniel Becker, and Bernd Mohr. 2010. “The Scalasca Performance Toolset Architecture.” Concurr. Comput. : Pract. Exper. 22 (6): 702–19. https://doi.org/10.1002/cpe.v22:6.

@article{GeimerEtAl2010,
  author = {Geimer, Markus and Wolf, Felix and Wylie, Brian J. N. and {\'{A}}brah{\'{a}}m, Erika and Becker, Daniel and Mohr, Bernd},
  acmid = {1753234},
  doi = {10.1002/cpe.v22:6},
  issn = {1532-0626},
  issue_date = {April 2010},
  journal = {Concurr. Comput. : Pract. Exper.},
  keywords = {parallel computing, performance analysis, scalability},
  month = apr,
  number = {6},
  numpages = {18},
  pages = {702--719},
  publisher = {John Wiley and Sons Ltd.},
  title = {The Scalasca Performance Toolset Architecture},
  url = {http://dx.doi.org/10.1002/cpe.v22:6},
  volume = {22},
  year = {2010}
}

Lee, Jinpil, and Mitsuhisa Sato. 2010. “Implementation and Performance Evaluation of XcalableMP: A Parallel Programming Language for Distributed Memory Systems.” In 39th International Conference on Parallel Processing, ICPP Workshops 2010, San Diego, California, USA, 13-16 September 2010, 413–20. https://doi.org/10.1109/ICPPW.2010.62.

@inproceedings{LeeSato2010,
  author = {Lee, Jinpil and Sato, Mitsuhisa},
  booktitle = {39th International Conference on Parallel Processing, {ICPP} Workshops
       2010, San Diego, California, USA, 13-16 September 2010},
  bibsource = {dblp computer science bibliography, http://dblp.org},
  biburl = {http://dblp.uni-trier.de/rec/bib/conf/icppw/LeeS10},
  doi = {10.1109/ICPPW.2010.62},
  pages = {413--420},
  timestamp = {Fri, 25 Jul 2014 14:09:13 +0200},
  title = {Implementation and Performance Evaluation of XcalableMP: {A} Parallel
      Programming Language for Distributed Memory Systems},
  url = {http://dx.doi.org/10.1109/ICPPW.2010.62},
  year = {2010}
}

Frings, Wolfgang, Felix Wolf, and Ventsislav Petkov. 2009. “Scalable Massively Parallel I/O to Task-Local Files.” In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 17:1–17:11. SC ’09. ACM. https://doi.org/10.1145/1654059.1654077.

@inproceedings{FringsEtAl2009,
  acmid = {1654077},
  articleno = {17},
  author = {Frings, Wolfgang and Wolf, Felix and Petkov, Ventsislav},
  booktitle = {Proceedings of the Conference on High Performance Computing Networking, Storage and 
      Analysis},
  doi = {10.1145/1654059.1654077},
  isbn = {978-1-60558-744-8},
  location = {Portland, Oregon},
  numpages = {11},
  pages = {17:1--17:11},
  publisher = {ACM},
  series = {SC '09},
  title = {Scalable Massively Parallel I/O to Task-local Files},
  url = {http://doi.acm.org/10.1145/1654059.1654077},
  year = {2009}
}