Postdoc: DataStates - Tracking, Versioning and Reuse of Intermediate Data
Contact: Bogdan Nicolae
Description: We are exploring a new data model centered around the notion of data states, which are intermediate representations of datasets automatically recorded into a lineage when tagged by applications with hints, constraints and persistency semantics. Such an approach enables the applications to focus on the meaning and properties of their data rather than how to access it, effectively reducing complexity while unlocking high performance and scalability for many use cases: finding and reusing previous intermediate results to explore alternatives, inspecting the evolution of datasets, verifying correctness, etc. This is especially important in the context of deep learning, where there is an acute need for advanced tools that explore many alternative DNN models and/or ensembles to improve accuracy, training speed and ability to generalize/explain a problem.
Link: https://bit.ly/3lcKCPM
Postdoc/PhD subcontract/internship: RRR (robustness, reconfigurability, reproducibility) for HPC+BD+AI workflows
Contact: Bogdan Nicolae
Description: We seek three properties (abbreviated RRR) for hybrid workflows composed of HPC, BD, and AI: robustness (gracefully handle failures and other unexpected events); reconfigurability (dynamically adapt to changing conditions at run time for optimal performance and scalability); and reproducibility (the ability to replay part or all of a workflow under similar conditions in order to verify the results). In this context, we are exploring novel techniques such as: exploiting relaxed inter-task dependencies to optimize checkpoint-restart strategies (e.g., no need to roll back if enough survivors are in a consistent state); workflow orchestration based on an up-to-date overview of behavioral patterns; and dynamic reproducibility based on capturing decision provenance.
Postdoc/PhD subcontract/internship: Deep Learning-oriented streams with revisit support
Description: Modern deep learning is not static: new training data is constantly arriving. However, DNN models cannot simply be trained incrementally because of catastrophic forgetting, i.e., bias in favor of newer samples at the expense of older ones. Therefore, old samples need to be persisted and revisited, a pattern that the storage solutions used by state-of-the-art approaches (e.g., parallel file systems) are not optimized for. In this context, one idea is to extract representative old samples and combine them with new samples so as to enable unbiased retraining. This position will explore techniques to achieve this goal while mitigating the I/O performance and scalability issues caused by the constant accumulation of new samples (which degrades I/O performance over time).
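To make the "representative old samples combined with new samples" idea concrete, here is a minimal in-memory sketch. It assumes reservoir sampling (Algorithm R) as the selection policy, which is a simple stand-in chosen for illustration and not the technique this project will necessarily use; the class `RehearsalBuffer` and its methods are hypothetical names. A real system would keep the retained samples in persistent storage, which is precisely where the I/O challenges above arise.

```python
import random
from typing import Generic, List, Optional, TypeVar

T = TypeVar("T")


class RehearsalBuffer(Generic[T]):
    """Hypothetical fixed-capacity buffer of representative old samples.

    Uses reservoir sampling (Algorithm R): after n samples have been seen,
    each of them is retained with probability capacity / n, so the buffer
    stays an unbiased sample of the whole stream regardless of its length.
    """

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        self.capacity = capacity
        self.seen = 0                 # total samples offered so far
        self.samples: List[T] = []    # retained old samples
        self._rng = random.Random(seed)

    def add(self, sample: T) -> None:
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(sample)
        else:
            j = self._rng.randrange(self.seen)  # uniform in [0, seen)
            if j < self.capacity:
                self.samples[j] = sample        # evict a random old sample

    def mixed_batch(self, new_samples: List[T], n_old: int) -> List[T]:
        """Combine fresh samples with up to n_old revisited old samples."""
        k = min(n_old, len(self.samples))
        old = self._rng.sample(self.samples, k)
        return list(new_samples) + old


# Usage: revisit old samples alongside each new batch, then record the new ones.
buf: RehearsalBuffer[str] = RehearsalBuffer(capacity=1000, seed=0)
for step in range(5):
    new = [f"s{step}-{i}" for i in range(64)]
    batch = buf.mixed_batch(new, n_old=32)  # training on `batch` is omitted
    for s in new:
        buf.add(s)
```

Sampling old data before inserting the new batch keeps each training batch from revisiting samples it already contains as "new".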
Postdoc in the context of the ECP project: AI-based prediction for data reduction and interference avoidance
Contact: Franck Cappello
Description: The combination of AI and HPC offers new opportunities to improve the performance of scientific applications. How AI can help accelerate HPC operations, and ultimately the execution of scientific applications, is an important question in this context. The postdoc position will explore how to leverage AI to accelerate data transformations and data management on Exascale systems. We list only two examples of potential directions here; other research directions are possible depending on the candidate's interests and skills. A first potential direction is how to leverage AI to improve data reduction. Recent attempts at using autoencoders for lossy data reduction are showing promising results. More exploration is needed to better understand and control the data reduction performance of AI models (ratio, speed, accuracy), potentially in combination with other data reduction techniques. Another potential direction is to optimize the scheduling of asynchronous data movement between the resources of Exascale systems. AI-based interference avoidance is promising here and has already shown positive results on relatively simple cases. How AI-based interference avoidance performs on more complex cases, and how to optimize it, remain open questions.
Postdoctoral Appointee – Computer Science – Tracing Heterogeneous APIs for Exascale
Contact: Brice Videau
Description: ALCF’s performance engineering group is looking for a postdoctoral appointee to perform research and development on a collection of tracers and their uses, in the context of the upcoming Exascale platforms, and Aurora in particular. By applying techniques derived from Model Centric Debugging, the candidate will collaborate with application developers and other Argonne computer scientists to improve the scope and usefulness of the tracing framework for heterogeneous computing APIs. The work will take place in a multidisciplinary environment and will offer opportunities to interact with a wide range of talents from the whole spectrum of HPC research. The successful candidate is expected to contribute to several of the following areas: profiling accelerator usage of HPC applications; debugging accelerator usage; capturing traces that can be reinjected into simulation frameworks; extracting kernels for replay, allowing study and tuning in a sandbox; and lightweight, transparent monitoring of platform usage.
About postdoc positions at Argonne National Laboratory
In addition to addressing the transformative challenges that arise at the intersection of HPC, big data analytics and machine learning, postdocs have the opportunity to work closely with many domain experts to identify the requirements and bottlenecks of real-life scientific applications that address the needs of our society over the next decades. You will be part of a vibrant and diverse research community with members from more than 100 countries. Our lab will host Aurora, one of the first Exascale supercomputers in the world, which you will have an opportunity to use for your experiments. In addition, you will have access to a large array of leading-edge experimental testbeds through the Joint Laboratory for System Evaluation (JLSE), which feature the latest technologies from top vendors such as Intel, NVIDIA, AMD and Cerebras.
About Argonne National Laboratory
As an equal employment opportunity and affirmative action employer, and in accordance with our core values of impact, safety, respect, integrity and teamwork, Argonne National Laboratory is committed to a diverse and inclusive workplace that fosters collaborative scientific discovery and innovation. In support of this commitment, Argonne encourages minorities, women, veterans and individuals with disabilities to apply for employment. Argonne considers all qualified applicants for employment without regard to age, ancestry, citizenship status, color, disability, gender, gender identity, genetic information, marital status, national origin, pregnancy, race, religion, sexual orientation, veteran status or any other characteristic protected by law. Argonne employees, and certain guest researchers and contractors, are subject to particular restrictions related to participation in Foreign Government Talent Recruitment Programs, as defined and detailed in United States Department of Energy Order 486.1. You will be asked to disclose any such participation in the application phase for review by Argonne’s Legal Department.