Towards Continual Learning at Scale

Research topic and goals

During the past decade, deep learning (DL) has supported the shift from rule-based systems towards statistical models. Deep Neural Networks (DNNs) have revolutionized how we address problems in a wide range of applications by extracting patterns from complex, labelled datasets. Just as more powerful computers made it possible to design networks with vastly more neurons, ever-growing volumes of data act as a driving force for advances in this field. Bigger models and larger centralized datasets call for distributed strategies that leverage multiple compute nodes.

Most existing supervised learning algorithms operate under the assumptions that the data is (i) i.i.d. and (ii) fully available before the training process begins. However, these assumptions do not hold in many real-life scenarios, where static datasets are replaced by high-volume, high-velocity data streams generated over time by (sometimes geographically) distributed devices. Retraining models offline from scratch every time new data arrives is infeasible, as it would incur prohibitive time and resource costs. Moreover, typical DNNs suffer from catastrophic forgetting in this context: they reinforce new patterns at the expense of previously acquired knowledge (i.e., they become biased towards new samples). Memory replay methods have been shown to be effective in mitigating accuracy degradation in such settings, yet their performance is still far from that of oracles with full access to the static dataset. Continual Learning (CL) thus remains an open research problem.
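To make the replay idea concrete, the minimal sketch below shows experience replay with a reservoir-sampled buffer, in the spirit of the memory replay methods mentioned above. The names (ReplayBuffer, train_step) and the replay batch size are illustrative assumptions for this report, not the project's actual implementation.

    # Minimal experience replay sketch (PyTorch). ReplayBuffer and train_step
    # are hypothetical names used for illustration only.
    import random
    import torch

    class ReplayBuffer:
        """Fixed-capacity buffer filled with reservoir sampling."""
        def __init__(self, capacity):
            self.capacity = capacity
            self.samples = []   # list of (x, y) tensor pairs
            self.seen = 0       # number of stream samples observed so far

        def update(self, x, y):
            # Reservoir sampling: every sample seen so far is kept with equal probability.
            for xi, yi in zip(x, y):
                self.seen += 1
                if len(self.samples) < self.capacity:
                    self.samples.append((xi.clone(), yi.clone()))
                else:
                    j = random.randrange(self.seen)
                    if j < self.capacity:
                        self.samples[j] = (xi.clone(), yi.clone())

        def sample(self, k):
            k = min(k, len(self.samples))
            xs, ys = zip(*random.sample(self.samples, k))
            return torch.stack(xs), torch.stack(ys)

    def train_step(model, optimizer, criterion, buffer, new_x, new_y, replay_k=16):
        # Augment the incoming mini-batch with representative past samples.
        x, y = new_x, new_y
        if len(buffer.samples) > 0:
            rx, ry = buffer.sample(replay_k)
            x, y = torch.cat([new_x, rx]), torch.cat([new_y, ry])
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        buffer.update(new_x, new_y)   # only fresh stream samples enter the buffer
        return loss.item()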

Existing research typically addresses distributed DL and CL separately. At INRIA, we are interested in how CL methods can take advantage of data parallelism across nodes, which is one of the main techniques to achieve training scalability on HPC systems. The memory aggregated across compute nodes could improve the accuracy of such algorithms by hosting distributed replay buffers. The main research goals of this project are (i) the design and implementation of a distributed replay buffer that makes effective use of HPC systems, and (ii) the study of the trade-offs introduced by large-scale CL in terms of training time, accuracy and memory usage.

Results for 2021/2022

We kicked off this project in December 2021. We are studying techniques based on rehearsal, i.e., augmenting mini-batches with representative samples encountered earlier during training, to address the aforementioned challenges. The key novelty is how to adopt rehearsal in the context of data-parallel training. To this end, the goal is to design and implement a distributed rehearsal buffer that handles the selection of representative samples and the augmentation of mini-batches asynchronously in the background, as sketched below.
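The simplified illustration below, assuming PyTorch and the ReplayBuffer sketched earlier, conveys the idea on a single node: each data-parallel worker keeps its own buffer shard, augments its local mini-batches with replayed samples, and performs sample selection asynchronously in a background thread so that it overlaps with computation. The actual distributed implementation (inter-node communication, buffer placement, etc.) is not shown, and names such as ShardedRehearsalBuffer are hypothetical.

    # Per-rank rehearsal sketch for data-parallel training (illustrative only).
    # Reuses the hypothetical ReplayBuffer class sketched earlier.
    from concurrent.futures import ThreadPoolExecutor
    import torch

    class ShardedRehearsalBuffer:
        """Each rank keeps a shard of the global buffer and updates it in the background."""
        def __init__(self, capacity_per_rank):
            self.buffer = ReplayBuffer(capacity_per_rank)
            self.pool = ThreadPoolExecutor(max_workers=1)
            self.pending = None

        def update_async(self, x, y):
            # Offload sample selection so it overlaps with the next forward/backward pass.
            self.pending = self.pool.submit(self.buffer.update, x.detach().cpu(), y.detach().cpu())

        def augment(self, x, y, k):
            if self.pending is not None:
                self.pending.result()   # wait for the previous background update to finish
            if len(self.buffer.samples) == 0:
                return x, y
            rx, ry = self.buffer.sample(k)
            return torch.cat([x, rx.to(x.device)]), torch.cat([y, ry.to(y.device)])

    # Typical use inside a DistributedDataParallel training loop (one shard per rank):
    #   x_aug, y_aug = shard.augment(x, y, k=16)
    #   loss = criterion(model(x_aug), y_aug); loss.backward(); optimizer.step()
    #   shard.update_async(x, y)

Keeping one shard per rank is what allows the memory aggregated across many nodes to act as a single large replay buffer.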

Our first series of experiments focuses on evaluating the performance and scalability of our proposal for classification problems. We ran extensive experiments on up to 128 GPUs of the ThetaGPU supercomputer to compare our approach against baselines representative of training from scratch (the upper bound in terms of accuracy) and incremental training (the lower bound).

A publication with our insights is currently under review.

Results for 2023/2024

With a growing diversity of rehearsal techniques, it becomes important to decouple the rehearsal buffer from the learning task, turning it into a generic, reusable abstraction that can store additional state information as required by more advanced rehearsal-based CL algorithms. To this end, we propose a generalization of rehearsal buffers that supports both classification and generative learning tasks, as well as more advanced rehearsal strategies (notably dark experience replay, which leverages knowledge distillation; see the sketch below). We illustrate this approach with a real-life HPC streaming application from the domain of ptychographic image reconstruction.
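To illustrate why a generic buffer must store additional state, the sketch below shows a dark experience replay training step: the buffer stores past inputs together with the logits the model produced for them, and a distillation term pulls current predictions towards those stored logits. It reuses the hypothetical ReplayBuffer from above; the alpha weighting and the name der_step are illustrative assumptions, not the exact formulation used in our implementation.

    # Dark experience replay sketch: replayed samples carry stored logits,
    # used as soft targets for knowledge distillation (names are illustrative).
    import torch
    import torch.nn.functional as F

    def der_step(model, optimizer, buffer, x, y, alpha=0.5, replay_k=16):
        optimizer.zero_grad()
        logits = model(x)
        loss = F.cross_entropy(logits, y)
        if len(buffer.samples) > 0:
            rx, rlogits = buffer.sample(replay_k)   # past inputs + their stored logits
            loss = loss + alpha * F.mse_loss(model(rx), rlogits)
        loss.backward()
        optimizer.step()
        buffer.update(x, logits.detach())   # store logits instead of hard labels
        return loss.item()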

A journal publication with our insights is currently under review.

Visits and meetings

We hold regular video meetings among the members of the project.

Thomas Bouvier (INRIA) visited ANL in the context of a Student Appointment during summer 2022.

Impact and publications

None yet.

Future plans

References