Contributors: Data Aware Large Scale Computing (DATAMOVE); Inria Grenoble - Rhône-Alpes; Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire d'Informatique de Grenoble (LIG); Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP); Université Grenoble Alpes (UGA)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP); Université Grenoble Alpes (UGA); Universidade Federal do ABC = Federal University of ABC = Université Fédérale de l'ABC Brazil (UFABC); Système d’exploitation, systèmes répartis, de l’intergiciel à l’architecture (IRIT-SEPIA); Institut de recherche en informatique de Toulouse (IRIT); Université Toulouse Capitole (UT Capitole); Université de Toulouse (UT)-Université de Toulouse (UT)-Université Toulouse - Jean Jaurès (UT2J); Université de Toulouse (UT)-Université Toulouse III - Paul Sabatier (UT3); Université de Toulouse (UT)-Centre National de la Recherche Scientifique (CNRS)-Institut National Polytechnique (Toulouse) (Toulouse INP); Université de Toulouse (UT)-Toulouse Mind & Brain Institut (TMBI); Université Toulouse - Jean Jaurès (UT2J); Université de Toulouse (UT)-Université de Toulouse (UT)-Université Toulouse III - Paul Sabatier (UT3); Université de Toulouse (UT)-Université Toulouse Capitole (UT Capitole); Université de Toulouse (UT); Carretero, J., Shende, S., Garcia-Blas, J., Brandic, I., Olcoz, K., Schreiber, M.; ANR-19-P3IA-0003,MIAI,MIAI @ Grenoble Alpes(2019); ANR-18-CE25-0008,Energumen,Optimiser l'énergie des plates-formes de calcul à large échelle(2018); ANR-22-EXNU-0003,Exa-Soft,High Performance Computing software and tools(2022); ANR-23-PECL-0003,CARECloud,Comprendre, Améliorer, Réduire les impacts Environnementaux du Cloud computing(2023); European Project: 956560,REGALE; European Project: LIGHTAIGDE
Abstract: International audience; With the growing demand for computing resources and the struggle to provide the necessary energy, power-aware resource management is becoming a major issue for the high-performance computing (HPC) community. Integrating reliable energy management into a supercomputer's resource and job management system (RJMS) is not an easy task. The energy consumption of jobs is rarely known in advance, and each machine's workload is unique. We argue that the first step toward properly managing energy is to deeply understand the energy consumption of the workload, which involves predicting the workload's power consumption and exploiting that prediction with smart power-aware scheduling algorithms. Crucial questions are (i) how sophisticated a prediction method needs to be to provide accurate workload power predictions, and (ii) to what extent an accurate workload power prediction translates into efficient energy management. In this work, we propose a method to predict and exploit the power consumption of HPC workloads, with the objective of reducing the supercomputer's power consumption while maintaining the management (scheduling) performance of the RJMS. Our method exploits workload submission logs together with power monitoring data, and relies on a mix of lightweight power prediction methods and a heuristic inspired by classical EASY Backfilling. We base this study on logs of Marconi 100, a 980-server supercomputer. We show through simulation that a lightweight history-based prediction method can provide sufficiently accurate power predictions to improve the energy management of a large-scale supercomputer compared to energy-unaware scheduling algorithms. These improvements have no significant negative impact on performance.
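To make the idea of a lightweight history-based prediction feeding a power-aware scheduler more concrete, the following Python sketch may help. It is an illustration under stated assumptions, not the authors' implementation: the class and function names, the per-user sliding window, the default power value, and the greedy admission pass are all hypothetical. In particular, the pass below only checks node and power budgets in queue order and omits the head-of-queue reservation that EASY Backfilling makes.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from statistics import mean


@dataclass
class Job:
    job_id: int
    user: str
    nodes: int


class HistoryPowerPredictor:
    """Hypothetical predictor: mean per-node power of a user's recent jobs."""

    def __init__(self, window: int = 5, default_watts: float = 1000.0):
        # default_watts is an arbitrary placeholder for users with no history.
        self.default_watts = default_watts
        self.history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, user: str, watts_per_node: float) -> None:
        # Record the measured mean per-node power of a finished job.
        self.history[user].append(watts_per_node)

    def predict(self, job: Job) -> float:
        # Predicted total power of the job, in watts.
        past = self.history.get(job.user)
        per_node = mean(past) if past else self.default_watts
        return per_node * job.nodes


def select_jobs(queue, free_nodes, power_budget, predictor):
    """Greedy pass: start each queued job that fits both free nodes
    and the remaining power budget (simplified, no reservations)."""
    started = []
    for job in queue:
        predicted = predictor.predict(job)
        if job.nodes <= free_nodes and predicted <= power_budget:
            started.append(job)
            free_nodes -= job.nodes
            power_budget -= predicted
    return started
```

A predictor of this kind needs only the submission log and past power measurements, which is what makes it cheap enough to run inside a scheduler; whether such a simple estimate is accurate enough is exactly the question the abstract raises.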