Contributors: Laboratoire de Mathématiques d'Orsay (LMO); Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS); Statistique mathématique et apprentissage (CELESTE); Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-Centre Inria de l'Université Paris-Saclay; Centre Inria de Saclay; Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre Inria de Saclay; Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria); Institut des Sciences des Plantes de Paris-Saclay (IPS2 (UMR_9213 / UMR_1403)); Université d'Évry-Val-d'Essonne (UEVE)-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-Université Paris Cité (UPCité)-Institut National de Recherche pour l’Agriculture, l’Alimentation et l’Environnement (INRAE); Institut de Biologie Intégrative de la Cellule (I2BC); Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS); AgroParisTech; the INRAE MIGALE bioinformatics facility (MIGALE, INRAE, 2020. Migale Bioinformatics Facility, doi:10.15454/1.5572390655343293E12); ANR-11-LABX-0056,LMH,LabEx Mathématique Hadamard(2011); ANR-17-EURE-0007,SPS-GSR,Ecole Universitaire de Recherche de Sciences des Plantes de Paris-Saclay(2017)
Abstract: 29 pages, 9 figures, 3 tables; Data sets keep growing in size and now provide a large number of variables to describe a phenomenon. Assuming that the relationship between the active variables and the response variable is linear, high-dimensional Gaussian linear regression provides a relevant framework for identifying the active variables related to the response. Many methods exist; in this article, we focus on those based on regularization paths. We carry out a comparison study over different simulation settings and evaluate the performance of the methods. Our results show that the ability to discriminate between active and inactive variables is both important and difficult when the data are not normally distributed and there is a dependency structure between variables. We observe that LARS combined with Elastic-net often gives the best performance. Finally, even though no method is optimal, the methods can be grouped according to their performance and the characteristics of the data set.
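As an illustration of what a regularization path is, the sketch below computes a lasso path by cyclic coordinate descent on a small simulated Gaussian linear model and lists which variables are active at a few penalty levels. It is not the article's code and is neither LARS nor Elastic-net; the function names (`soft_threshold`, `lasso_path`), the simulation sizes, the coefficient values, and the penalty grid are all assumptions chosen for the example, and the model is fitted without an intercept.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_path(X, y, n_lambdas=20, ratio=0.01, n_sweeps=100):
    """Return [(lambda, beta)] on a decreasing log-spaced grid, using warm starts.

    Minimizes (1 / 2n) * ||y - X beta||^2 + lambda * ||beta||_1 for each lambda.
    """
    n, p = len(X), len(X[0])
    # Smallest penalty for which every coefficient is exactly zero.
    lam_max = max(abs(sum(X[i][j] * y[i] for i in range(n))) / n for j in range(p))
    lambdas = [lam_max * ratio ** (k / (n_lambdas - 1)) for k in range(n_lambdas)]
    # Mean square of each column, the curvature of each one-dimensional problem.
    col_ms = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    beta = [0.0] * p
    residual = list(y)  # y - X beta, kept up to date after every update
    path = []
    for lam in lambdas:
        for _ in range(n_sweeps):
            for j in range(p):
                # Correlation of column j with the partial residual (beta_j removed).
                rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
                new = soft_threshold(rho, lam) / col_ms[j]
                if new != beta[j]:
                    delta = new - beta[j]
                    for i in range(n):
                        residual[i] -= X[i][j] * delta
                    beta[j] = new
        path.append((lam, list(beta)))
    return path


# Hypothetical simulation: 3 active variables out of 10, independent Gaussian design.
random.seed(0)
n, p = 50, 10
true_beta = [2.0, -1.5, 1.0] + [0.0] * (p - 3)
X = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [sum(X[i][j] * true_beta[j] for j in range(p)) + random.gauss(0.0, 0.5) for i in range(n)]

path = lasso_path(X, y)
for lam, beta in path[::5]:
    active = [j for j, b in enumerate(beta) if b != 0.0]
    print(f"lambda={lam:.3f} active={active}")
```

Reading the printed active sets from the largest penalty to the smallest shows the order in which variables enter the model, which is the information that path-based selection methods exploit. Making the design columns correlated or the noise non-Gaussian would be one way to mimic the harder settings the abstract mentions.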