Abstract: International audience; The concept of the pangenome has emerged as a robust framework for analyzing genomic diversity across species in the era of next-generation sequencing [1]. Historically, pangenomic studies, particularly those of microbial genomes, have relied predominantly on gene families to delineate core and accessory genomes [2]. However, the criteria used to define gene families are largely inconsistent across the literature, making it difficult to compare pangenomic metrics between studies [3]. Furthermore, given the significant variability found even among genomes of the same species, the majority of microbial pangenomic studies rely on species-level datasets, often overlooking broader taxonomic analyses. To the best of our knowledge, only a limited number of studies, such as SCARAP [4], have developed suitable methods to explore pangenomic diversity beyond the species level. Only a single study, now outdated and non-reproducible, has provided a comprehensive analysis of this broader diversity using the average identity percentage [5]. This highlights the need to identify relevant strategies for building gene families, so as to rationalize the way pangenomic analyses are performed and to set parameter values (percent identity) according to the diversity of the genome dataset being compared.

To determine these parameters precisely, we proceeded as follows. We started by retrieving the genomic sequences of a representative set of reference bacterial species from the Genome Taxonomy Database [6], covering consistent taxonomic levels, from phyla to species. We identified the protein-coding sequences (CDS) of each of these genomes using the same procedure. For each pair of genomes at each taxonomic level, we then aligned the CDS of one genome against those of the other to identify the best pairs of homologous CDS. For each pair of homologous CDS, the percent identity and the coverage of the alignment were calculated.
Then, for each pair of genomes, we determined the average ...
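The pairwise comparison step described above can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes alignment hits are already available as records with hypothetical field names (`query`, `subject`, `pident`, `coverage`, `bitscore`), and it assumes that "best pairs of homologous CDS" means reciprocal best hits ranked by bit score, which the abstract does not state. The per-genome-pair summary (mean identity and mean coverage) is likewise only one plausible choice of statistic.

```python
from statistics import mean


def best_hits(hits):
    """Keep, for each query CDS, the hit with the highest bit score."""
    best = {}
    for hit in hits:
        q = hit["query"]
        if q not in best or hit["bitscore"] > best[q]["bitscore"]:
            best[q] = hit
    return best


def reciprocal_best_pairs(a_vs_b, b_vs_a):
    """Return hits from A to B whose subject's best hit in A is the same query.

    Assumption: reciprocal best hits stand in for 'best homologous pairs'.
    """
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    pairs = []
    for query, hit in forward.items():
        back = reverse.get(hit["subject"])
        if back is not None and back["subject"] == query:
            pairs.append(hit)
    return pairs


def summarize(pairs):
    """Mean percent identity and coverage over the retained CDS pairs."""
    if not pairs:
        return None
    return {
        "n_pairs": len(pairs),
        "mean_identity": mean(p["pident"] for p in pairs),
        "mean_coverage": mean(p["coverage"] for p in pairs),
    }


# Toy, made-up hits between two hypothetical genomes A and B.
a_vs_b = [
    {"query": "a1", "subject": "b1", "pident": 90.0, "coverage": 100.0, "bitscore": 200},
    {"query": "a1", "subject": "b2", "pident": 50.0, "coverage": 60.0, "bitscore": 80},
    {"query": "a2", "subject": "b2", "pident": 70.0, "coverage": 80.0, "bitscore": 150},
    {"query": "a3", "subject": "b2", "pident": 40.0, "coverage": 50.0, "bitscore": 60},
]
b_vs_a = [
    {"query": "b1", "subject": "a1", "pident": 90.0, "coverage": 100.0, "bitscore": 200},
    {"query": "b2", "subject": "a2", "pident": 70.0, "coverage": 80.0, "bitscore": 150},
    {"query": "b2", "subject": "a3", "pident": 40.0, "coverage": 50.0, "bitscore": 60},
]

pairs = reciprocal_best_pairs(a_vs_b, b_vs_a)
summary = summarize(pairs)
print(summary)
```

In this toy input, `a3` is dropped because its best hit `b2` points back to `a2`, so only two pairs contribute to the summary. Repeating the summary for every genome pair within a taxonomic level would give one value per pair, which could then be compared across levels.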