Abstract: [INTRODUCTION] Comparative genomics provides valuable insights into gene function, phylogeny, molecular evolution, and associations between phenotypic and genomic differences. Such analyses require knowledge about which genes originated from a speciation event (orthologs) or from a duplication event (paralogs). Existing methods to detect orthologs in turn require knowledge of the location of genes in the genome (gene annotation), which is itself a challenging problem, resulting in a growing gap between sequenced and annotated genomes.
[RATIONALE] We developed TOGA (Tool to infer Orthologs from Genome Alignments), a genomics method that integrates orthology inference and gene annotation. TOGA takes as input a gene annotation of a reference species (e.g., human, mouse, or chicken) and a whole-genome alignment between the reference and a query genome (e.g., other mammals or birds). It infers orthologous gene loci in the query genome, annotates and classifies orthologous genes, detects gene losses and duplications, and generates protein and codon alignments. Orthology detection relies on the principle that orthologous sequences are generally more similar to each other than to paralogous sequences. Whereas existing methods work with annotated protein-coding sequences, TOGA extends this similarity principle to non-exonic regions (introns and intergenic regions) and uses machine learning to detect orthologous gene loci based on alignments of intronic and intergenic regions.
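The similarity principle described above can be illustrated with a toy scoring function. This is only a minimal sketch: the abstract does not specify which features or which machine learning model TOGA uses, so the feature names (`exon_cov`, `intron_cov`, `flank_cov`), the logistic form, and the weights below are all hypothetical and chosen for illustration. The point it shows is that two candidate loci with similar exon alignment can be separated by how well the surrounding introns and intergenic regions align.

```python
import math
from dataclasses import dataclass


@dataclass
class CandidateLocus:
    """A query-genome locus that aligns to a reference gene (hypothetical model)."""
    name: str
    exon_cov: float    # fraction of reference exon bases aligned at this locus
    intron_cov: float  # fraction of reference intron bases aligned
    flank_cov: float   # fraction of flanking intergenic bases aligned


# Made-up weights, NOT taken from TOGA: non-exonic coverage is weighted
# heavily because paralogs and processed pseudogenes tend to align in
# exons only, whereas orthologs also align in introns and flanks.
WEIGHTS = {"exon_cov": 1.0, "intron_cov": 4.0, "flank_cov": 4.0}
BIAS = -4.0


def orthology_score(locus: CandidateLocus) -> float:
    """Return a logistic score in (0, 1); higher means more ortholog-like."""
    z = (
        BIAS
        + WEIGHTS["exon_cov"] * locus.exon_cov
        + WEIGHTS["intron_cov"] * locus.intron_cov
        + WEIGHTS["flank_cov"] * locus.flank_cov
    )
    return 1.0 / (1.0 + math.exp(-z))


candidates = [
    # Exons, introns, and flanks all align: ortholog-like.
    CandidateLocus("locus_A", exon_cov=0.95, intron_cov=0.80, flank_cov=0.70),
    # Exons align but introns and flanks do not: paralog-like.
    CandidateLocus("locus_B", exon_cov=0.90, intron_cov=0.05, flank_cov=0.00),
]

best = max(candidates, key=orthology_score)
```

With these illustrative numbers, `locus_A` scores above 0.5 and `locus_B` below it, even though their exon coverage is nearly identical.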
[RESULTS] We demonstrate that TOGA’s machine learning classifier detects orthologous gene loci with a very high accuracy, and also works for orthologous genes that underwent translocations or inversions. TOGA improves ortholog detection and comprehensively annotates conserved genes, even if transcriptomics data are available. Although homology-based methods such as TOGA cannot annotate orthologs of genes that are not present in the reference, we show that reference bias can be effectively counteracted by integrating annotations generated with multiple reference species. TOGA can also be applied to highly fragmented genome assemblies, where genes are often split across scaffolds. By accurately identifying and joining orthologous gene fragments, TOGA annotates entire genes and thus increases the utility of fragmented genomes for comparative analyses. TOGA’s gene classification explicitly distinguishes between genes with missing sequences (indicative of assembly incompleteness) and genes with inactivating mutations (potentially indicative of base errors). We show that this classification provides a superior benchmark for assembly completeness and quality. As genomes are generated at an increasing rate, annotation and orthology inference methods that can handle hundreds or thousands of genomes are needed. TOGA’s reference species methodology scales linearly with the number of query species. By applying TOGA with human and mouse as references to 488 placental mammal assemblies and using chicken as a reference for 501 bird assemblies, we created large comparative resources for mammals and birds that comprise gene annotations, ortholog sets, lists of inactivated genes, and multiple codon alignments.
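The distinction between genes with missing sequence and genes with inactivating mutations can be sketched as a simple classification rule followed by a per-assembly tally. This is an illustrative assumption, not TOGA's actual classification scheme: the category names, the `missing_threshold` parameter, and the order in which the rules are applied are all hypothetical.

```python
from collections import Counter


def classify_gene(missing_fraction: float,
                  inactivating_mutations: int,
                  missing_threshold: float = 0.5) -> str:
    """Assign a gene to one of three hypothetical categories.

    missing_fraction: share of the coding sequence that falls into
        assembly gaps or is otherwise absent from the query assembly.
    inactivating_mutations: number of frameshifts, premature stop
        codons, or similar lesions observed in the aligned sequence.
    """
    if inactivating_mutations > 0:
        return "inactivated"       # lesion observed in present sequence
    if missing_fraction > missing_threshold:
        return "missing_sequence"  # points to assembly incompleteness
    return "intact"


def summarize(genes):
    """Tally categories over (missing_fraction, inactivating_mutations) pairs."""
    return Counter(classify_gene(m, n) for m, n in genes)


# Four made-up genes from one hypothetical assembly.
genes = [(0.0, 0), (0.05, 0), (0.8, 0), (0.0, 2)]
summary = summarize(genes)
```

Keeping the two failure modes in separate counts is what allows a summary of this kind to report on assembly completeness and on sequence-level quality independently.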
[CONCLUSION] TOGA provides a general strategy to cope with the annotation and orthology inference bottleneck. We envision three major uses. First, TOGA enables phylogenomic analyses of orthologous genes and screens for gene changes (e.g., selection, loss, and duplication) that are associated with phenotypic differences. Second, TOGA provides annotations of genes that are conserved in newly sequenced genomes, which can be supplemented with transcriptomics data to detect lineage-specific genes or exons. Finally, TOGA’s gene classification provides a powerful genome assembly quality benchmark.
[ABSTRACT] Annotating coding genes and inferring orthologs are two classical challenges in genomics and evolutionary biology that have traditionally been approached separately, limiting scalability. We present TOGA (Tool to infer Orthologs from Genome Alignments), a method that integrates structural gene annotation and orthology inference. TOGA implements a different paradigm to infer orthologous loci, improves ortholog detection and annotation of conserved genes compared with state-of-the-art methods, and handles even highly fragmented assemblies. TOGA scales to hundreds of genomes, which we demonstrate by applying it to 488 placental mammal and 501 bird assemblies, creating the largest comparative gene resources so far. Additionally, TOGA detects gene losses, enables selection screens, and automatically provides a superior measure of mammalian genome quality. TOGA is a powerful and scalable method to annotate and compare genes in the genomic era.