Juan M Escorcia-Rodríguez, Mario Esposito, Julio A Freyre-González, Gabriel Moreno-Hagelsieb.
Abstract
Orthologs separate after lineages split from each other and paralogs after gene duplications. Thus, orthologs are expected to remain more functionally coherent across lineages, while paralogs have been proposed as a source of new functions. Because protein functional divergence follows from non-synonymous substitutions, we performed an analysis based on the ratio of non-synonymous to synonymous substitutions (dN/dS) as a proxy for functional divergence. We used five working definitions of orthology, including reciprocal best hits (RBH), among other definitions based on network analyses and clustering. The results showed that orthologs, by all definitions tested, had dN/dS values noticeably lower than those of paralogs, suggesting that orthologs generally tend to be more functionally stable than paralogs. The differences in dN/dS ratios, and thus the suggested functional stability of orthologs, remained after eliminating gene comparisons with potential problems, such as genes with high codon usage biases, low coverage of either of the aligned sequences, or sequences with very high similarities. Separation by percent identity of the encoded proteins showed that the differences between the dN/dS ratios of orthologs and paralogs were more evident at high sequence identity and less so as identity dropped. These last results suggest that the differences between dN/dS ratios were partially related to differences in protein identity. However, they also suggest that paralogs undergo functional divergence relatively early after duplication. Our analyses indicate that treating orthologs as the genes most likely to remain functionally coherent is still the right approach in comparative genomics.
Keywords: Functional divergence; Nonsynonymous to synonymous substitutions; Orthologs; Paralogs; Positive selection; dN/dS
Year: 2022 PMID: 36065404 PMCID: PMC9440661 DOI: 10.7717/peerj.13843
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 3.061
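The abstract names reciprocal best hits (RBH) as one of the working definitions of orthology. As a rough illustration of that idea only, and not the authors' pipeline, the sketch below assumes that sequence-search hits are already available as hypothetical `(query, subject, score)` tuples for each search direction. The gene names and scores are invented, and ties are resolved by keeping the first hit seen, which is an arbitrary choice.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical hit record: (query gene, subject gene, alignment score).
Hit = Tuple[str, str, float]


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Map each query gene to its highest-scoring subject gene.

    On ties, the first hit encountered is kept (an arbitrary choice).
    """
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(
    a_vs_b: Iterable[Hit], b_vs_a: Iterable[Hit]
) -> Set[Tuple[str, str]]:
    """Return (gene in A, gene in B) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return {(a, b) for a, b in best_ab.items() if best_ba.get(b) == a}


# Invented toy hits, used only to exercise the functions.
a_vs_b = [
    ("a1", "b1", 200.0),
    ("a1", "b2", 150.0),
    ("a2", "b2", 90.0),
    ("a3", "b1", 80.0),
]
b_vs_a = [
    ("b1", "a1", 198.0),
    ("b2", "a1", 150.0),
    ("b2", "a2", 95.0),
]
rbh_pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(rbh_pairs)
```

In this toy input, `a2` picks `b2` as its best hit, but `b2` prefers `a1`, so that pair is not reciprocal and is left out.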
Genomes used in this study.
| Genome ID | Class | Order | Species |
|---|---|---|---|
| Phylum Proteobacteria | | | |
| GCF_000005845 | Gammaproteobacteria | Enterobacterales | |
| GCF_002370525 | Gammaproteobacteria | Pseudomonadales | |
| GCF_002847445 | Alphaproteobacteria | Rhodobacterales | |
| GCF_004194535 | Betaproteobacteria | Neisseriales | |
| GCF_013085545 | Deltaproteobacteria | Desulfovibrionales | |
| GCF_013283835 | Epsilonproteobacteria | Campylobacterales | |
| GCF_000317895 | Oligoflexia | Bdellovibrionales | |
| GCF_009662475 | Acidithiobacillia | Acidithiobacillales | |
| GCF_002795805 | Zetaproteobacteria | Mariprofundales | |
| GCF_003574215 | Hydrogenophilalia | Hydrogenophilales | |
| Phylum Firmicutes | | | |
| GCF_000009045 | Bacilli | Bacillales | |
| GCF_002197645 | Bacilli | Lactobacillales | |
| GCF_000218855 | Clostridia | Eubacteriales | |
| GCF_003991135 | Clostridia | Halanaerobiales | |
| GCF_000020005 | Clostridia | Natranaerobiales | |
| GCF_003966895 | Negativicutes | Selenomonadales | |
| GCF_003367905 | Negativicutes | Veillonellales | |
| GCF_012317185 | Erysipelotrichia | Erysipelotrichales | |
| GCF_000299355 | Tissierellia | Tissierellales | |
| GCF_001544015 | Limnochordia | Limnochordales | |
| Phylum Euryarchaeota | | | |
| GCF_000025625 | Halobacteria | Natrialbales | |
| GCF_000011085 | Halobacteria | Halobacteriales | |
| GCF_000025685 | Halobacteria | Haloferacales | |
| GCF_000195895 | Methanomicrobia | Methanosarcinales | |
| GCF_000013445 | Methanomicrobia | Methanomicrobiales | |
| GCF_001433455 | Thermococci | Thermococcales | |
| GCF_000024185 | Methanobacteria | Methanobacteriales | |
| GCF_000006175 | Methanococci | Methanococcales | |
| GCF_000734035 | Archaeoglobi | Archaeoglobales | |
| GCF_000007185 | Methanopyri | Methanopyrales | |
Note:
The query genomes were the first in each group.
Figure 1. Non-synonymous to synonymous substitutions (dN/dS).
The dN/dS ratios correspond to genes compared between query organisms against genomes from organisms in the same taxonomic phylum, namely: E. coli against other Proteobacteria, B. subtilis against other Firmicutes, and N. magadii against other Euryarchaeota. Genome identifiers are ordered from most similar to least similar to the query genome. The dN/dS distribution is higher for paralogs, suggesting that a higher proportion of orthologs have retained their functions.
Figure 2. Control experiments.
Left: dN/dS ratios were lower for orthologs, under each definition of orthology, than for the corresponding paralogs. RBH were included as reference. Right: examples of dN/dS values obtained when testing for potential biases. The Goldman and Yang model for estimating codon frequencies (Goldman & Yang, 1994), included as reference, is the default. The 80 vs 80 test used data for orthologs and paralogs filtered to contain only alignments covering at least 80% of both proteins. The maximum identity test filtered out sequences more than 70% identical. The CAI test filtered out sequences with Codon Adaptation Indexes (CAI) in the top and bottom 15 percent of the genome's CAI distribution. We also tested the effect of the Muse and Gaut model for estimating background codon frequencies (Muse & Gaut, 1994).
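The three filters described in this caption (80 vs 80 coverage, maximum identity, and CAI tails) can be sketched as simple predicates over gene pairs. This is a minimal illustration under stated assumptions, not the authors' code: the `GenePair` record and its field names are hypothetical, the CAI filter is assumed to look only at the query gene's CAI, the percentile uses the nearest-rank method, and genes whose CAI equals a cutoff are dropped. All numbers are invented.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class GenePair:
    """One ortholog or paralog comparison; every field name is hypothetical."""
    query_coverage: float    # fraction of the query protein in the alignment
    subject_coverage: float  # fraction of the subject protein in the alignment
    percent_identity: float  # identity of the encoded proteins, 0 to 100
    query_cai: float         # Codon Adaptation Index of the query gene
    dn_ds: float


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def coverage_filter(pairs: List[GenePair], min_cov: float = 0.80) -> List[GenePair]:
    """Keep alignments covering at least min_cov of both proteins."""
    return [
        p for p in pairs
        if p.query_coverage >= min_cov and p.subject_coverage >= min_cov
    ]


def max_identity_filter(pairs: List[GenePair], max_identity: float = 70.0) -> List[GenePair]:
    """Drop pairs more than max_identity percent identical."""
    return [p for p in pairs if p.percent_identity <= max_identity]


def cai_filter(pairs: List[GenePair], genome_cai: List[float],
               tail_pct: float = 15.0) -> List[GenePair]:
    """Drop pairs whose query CAI lies in either tail of the genome's CAI values."""
    low = percentile(genome_cai, tail_pct)
    high = percentile(genome_cai, 100 - tail_pct)
    return [p for p in pairs if low < p.query_cai < high]


# Invented toy data, used only to exercise the filters.
genome_cai = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1.00]
pairs = [
    GenePair(0.95, 0.90, 55.0, 0.50, 0.08),  # passes every filter
    GenePair(0.60, 0.90, 55.0, 0.50, 0.30),  # fails the coverage filter
    GenePair(0.95, 0.90, 92.0, 0.50, 0.02),  # fails the maximum identity filter
    GenePair(0.95, 0.90, 55.0, 0.95, 0.05),  # fails the CAI filter
]
kept = cai_filter(max_identity_filter(coverage_filter(pairs)), genome_cai)
print(len(kept))
```

The filters are chained here only for compactness. The caption describes them as separate tests, so in practice each would be applied to the full set of pairs on its own.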
Figure 3. Non-synonymous to synonymous substitutions (dN/dS) and divergence.
The difference between dN/dS ratios became less apparent as protein identity decreased.
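One way to look at this kind of trend is to bin gene pairs by protein identity and summarize dN/dS per bin for orthologs and paralogs separately. The sketch below is an assumption-laden illustration, not the analysis behind the figure: the record layout, the 10-point bin width, and the use of the median are all choices made here, and the values are invented solely to exercise the function.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List, Tuple

# Hypothetical record: (label, percent identity, dN/dS),
# where label is "ortholog" or "paralog".
Record = Tuple[str, float, float]


def median_dnds_by_identity(
    records: Iterable[Record], bin_width: float = 10.0
) -> Dict[float, Dict[str, float]]:
    """Return {bin start: {label: median dN/dS}} for identity bins of bin_width."""
    grouped: Dict[float, Dict[str, List[float]]] = defaultdict(
        lambda: defaultdict(list)
    )
    for label, identity, dn_ds in records:
        bin_start = (identity // bin_width) * bin_width
        grouped[bin_start][label].append(dn_ds)
    return {
        bin_start: {label: median(values) for label, values in by_label.items()}
        for bin_start, by_label in sorted(grouped.items())
    }


# Invented toy records, not results from the study.
records = [
    ("ortholog", 85.0, 0.04),
    ("ortholog", 88.0, 0.06),
    ("paralog", 82.0, 0.20),
    ("paralog", 89.0, 0.30),
    ("ortholog", 45.0, 0.10),
    ("paralog", 42.0, 0.12),
]
summary = median_dnds_by_identity(records)
for bin_start, medians in summary.items():
    print(bin_start, medians)
```

Comparing the ortholog and paralog medians within each bin, rather than across the whole dataset, keeps differences in protein identity from being mistaken for differences between the two gene classes.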