Natasha Glover1,2,3, Shaoline Sheppard4, Christophe Dessimoz1,2,3,5,6.
Abstract
Homoeologs are pairs of genes or chromosomes in the same species that originated by speciation and were brought back together in the same genome by allopolyploidization. Bioinformatic methods for accurate homoeology inference are crucial for studying the evolutionary consequences of polyploidization, and homoeology is typically inferred on the basis of bidirectional best hit (BBH) and/or positional conservation (synteny). However, these methods neglect the fact that genes can duplicate and move, both prior to and after the allopolyploidization event. These duplications and movements can result in many-to-many and/or nonsyntenic homoeologs, which thus remain undetected and unstudied. Here, using the allotetraploid upland cotton (Gossypium hirsutum) as a case study, we show that conventional approaches indeed miss a substantial proportion of homoeologs. Additionally, we found that many of the missed pairs of homoeologs are broadly and highly expressed. A gene ontology analysis revealed that a high proportion of the nonsyntenic and non-BBH homoeologs are involved in protein translation and are likely to contribute to the functional repertoire of cotton. Thus, from an evolutionary and functional genomics standpoint, choosing a homoeolog inference method that does not rely solely on 1:1 relationship cardinality or synteny is crucial for not missing these potentially important homoeolog pairs.
Keywords: Gossypium hirsutum; best bidirectional hit; comparative genomics; cotton; homoeolog; synteny
Year: 2021 PMID: 33871639 PMCID: PMC8214411 DOI: 10.1093/gbe/evab077
Source DB: PubMed Journal: Genome Biol Evol ISSN: 1759-6653 Impact factor: 3.416
Synteny among homoeolog pairs in the cotton genome. (A) An example of the method for computing synteny scores. For each homoeolog pair (connected red dots), a window of ten neighboring genes around each homoeolog is formed. The synteny score is computed as the fraction of the 10 + 10 = 20 neighbors that have at least one homoeologous counterpart in the other window (blue-dotted lines). (B) Histogram of the synteny scores for the homoeolog pairs (N = 31,901). The first bin includes only synteny scores of 0. The rest of the bins include the rightmost edge.
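The window-based score described in panel (A) can be sketched roughly as follows. This is an illustrative reading of the caption, not the authors' code: the function names, the choice of five genes on each side to make up the ten-gene window, and the handling of chromosome ends (dividing by the number of neighbors actually available) are all assumptions.

```python
from typing import List, Set

def neighbors(order: List[str], gene: str, flank: int = 5) -> List[str]:
    """Return up to `flank` genes on each side of `gene`, excluding the gene itself.

    Assumption: the ten-gene window is split as 5 upstream + 5 downstream.
    """
    i = order.index(gene)
    return order[max(0, i - flank):i] + order[i + 1:i + 1 + flank]

def synteny_score(gene_a: str, gene_b: str,
                  order_a: List[str], order_b: List[str],
                  homoeologs: Set[frozenset]) -> float:
    """Fraction of neighbors with at least one homoeolog in the other window."""
    win_a = neighbors(order_a, gene_a)
    win_b = neighbors(order_b, gene_b)
    total = len(win_a) + len(win_b)  # 10 + 10 = 20 away from chromosome ends
    if total == 0:
        return 0.0
    # Count neighbors on each side that pair with any gene in the opposite window.
    hits_a = sum(any(frozenset((x, y)) in homoeologs for y in win_b) for x in win_a)
    hits_b = sum(any(frozenset((x, y)) in homoeologs for x in win_a) for y in win_b)
    return (hits_a + hits_b) / total

# Toy data (hypothetical gene names): two collinear regions of 11 genes each.
order_a = [f"A{i}" for i in range(11)]
order_d = [f"D{i}" for i in range(11)]
homoeologs = {frozenset((f"A{i}", f"D{i}")) for i in (0, 1, 2, 3, 5, 6, 7)}

score = synteny_score("A5", "D5", order_a, order_d, homoeologs)
print(score)  # 0.6: 6 of 10 neighbors on each side have a counterpart
```

In the toy example, a score of 0 would correspond to the first histogram bin in panel (B), where no neighbor on either side has a homoeolog in the opposite window.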
(A) Overlap between homoeolog pairs found with OMA (green) and pairs found with the BBH method (red). (B) Contingency table of the comprehensive set of homoeolog pairs from OMA, subdivided into categories based on synteny and BBH-status (BBH & syntenic, BBH & nonsyntenic, non-BBH & syntenic, and non-BBH & nonsyntenic). Only pairs for which we were able to compute a synteny score were used (31,901 out of 32,426 pairs after removing those with genes on small scaffolds). The majority of the pairs are both syntenic and BBHs, but 8,276 pairs (25.9%) are either nonsyntenic, non-BBH, or both nonsyntenic and non-BBH. Furthermore, 3,247 pairs (10.2%) are both nonsyntenic and non-BBH.
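To make the BBH criterion and the four-way split in panel (B) concrete, here is a minimal sketch on toy data. The similarity scores, gene names, and the synteny cutoff are hypothetical; the cutoff used to call a pair "syntenic" is not stated in this excerpt, so it is left as a parameter. Ties between equally good hits are resolved by first occurrence, which is also an assumption.

```python
from collections import Counter

def bidirectional_best_hits(scores):
    """scores maps (gene_a, gene_d) -> similarity. Return the set of BBH pairs."""
    best_for_a, best_for_d = {}, {}
    for (a, d), s in scores.items():
        # Keep the highest-scoring partner seen so far in each direction.
        if a not in best_for_a or s > scores[(a, best_for_a[a])]:
            best_for_a[a] = d
        if d not in best_for_d or s > scores[(best_for_d[d], d)]:
            best_for_d[d] = a
    # A pair is a BBH only if each gene is the other's best hit.
    return {(a, d) for a, d in best_for_a.items() if best_for_d[d] == a}

def classify(pair, bbh_pairs, synteny, threshold):
    """Label a homoeolog pair with its BBH and synteny status."""
    bbh = "BBH" if pair in bbh_pairs else "non-BBH"
    syn = "syntenic" if synteny[pair] >= threshold else "nonsyntenic"
    return f"{bbh} & {syn}"

# Toy input: A1 has two homoeologs (D1, D2), so one of its pairs cannot be a BBH.
scores = {("A1", "D1"): 95, ("A1", "D2"): 90, ("A2", "D2"): 80, ("A3", "D3"): 70}
synteny = {("A1", "D1"): 0.8, ("A1", "D2"): 0.0, ("A2", "D2"): 0.5, ("A3", "D3"): 0.0}

bbh = bidirectional_best_hits(scores)
counts = Counter(classify(p, bbh, synteny, threshold=0.1) for p in scores)
print(sorted(bbh))
print(dict(counts))
```

The pairs (A1, D2) and (A2, D2) are valid many-to-many homoeologs in the toy input, yet neither survives the BBH filter, which is the kind of loss the contingency table quantifies.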
Characteristics of the four classes of homoeologs. Shown on each plot are BBH & syntenic (blue), BBH & nonsyntenic (orange), non-BBH & syntenic (green), and non-BBH & nonsyntenic (red). (A) Total number of pairs in each category. (B) Distribution of the number of homoeologous pairs (Nb. Hom. Pairs), which is a proxy for the extent of duplication. A pair that has not undergone duplication has Nb. Hom. Pairs = 1. The gene-centric data set was used (see Materials and Methods). (C) Distribution of the evolutionary distances for all homoeolog pairs, measured in PAM units. The pair-centric data set was used. (D) Distribution of protein lengths, in amino acids. The line in the middle of each boxplot represents the median, and outliers are not shown. The gene-centric data set was used. A Kolmogorov–Smirnov test between each pair of categories in (B–D) showed a significant difference between all distributions (supplementary table 2, Supplementary Material online).
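One plausible way to obtain the per-gene count plotted in panel (B) is to tally how many homoeolog pairs each gene takes part in. The sketch below assumes that reading of the caption; the function name and the toy pairs are hypothetical.

```python
from collections import Counter

def homoeolog_pair_counts(pairs):
    """Count, for every gene, the number of homoeolog pairs it belongs to."""
    counts = Counter()
    for gene_a, gene_d in pairs:
        counts[gene_a] += 1
        counts[gene_d] += 1
    return counts

# Toy pairs: A1 and D2 each take part in two pairs, the rest in one.
pairs = [("A1", "D1"), ("A1", "D2"), ("A2", "D2"), ("A3", "D3")]
pair_counts = homoeolog_pair_counts(pairs)
print(pair_counts["A1"])  # 2: more than one partner, suggesting duplication
print(pair_counts["A3"])  # 1: a simple one-to-one relationship
```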
Expression analysis of the Gossypium hirsutum homoeolog pairs, grouped by synteny/BBH-status. Only genes with a transcripts per kilobase million value (TPM) ≥2 were considered expressed. (A) Survey of percentage of genes per category expressed or not. (B) Expression breadth, that is, number of tissues, in control conditions, in which expression was detected. The violin plot shows the density curve for each homoeolog category, where the width of the curve represents the estimated frequency of data points. (C) Expression level, or the mean TPM, averaged across all 12 tissues. For (B and C), only genes which were expressed at all are shown, and the filtered gene-centric data set was used. Outliers are not shown.
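The three quantities in this figure (expressed or not, breadth, and level) can be derived from a per-tissue TPM table along the following lines. The TPM ≥ 2 threshold comes from the caption; the tissue names, the values, and the rule that a gene counts as expressed when it passes the threshold in at least one tissue are assumptions made for illustration. The real data cover 12 tissues, whereas the toy example uses four.

```python
from statistics import mean

TPM_THRESHOLD = 2.0  # from the caption: TPM >= 2 counts as expressed

def expression_breadth(tpm_by_tissue, threshold=TPM_THRESHOLD):
    """Number of tissues in which the gene reaches the expression threshold."""
    return sum(1 for tpm in tpm_by_tissue.values() if tpm >= threshold)

def is_expressed(tpm_by_tissue, threshold=TPM_THRESHOLD):
    """Assumption: expressed means passing the threshold in at least one tissue."""
    return expression_breadth(tpm_by_tissue, threshold) > 0

def expression_level(tpm_by_tissue):
    """Mean TPM averaged across all tissues, including those below threshold."""
    return mean(tpm_by_tissue.values())

# Hypothetical gene measured in four tissues.
gene_tpm = {"root": 0.5, "leaf": 2.0, "stem": 10.0, "fiber": 3.5}

breadth = expression_breadth(gene_tpm)
level = expression_level(gene_tpm)
print(is_expressed(gene_tpm), breadth, level)  # True 3 4.0
```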
GO Enrichment of Genes from Different Categories of Homoeologs
| Homoeolog category | Biological Process | Cellular Component | Molecular Function |
|---|---|---|---|
| BBH & syntenic | Total: 56. Regulation of biological quality, biological process, biological regulation, RNA modification, organic substance metabolic process, cellular process, phospholipid metabolic process, lipid metabolic process, metabolic process, methylation, response to acid chemical, response to stimulus, organelle organization, developmental process | Total: 12. Cytosol, cellular anatomical entity, cellular component, integral component of membrane, membrane, organelle | Total: 36. Protein-binding, molecular function, binding, sequence-specific DNA binding, DNA-binding transcription factor activity, catalytic activity, methyltransferase activity, hydrolase activity, phosphoric ester hydrolase activity, transferase activity, drug binding, zinc ion binding, catalytic activity acting on a protein |
| BBH & nonsyntenic | Total: 13. Translation, ribosomal small subunit assembly | Total: 4. Ribonucleoprotein complex | Total: 7. Structural constituent of ribosome, structural molecule activity, RNA–DNA hybrid ribonuclease activity |
| non-BBH & syntenic | Total: 28. Translation, nucleosome assembly, biosynthetic process, negative regulation of hydrolase activity, cell recognition, recognition of pollen | Total: 14. Ribosome, DNA packaging complex, protein-containing complex | Total: 25. Structural constituent of ribosome, structural molecule activity, protein heterodimerization activity, ADP binding, sulfotransferase activity, protein tag, protein phosphatase inhibitor activity, isoprenoid binding, chromatin DNA binding, P-P-bond-hydrolysis-driven protein transmembrane transporter activity |
| non-BBH & nonsyntenic | Total: 73. ATP biosynthetic process, biosynthetic process, ribonucleoprotein complex assembly, positive regulation of translation, energy coupled proton transport, energy coupled proton transport down electrochemical gradient, respiratory electron transport chain, ATP metabolic process | Total: 24. Ribosome, ribonucleoprotein complex, protein-containing complex, organelle, respirasome | Total: 26. Structural constituent of ribosome, structural molecule activity, rRNA binding, RNA–DNA hybrid ribonuclease activity, protein heterodimerization activity, catalytic activity acting on RNA, protein tag, nucleoside transmembrane transporter activity |
Note.—The nonredundant, gene-centric data set was used for the enrichment. For each category of homoeologs, the study set was all the genes comprising the category, and the background set was all the genes in all categories. Enriched terms with a Bonferroni-corrected P value <0.05 were used to summarize the main GO terms with Revigo. The table shows the total number of GO terms enriched, and the representative terms defined by Revigo.
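The enrichment setup in the note (study set versus background, followed by Bonferroni correction) can be illustrated with a small sketch. The note does not name the statistical test, so the one-sided hypergeometric test used here is an assumption, as are the function names and all counts. The Revigo summarization step is not reproduced.

```python
from math import comb

def hypergeom_sf(k, population, annotated, drawn):
    """P(X >= k) when drawing `drawn` genes from `population`,
    of which `annotated` carry the GO term."""
    upper = min(annotated, drawn)
    favorable = sum(comb(annotated, i) * comb(population - annotated, drawn - i)
                    for i in range(k, upper + 1))
    return favorable / comb(population, drawn)

def bonferroni(p_values):
    """Multiply each P value by the number of tests, capped at 1."""
    n_tests = len(p_values)
    return {term: min(1.0, p * n_tests) for term, p in p_values.items()}

# Toy counts: background of 100 genes, study set (one category) of 20 genes.
BACKGROUND, STUDY = 100, 20
term_counts = {
    # term: (genes annotated in background, genes annotated in study set)
    "GO:0006412": (10, 8),   # translation, strongly over-represented here
    "GO:0005840": (50, 10),  # ribosome, present at the expected rate here
}

raw_p = {term: hypergeom_sf(in_study, BACKGROUND, in_bg, STUDY)
         for term, (in_bg, in_study) in term_counts.items()}
adjusted = bonferroni(raw_p)
enriched = [term for term, p in adjusted.items() if p < 0.05]
print(enriched)
```

Because the background is the union of all categories, a term is reported only when one category carries it more often than homoeologs do overall, not more often than the whole genome.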
Proportion of genes per category of homoeologs which were annotated with either “translation” (GO:0006412) or “ribosome” (GO:0005840). Filtered, gene-centric data set was used.