Gavin C Conant.
Abstract
The ancestor of most teleost fishes underwent a whole-genome duplication event three hundred million years ago. Despite its antiquity, the effects of this event are evident both in the structure of teleost genomes and in how the surviving duplicated genes still operate to drive form and function. I inferred a set of shared syntenic regions that survive from the teleost genome duplication (TGD) using eight teleost genomes and the outgroup gar genome (which lacks the TGD). I then phylogenetically modeled the TGD's resolution via shared and independent gene losses and applied a new simulation-based statistical test for the presence of bias toward the preservation of genes from one parental subgenome. On the basis of that test, I argue that the TGD was likely an allopolyploidy. I find that duplicate genes surviving from this duplication in zebrafish are less likely to function in early embryo development than are genes that have returned to single copy at some point in this species' history. The tissues these ohnologs are expressed in, as well as their biological functions, lend support to recent suggestions that the TGD was the source of a morphological innovation in the structure of the teleost retina. Surviving duplicates also appear less likely to be essential than singletons, despite the fact that their single-copy orthologs in mouse are no less essential than other genes.
Year: 2020 PMID: 32298330 PMCID: PMC7161988 DOI: 10.1371/journal.pone.0231356
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Resolution of the TGD through ohnolog losses.
A) Shown is the assumed phylogeny of the eight species analyzed (see Methods). The TGD induces two mirrored gene trees, corresponding to the genes from the less fractionated parental genome (top) and the more fractionated parental genome (bottom, see Results for tests of the significance of the level of biased fractionation). Below the branches in each tree are POInT’s predicted number of gene losses along that branch for the parental genome in question. Above the branches in the upper tree are POInT’s branch length estimates, namely t (time) multiplied by the α parameter in Fig 2. Here αt corresponds to the overall estimated level of gene loss on that branch: a larger αt implies a greater number of losses relative to the total number of surviving ohnologs at the start of the branch. In the upper left are POInT’s parameter estimates (γ, ε1, δ) for the WGD-bcf model (see Fig 2). B) An example region of the eight genomes, showing the blocks of DCS. For all species except zebrafish, truncated Ensembl gene identifiers are given; for zebrafish, gene names are shown. The numbers above each column give POInT’s confidence in the orthology relationship shown, relative to the $2^{8}-1$ (= 255) other possible orthology relationships. These other relationships entail swapping the two tracks of genes from one or more of the genomes between the top and the bottom panel: the confidence estimates indicate how much worse a fit is induced by assuming a different set of subgenome assignments. Genes are color-coded based on the pattern of ohnolog survival in the eight genomes. A pair of ohnologs expressed in the zebrafish retina is shown in magenta.
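The count of 255 alternative orthology relationships follows from each of the eight genomes either keeping or swapping its two gene tracks, with the all-unswapped pattern being the relationship shown. A minimal sketch of that enumeration, assuming hypothetical genome labels (only the number of genomes comes from the caption):

```python
from itertools import product

# Hypothetical labels; the caption only states that eight genomes are compared.
GENOMES = [f"genome_{i}" for i in range(1, 9)]


def alternative_assignments(genomes):
    """Yield every track-swap pattern except the reference (no genome swapped)."""
    for swaps in product((False, True), repeat=len(genomes)):
        if any(swaps):  # skip the all-False pattern, i.e. the relationship shown
            yield dict(zip(genomes, swaps))


alternatives = list(alternative_assignments(GENOMES))
print(len(alternatives))  # 2**8 - 1 alternatives
```

Each dictionary marks which genomes have their top and bottom tracks exchanged. How POInT scores these alternatives is not covered by this sketch.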
Fig 2. Testing nested models of post-WGD ohnolog evolution.
A) Model states and parameter definitions for the set of models considered. U (Unduplicated), C1 (Converging state 1), C2 (Converging state 2) and F (Fixed) are duplicated states, while S1 (Single-copy 1) and S2 (Single-copy 2) are single-copy states (see Methods). C1 and S1 are states where the gene from the less-fractionated parental subgenome will be or is preserved, and C2 and S2 are the corresponding states for the more-fractionated parental subgenome. The fractionation rate ε (the probability of the loss of a gene from the less fractionated subgenome relative to the more fractionated one) can either be the same for conversions to C1 and C2 as it is for S1 and S2 (ε1 = ε2) or it can differ (see B). The weights of the various arrows give a cartoon impression of the relative frequency of the different events: exact parameter estimates for the WGD-bcf model are given in Fig 1. B) Testing nested models of WGD resolution. The most basic model (top) has neither biased fractionation nor duplicate fixation nor convergent losses. Adding any of these three processes improves the model fit (second row; blue arrows indicating statistical significance; $P<10^{-10}$). Adding the remaining two processes also improves the fit in all three cases (WGD-bcf model in the third row; $P<10^{-10}$). However, there is no evidence that the ε2 parameter is significantly different from 1.0 (WGD-bcf does not improve the fit over WGD-bcf, gray arrow indicating a lack of significant improvement in fit from the more complex model), implying no biased fractionation in the transitions to states C1 and C2. Likewise, there is no evidence that the η parameter is significantly different from 1.0 (WGD-bcf does not improve fit over WGD-bcf), meaning that losses from C1 and C2 occur at similar rates as do losses from U. Hence, the WGD-bcf model is best supported by these data and is used for the remaining analyses.
Model names: WGD-n: Null model; WGD-b: Biased fractionation model; WGD-f: Fixation model; WGD-c: Convergence model; WGD-bcf: Bias/Convergence/Fixation model; WGD-bcf: Bias/Convergence (non-biased)/Fixation model; WGD-bcf: Bias (2 rate)/Convergence/Fixation model; WGD-bcf: Bias/Convergence (non-biased convergence, neutral convergent loss)/Fixation model.
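Comparisons between nested models of this kind are commonly made with a likelihood ratio test. The sketch below assumes that twice the log-likelihood difference is referred to a chi-square distribution with one degree of freedom (one extra free parameter); this section of the source does not state the exact procedure, and the function name and the log-likelihood values are hypothetical.

```python
import math


def lrt_pvalue_1df(lnl_simple, lnl_complex):
    """P-value for a likelihood ratio test with one additional free parameter."""
    # Test statistic: twice the gain in log-likelihood from the extra parameter.
    stat = max(2.0 * (lnl_complex - lnl_simple), 0.0)
    # Survival function of a chi-square with 1 df is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical log-likelihoods for a simpler model and a model with one more parameter.
p_value = lrt_pvalue_1df(lnl_simple=-100.0, lnl_complex=-98.0792)
print(f"P = {p_value:.3f}")
```

A small P-value would correspond to a blue arrow in panel B (the extra parameter improves the fit), and a large one to a gray arrow.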
Fig 3. The estimated value of the biased fractionation parameter ε in the real teleost genomes (WGD-bf model, arrow, see Methods) is significantly different from those estimated from simulated genomes where biased fractionation was explicitly not included in the model (i.e., simulated ε = 1.0, bars).
Estimates of ε from these 100 simulations are always less than 1.0 because the model fits stochastic variations in the preservation patterns as potential biased fractionation. However, this stochastic variation never yields estimates of ε as small as seen in the real dataset (P<0.01).
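The logic of this comparison can be sketched as an empirical one-sided P-value: the fraction of null simulations whose estimate of ε is at least as small as the one from the real data. The numbers below are hypothetical placeholders, not the estimates from the paper; only the count of 100 simulations comes from the caption.

```python
def empirical_p_lower(observed, simulated):
    """Fraction of null simulations with an estimate at least as small as observed."""
    hits = sum(1 for value in simulated if value <= observed)
    return hits / len(simulated)


# Hypothetical values: 100 null estimates slightly below 1.0 and a smaller observed estimate.
simulated_eps = [0.90 + 0.001 * i for i in range(100)]
observed_eps = 0.70

p = empirical_p_lower(observed_eps, simulated_eps)
if p == 0.0:
    # No simulation was as extreme, so the P-value is bounded by 1 / (number of simulations).
    print(f"P < {1 / len(simulated_eps)}")
else:
    print(f"P = {p}")
```

With 100 simulations and none as extreme as the observed value, the bound is P < 0.01, matching the form of the statement in the caption.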
Fig 4. Timing of gene expression in development compared to patterns of ohnolog loss and retention.
On the x-axis is a timeline of zebrafish development from ZFIN [72], with the relevant stage names indicated at the top. The trendline in red indicates the proportion of zebrafish genes with an ohnolog partner first expressed at that stage (relative to total number of zebrafish genes analyzed with POInT and expressed at that stage). The dotted red line is the overall proportion of genes with an ohnolog partner in the POInT dataset (Dr_Ohno_POInT), while the dashed line is this proportion excluding any genes expressed in the zygote (see Methods). Open points show no statistically distinguishable difference from the overall proportion [chi-square test with an FDR correction, P>0.05; 74]. Red-filled points are significantly different from this overall mean (P≤0.05). Each point is labeled with the number of genes first expressed at that stage that have a surviving ohnolog and the number that do not. Trendlines in blue show similar values comparing the set of genes that POInT predicts were returned to single copy along the root branch of Fig 1 (confidence ≥ 0.85) to those only returned to single-copy along the tip branch leading to zebrafish. Hence, the right y-axis gives the proportion of losses that occurred along the root branch (relative to the sum of that number and the number of losses along the zebrafish branch). The dotted blue line is the overall proportion of genes returned to single-copy on the root branch (scaled as just described) while the dashed line is this proportion excluding any genes expressed in the zygote (see Methods). Open points are not statistically different from the overall proportion [chi-square test with an FDR correction, P>0.05; 74]. Blue-filled points are significantly different from this mean (P≤0.05), while green filled points are also different from the mean seen when zygotic-expressed genes are excluded (P≤0.05). 
Each point is labeled with the number of genes first expressed at that stage that returned to single copy along the root branch and along the branch leading to zebrafish.
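Because one chi-square test is run per developmental stage, the P-values are corrected for multiple testing. The caption cites an FDR correction without naming the procedure here; the sketch below assumes the Benjamini-Hochberg step-up method, and the input P-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value to the smallest, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-stage P-values.
raw = [0.01, 0.04, 0.03, 0.5]
adjusted = benjamini_hochberg(raw)
significant = [q <= 0.05 for q in adjusted]
print(adjusted, significant)
```

Stages whose adjusted value is at most 0.05 would be drawn as filled points under this assumption.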
Expression timing and fate of TGD-produced ohnologs.
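| Expression cluster | 1st gene set | 2nd gene set | Prop. of 1st set in cluster (a) | Prop. of 2nd set in cluster (a) | P-value (b) |
|---|---|---|---|---|---|
| Maternal transcripts (c) | All ohnologs (d) | All single-copy genes (d) | 0.03 (116/4279) | 0.04 (484/11616) | |
| | POInT ohnologs (e) | POInT single-copy genes (e) | 0.03 (81/2552) | 0.04 (193/4408) | |
| | Root-branch losses (f) | Zebrafish-branch losses (f) | 0.05 (103/1894) | 0.03 (8/250) | 0.18 |
| Pre-MBT transcripts (g) | All ohnologs (d) | All single-copy genes (d) | 0.10 (435/4279) | 0.15 (1709/11616) | |
| | POInT ohnologs (e) | POInT single-copy genes (e) | 0.11 (284/2552) | 0.17 (762/4408) | |
| | Root-branch losses (f) | Zebrafish-branch losses (f) | 0.19 (351/1894) | 0.12 (31/250) | |
| Zygotic transcripts (h) | All ohnologs (d) | All single-copy genes (d) | 0.06 (250/4279) | 0.05 (573/11616) | |
| | POInT ohnologs (e) | POInT single-copy genes (e) | 0.06 (142/2552) | 0.05 (216/4408) | 0.25 |
| | Root-branch losses (f) | Zebrafish-branch losses (f) | 0.05 (92/1894) | 0.05 (12/250) | >0.95 |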
a: Proportion of all genes in the set (see left) that were observed to be expressed in the cluster in question, with the total number of expressed genes over the total number of genes in that set given in parentheses.
b: P-value for the hypothesis test of equal proportion of genes in both sets falling into the expression cluster (chi-square test with 1 degree of freedom)
c: Genes determined by Aanes et al., [73] to have been expressed in the developing embryo from maternally-derived transcripts.
d: Comparison of all identified zebrafish ohnologs to all zebrafish single-copy (with respect to the TGD) genes, comprising 15,895 of the 19,436 zebrafish genes with gar homologs. See Methods for further details.
e: Comparison of all zebrafish ohnolog pairs found in the 8-species POInT analysis to the corresponding zebrafish single-copy (with respect to the TGD) genes. See Methods for further details.
f: Comparison of zebrafish single-copy genes inferred to have been lost on the common root branch of Fig 1 to zebrafish single-copy genes inferred by POInT to have been lost after the zebrafish/cavefish split (inference confidence ≥ 0.85 in both cases). See Methods for further details.
g: Genes determined by Aanes et al., [73] to have been expressed in the developing embryo prior to the mid-blastula transition (<3.5 hours post-fertilization).
h: Genes determined by Aanes et al., [73] to have been expressed in the developing embryo only after the mid-blastula transition (>3.5 hours post-fertilization).
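The test in footnote b can be sketched for a single row of the table as a chi-square test on a 2×2 table with 1 degree of freedom. The Yates continuity correction used below is an assumption (the footnote does not mention it), and the function name is hypothetical; the counts are those of the maternal-transcript comparison of root-branch versus zebrafish-branch losses (103/1894 and 8/250).

```python
import math


def chi2_equal_proportions(hits1, total1, hits2, total2, yates=True):
    """Chi-square test (1 df) that two sets have the same proportion of hits."""
    a, b = hits1, total1 - hits1   # set 1: in cluster, not in cluster
    c, d = hits2, total2 - hits2   # set 2: in cluster, not in cluster
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Continuity correction (an assumption, not stated in the source).
        diff = max(diff - n / 2.0, 0.0)
    stat = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square with 1 df is erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(stat / 2.0))
    return stat, p


stat, p = chi2_equal_proportions(103, 1894, 8, 250)
print(f"chi-square = {stat:.2f}, P = {p:.2f}")
```

With the correction switched on, the result for this row should land close to the tabulated P-value of 0.18; without it the P-value is somewhat smaller.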
Essentiality and the TGD.
| Essentiality data | Prop. of phenotyped genes with an ohnolog that are essential | Prop. of phenotyped genes without an ohnolog that are essential | P-value (a) |
|---|---|---|---|
| Zebrafish (b) | 0.062 (6/97) | 0.145 (46/318) | |
| Mouse (e) | 0.556 (42/72) | 0.506 (161/318) | 0.53 |
a: P-value for the hypothesis test of equal proportion of essential genes in Dr_Ohno_all vs Dr_Sing_all.
b: Essentiality defined as genes in the ZFIN database [72] phenotyped as “lethal,” “dead” or “inviable.”
c: Numbers in parentheses give the number of essential genes over the total number of ohnologs in the set (Dr_Ohno_all).
d: Numbers in parentheses give the number of essential genes over the total number of single-copy genes in the set (Dr_Sing_all).
e: Essentiality defined by the International Mouse Phenotyping Consortium’s list of essential mouse genes [82, 83].
f: Numbers in parentheses give the number of essential genes over the total number of ohnologs in the set (Dr_Ohno_all). Note that ohnolog pairs in zebrafish are by definition single-copy in gar and mouse, accounting for the smaller number of comparisons.
g: Numbers in parentheses give the number of essential genes over the total number of single-copy genes in the set (Dr_Sing_all).
The TGD and the zebrafish metabolic network.
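| Network statistic | Ohnolog datasets compared | Mean ohnolog value (a) | Mean single-copy gene value (b) | P-value (c) |
|---|---|---|---|---|
| Node degree (d) | All ohnologs vs. all single-copy genes (e) | 30.9 | 21.4 | |
| | POInT ohnologs vs. POInT single-copy genes (f) | 31.5 | 23.2 | |
| Avg. clustering coeff. (g) | All ohnologs vs. all single-copy genes (e) | 0.78 | 0.77 | >0.5 |
| | POInT ohnologs vs. POInT single-copy genes (f) | 0.77 | 0.76 | >0.5 |
| Mean # shortest paths (h) | All ohnologs vs. all single-copy genes (e) | 18020 | 12904 | 0.07 |
| | POInT ohnologs vs. POInT single-copy genes (f) | 19280 | 14132 | 0.18 |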
a: Mean value of the statistic in question for the ohnolog pairs (ohnolog pairs were merged and averaged prior to computing the global average).
b: Mean value of the statistic in question for the single-copy genes.
c: P-value for the hypothesis test of equal mean statistic value for the ohnologs and single-copy genes (Network randomization test; Methods).
d: Number of edges per network node.
e: Comparison of all identified zebrafish ohnologs to all zebrafish single-copy (with respect to the TGD) genes. See Methods for further details.
f: Comparison of all zebrafish ohnolog pairs used in the 8 species POInT analysis to the corresponding zebrafish single-copy (with respect to the TGD) genes. See Methods for further details.
g: Ratio of the number of edges between each triplet of nodes to the maximum number of such connections possible [85].
h: The mean of the number of shortest paths through the network that pass through a given node, also known as betweenness-centrality [86].
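Two of the statistics above, node degree (footnote d) and the clustering coefficient (footnote g), can be illustrated on a toy undirected network. This is a minimal sketch using the standard local clustering coefficient; the edge list and function names are hypothetical, and the actual zebrafish metabolic network is not reproduced here.

```python
from itertools import combinations

# Hypothetical toy network: a triangle A-B-C with a pendant node D attached to C.
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]


def build_adjacency(edges):
    """Build an undirected adjacency map from an edge list."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def node_degree(adjacency, node):
    """Number of edges attached to the node."""
    return len(adjacency[node])


def clustering_coefficient(adjacency, node):
    """Edges among the node's neighbors divided by the maximum possible."""
    neighbors = adjacency[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # fewer than two neighbors: no neighbor pair can be linked
    linked = sum(1 for u, v in combinations(neighbors, 2) if v in adjacency[u])
    return linked / (k * (k - 1) / 2)


adjacency = build_adjacency(EDGES)
for node in sorted(adjacency):
    print(node, node_degree(adjacency, node), clustering_coefficient(adjacency, node))
```

In the table, values like these are averaged over ohnologs and over single-copy genes before the two means are compared.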