Literature DB >> 31842736

Distinguishing successive ancient polyploidy levels based on genome-internal syntenic alignment.

Yue Zhang¹, Chunfang Zheng¹, David Sankoff².

Abstract

BACKGROUND: A basic tool for studying the polyploidization history of a genome, especially in plants, is the distribution of duplicate gene similarities in syntenically aligned regions of a genome. This distribution can usually be decomposed into two or more components identifiable by peaks, or local maxima, each representing a different polyploidization event. The distributions may be generated by means of a discrete time branching process, followed by a sequence divergence model. The branching process, as well as the inference of fractionation rates based on it, requires knowledge of the ploidy level of each event, which cannot be directly inferred from the pair similarity distribution.
RESULTS: For a sequence of two events of unknown ploidy, either tetraploid, giving rise to whole genome doubling (WGD), or hexaploid, giving rise to whole genome tripling (WGT), we base our analysis on triples of similar genes. We calculate the probability of the four triplet types with origins in one or the other event, or both, and impose a mutational model so that the distribution resembles the original data. Using a ML transition point in the similarities between the two events as a discriminator for the hypothesized origin of each similarity, we calculate the predicted number of triplets of each type for each model combining WGT and/or WGD. This yields a predicted profile of triplet types for each model. We compare the observed and predicted triplet profiles for each model to confirm the polyploidization history of durian, poplar and cabbage.
CONCLUSIONS: We have developed a way of inferring the ploidy of up to three successive WGD and/or WGT events by estimating the time of origin of each of the similarities in triples of genes. This may be generalized to a larger number of events and to higher ploidies.

Entities: Chemical Disease Species

Keywords: Branching process; Gene triples; Plant genomes; Polyploidy; Whole genome doubling; Whole genome tripling

Mesh：

Year: 2019 PMID： 31842736 PMCID： PMC6915858 DOI： 10.1186/s12859-019-3202-x

Source DB: PubMed Journal: BMC Bioinformatics ISSN： 1471-2105 Impact factor: 3.169

Background

Given the pervasiveness of whole genome doubling (WGD) and tripling (WGT) in the ancestral lineages of plant species, a widespread feature of plant genome publications is the display of the distribution of duplicate gene identities (or similarities, distances, K,...). This is illustrated in Fig. 1, which represents the distribution of similarities between syntenically aligned duplicate genes [1, 2] in the durian (Durio zibethinus) genome [3]. The two visually distinguishable but overlapping parts of the distribution are the legacy of two successive polyploidization events. The means (t1 and t2), variances and proportion of the total sample of each component of the distribution can be estimated by mixtures of models techniques such as EMMIX [4].

Fig. 1

Distribution of gene pair similarities. Pairs in the Durio zibethinus (COGE ID 51764) genome after two rounds of whole genome tripling. Discrimination point H=85.2%. Cut-off for pairs not originating in polyploidization > 98% These distributions can be explained and generated by a discrete-time branching process model of polyploidization and fractionation (not time-homogeneous), mathematically represented by the product of successive r-nomial distributions, the output of one being the input of the next, where the parameters, namely the r-nomial probabilities, express resistance to gene loss through fractionation [5-9]. These parameters, which are key to understanding the cycle of polyploidization and fractionation, can be estimated using the information inferred from the distribution of gene pair similarities. One aspect of the branching process that cannot be inferred from the study of pairs only is the ploidy r of the various events. Thus, durian is known to have undergone two whole genome triplings [9, 10], the γ tripling almost 120 million years ago [11], ancestral to most flowering plants, and a more recent tripling (10-20 Mya) not shared by even closely related species, like cacao. By just looking at the distribution of similarities in Fig. 1 engendered by these events, however, there is no direct way of knowing whether one or both of the two component distributions represent WGD, WGT, or other polyploid events. To resolve this problem, the main point of this presentation, we propose to add triples of similar (> 50 %) genes to the study of gene pairs. Our technique, based on the branching process, responds to our concern in previous ad hoc treatments [7, 12] of how to use triples rigorously from a statistical point of view. In the next section, we summarize the general branching process approach to analyzing the distribution of gene pair similarities. We then focus on four competing two-event models involving WGD and/or WGT. We define four types of gene triplet according to whether the gene pairs within them were created by the first event, the second event, or both. Within each model, we calculate the expected number of triplets of each type. Thus creates an “underlying” profile of triplet distribution to compare to the “observed” profile of triplets in the data. Because of the way the two components of the pair similarity distribution overlap, however, the origin of each triplet in the data is not always obvious. Thus we create a “predicted” profile of triplet distribution by grafting a paralog divergence model onto the branching process, making use of a maximum likelihood dividing point between the two components. We apply this analysis to the genomes of durian, poplar Populus trichocarpa [13] and cabbage Brassica oleracea [14], each of which has a different sequence of polyploidization events. These histories are captured correctly for the first two, but the results for B. oleracea prompt an extension to three-event models, which we carry out, and suggest further work to higher numbers of events.

The branching process and two-event models

Denote by m the total number of individuals (genes)at time t,i=1,…,n. Set m1=1. At time t, i=1,…n−1, each of the m genes is replaced by r≥2 progeny, but only j≥1 of them survive until time t, with probability . Of the total of m genes at time t, let be the number for which j progeny survive until time t, so that The probability distribution of the evolutionary histories represented by the given and the variable is The expected number of genes at time t is This is illustrated by the sample trajectory in Fig. 2, in which a WGT at time t1, with all three progeny surviving – the 3-nomial sample has value 3 – is followed by another independent WGT at time t2 where the three lineages show one, two or all three offspring surviving, i.e., the independent 3-nomials samples have values 1,2 and 3, respectively. We will study the case of two successive polyploidy events, with r1 and r2 taking on values 2 or 3, i.e., WGD or WGT, in all four combinations, i.e., in the set of “models”, denoted (3,3),(3,2),(2,3) and (2,2). Because of the limited information that can be inferred about each component of the distribution of similarities, we can only infer the probabilities of samples of value 1, 2, or 3, so that we are limited to 2-nomials and, with some assumptions, 3-nomials, by far the biologically most important cases.

Fig. 2

Sample trajectory, starting from a single gene, of branching process based on two whole genome triplings

Sample trajectory, starting from a single gene, of branching process based on two whole genome triplings To infer parameters like fractionation rate in the polyploidization history of a genome, based on the distribution of gene pair similarities, we need to know the ploidies r of the various events. This motivates us to extend our study from gene pairs only to also include gene triplets.

Methods

Triplet probabilities in four models

With no loss of generality, we study triplets of similarities among three genes, rather than the triplets of genes themselves. A triplet is a (multi-)set {t,t,t}, where each of i,j and k may be 1 or 2. Let . We classify each kind of triplet we according to whether each of the three paralogies among the three pairs of gene originates in the first event or the second event. Thus in the branching process illustration in Fig. 2 the blue dots represent genes that form a {t2,t2,t2} triplet and the red dots form a {t1,t1,t2} triplet. The single red dot combines with the three pairs of blue genes to form three additional {t1,t1,t2} triplets. And there are a further nine {t1,t1,t2} triplets in the sample. We can calculate the expected number of triplets of each type by enumerating the triplets of each type in each possible trajectory of the process, and multiplying by the probability of this trajectory from Expression (2). The enumeration within a trajectory is easily done by considering every triple of genes at time t and identifying the last common ancestor of each pair. The probabilities of the trajectories have previously been calculated [5-7]. We then sum these results over all trajectories. These are listed in Table 1.

Table 1

Formulae for the expected numbers of triplets W(Δ) of each type Δ, by branching model M

Model \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$M\in \mathscr {M}$\end{document}M∈ℳ	Tripling-	Tripling-	Doubling-	Doubling-
	tripling	doubling	tripling	doubling
Triplet \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\Delta \in \mathscr {T}$\end{document}Δ∈T	(3,3)	(3,2)	(2,3)	(2,2)
{t₁,t₁,t₁}	u^′(1+2v^′+v)³	u^′(1+v)³	-	-
{t₁,t₁,t₂}	2(3u^′+u)(3v^′+v)	2(3u^′+u)	2u(3v^′+v)	(2u)v(1+v)
	×(1+2v^′+v)	×v(1+v)	×(1+2v^′+v)
{t₁,t₂,t₂}	-	-	-	-
{t₂,t₂,t₂}	(1+2u^′+u)v^′	-	(1+u)v^′	-

u= probability that two progeny survive after the first polyploidization event. u′= probability that three survive. Similarly v and v′ are the probabilities that two or three progeny survive, respectively, after the second event

Formulae for the expected numbers of triplets W(Δ) of each type Δ, by branching model M u= probability that two progeny survive after the first polyploidization event. u′= probability that three survive. Similarly v and v′ are the probabilities that two or three progeny survive, respectively, after the second event This table provides the expected number of triplets W(Δ) of each type produced by the branching process for each model . For a given genome, the four numerical values of W(·) constitute the underlying profile of the model M. The underlying profiles for each model based on maximum likelihood values of u and v are given in the top half of Tables 2, 3 and 4 below. Because of the limited number of parameters that can be inferred from the distribution of similarities, we assume u′=u2 and v′=v2.

Table 2

Durian model predictions before (underlying) and after imposition of mutational divergence

Shaded column indicates the model predicted by the literature and the closest fit to the observed profile

Table 3

Populus model predictions before (underlying) and after imposition of mutational divergence

Shaded column indicates the model predicted by the literature and the closest fit to the observed profile

Table 4

Brassica oleraceae model predictions before (underlying) and after imposition of mutational divergence

In the two shaded columns, the expected (2,3) profile does not fit the observed pattern as well as the (3,3) profile

Durian model predictions before (underlying) and after imposition of mutational divergence Shaded column indicates the model predicted by the literature and the closest fit to the observed profile Populus model predictions before (underlying) and after imposition of mutational divergence Shaded column indicates the model predicted by the literature and the closest fit to the observed profile Brassica oleraceae model predictions before (underlying) and after imposition of mutational divergence In the two shaded columns, the expected (2,3) profile does not fit the observed pattern as well as the (3,3) profile It can be seen in Table 1 that the profiles of triplet types produced by the different models are very different. If we could observe the triplet profile produced by the branching process underlying a given set of data, we could easily identify which model was responsible. However, we only see the data after mutational processes have applied. When a mutational divergence model is applied to the similarities, a single trajectory of the branching process, producing one ideal type of triplet, can produce many very different data triplets. We can try to categorize the set of triplets in a set of data by how closely they resemble one of the four basic types. If the two components of the similarity distribution were completely separate, this would also be an easy matter. But the usual large overlap between the components means that we cannot automatically ascribe any data triplet to any particular underlying triplet.

A statistical approach

As a solution to this problem, we first try to find the best transition, or cutoff point H somewhere between the peaks of the two components. For this we compute the product of the probability density at each similarity value less than H, according to the component with mean at t1, and the density at each similarity value greater than H, according to the component with mean at t2, and maximize with respect to H. I.e., We then categorize the triplets in the data according to the transition value H. If a similarity x is less than H we classify it as being produced at time t1, and we write x∈I. If it is greater than H, we classify it as being created at time t2 and we write x∈J. This creates eight “octants”, defined by the 8=23 combination of the three triplet similarities, which in turn are collapsed into the four types of triplet in tabulated in Table 1, {t1,t1,t1},{t1,t1,t2}(representing three octants), {t1,t2,t2} (representing three octants) and {t2,t2,t2}. The number of triplets of the four types we call the observed profile. Although we can compare the observed profile with the underlying profile, this comparison is not too meaningful since it neglects the fact that many of the data triplets classified as one type may be generated by a different underlying type, not as an error, but simply as a result of the normal process of duplicate gene sequence divergence clearly operative in the more or less dispersed and overlapping components of the distribution of gene pair similarities. We can, however, take this process into account in producing a predicted profile for each model. We first calculate the variance-covariance matrix Σ of the t1 similarities in triplets containing at least of them and t2 similarities in triplets containing at least two of these. We fixed covar(t1,t2)=0, in accordance with the Markov nature of the branching process. For each model , we construct the predicted profile of triplet types by integrating over the trivariate normal with means drawn from the EMMIX analysis or identified by eye with the distribution component peaks, and covariance estimated as above, restricted to the domains defined by the transition point. Thus our prediction of {t1,t1,t1} triplets would involve a restriction to the domain (I,I,I) where all three coordinates are less than or equal to H. Our prediction of apparently {t1,t1,t2} triplets would be confined to the three octants where two coordinates are less than or equal to H and one is greater. The integrals are weighted by W(Δ), the expected number of triplets. For example, the predicted number of {t1,t1,t2} triplets would be; where μ is the vector of component means. To summarize, we have defined three types of triplet profile: the observed profile, based on triples of genes all having high similarity scores with each other, which can be compiled from the list of gene pairs produced by the SYNMAP function of COGE [1, 2], the underlying profile for each model , produced by the combinatorial probabilities of the branching process for each model , the predicted profile for each model , produced by grafting a gene pair divergence model on the underlying profile. For comparative purposes we normalize the underlying and predicted profiles so that the total number of triples is the same as the observed profiles.

Results

We compare the three profiles for three well-studied flowering plant genomes that are known to have undergone multiple polyploidizations in the last 120 million years, to see if our method predicts the right combination of WGT and WGD.

Durian

Starting with the durian genome, the (3,3) model, known to represent true evolutionary history, is the only one with a credible prediction profile in Table 2, the only one that has reasonable values for all four triplet types. The three others all fail to predict one or both of the {t1,t1,t1} and {t2,t2,t2} triplets. This indicates the potential of our statistical method, since the original durian sequence article [3] did not recognize the second event as a tripling.

Poplar

The predicted profile of the (3,2) model in Table 3 summarizes the true history of the Populus trichocarpa genome (COGE ID 25127), whose gene pair similarity distribution is displayed in Fig. 3. Along with γ, this shares the ancient “salicoid” WGD with other members of the Salicaceae family [15]. (3,2) is the only model that correctly identifies both the γ event as a WGT, and the more recent event as a WGD.

Fig. 3

Distribution of syntenic gene pair similarities in Populus trichocarpa. Discrimination point H=84.5%. Cut-off for pairs not originating in polyploidization > 97.5%

Cabbage

The recent ancestor of Brassica oleracea genome (COGE ID 26018), underwent a WGT that gave rise not only to the crucifers and mustard genera, but also radishes and other related genera. Early than that a WGD called the α doubling is apparent in the whole range of family Brassicacea genera, including Arabidopsis. A still earlier WGD, the β doubling, occurred in the order Brassicales lineage that includes the Brassicaceae. Thus the cabbage genome counts γ,β,α and a Brassica WGT in its evolutionary history [16, 17]. In Fig. 4, we see that at least the two recent components are clearly distinguishable, so we first carried out an analysis excluding gene pairs of similarity less than 76%.

Fig. 4

Distribution of syntenic gene pair similarities in Brassica oleracea. Discrimination point H=83.3%. Cut-off for pairs not originating in two recent polyploidization ≤76%

Distribution of syntenic gene pair similarities in Brassica oleracea. Discrimination point H=83.3%. Cut-off for pairs not originating in two recent polyploidization ≤76% This analysis in Table 4 does not give satisfactory results. Indeed, the (3,3) profile matches the observed profile much better than the (2,3) profile does. This can be partially attributed to substantial number of similarities generated by the β and even γ doublings greater than 76%. We can partially correct this by adding a third event to our branching process. This leads to eight models instead of four, and ten kinds of triplet, summarized in Table 5.

Table 5

Formulae for the expected numbers of triplets after three events

Model \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$M\in \mathscr {M}$\end{document}M∈ℳ
Triplet \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\Delta \in \mathscr {T}$\end{document}Δ∈T	(2,2,2)	(3,2,2)	(2,3,2)
{t1,t1,t1}	-	u^′(1+v)³(1+w)³	-
{t1,t1,t2}	2uv(1+v)(1+w)³	2(3u^′+u)v(1+v)(1+w)³	2u(3v^′+v)
			×(1+2v^′+v)(1+w)³
{t1,t2,t2}	-	-	-
{t2,t2,t2}	-	-	(1+u)v^′(1+w)³
{t2;t2;t3}	2(1+u)vw(1+w)	2(1+2u^′+u)vw(1+w)	2(1+u)(3v^′+v)w(1+w)
{t2,t3,t3}	-	-	-
{t3,t3,t3}	-	-	-
{t1,t1,t3}	2u(1+v)²w(1+w)	2(3u^′+u)(1+v)²w(1+w)	2u(1+2v^′+v)²w(1+w)
{t1,t3,t3}	-	-	-
{t1,t2,t3}	-	-	-
model \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$M\in \mathscr {M}$\end{document}M∈ℳ
triplet \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\Delta \in \mathscr {T}$\end{document}Δ∈T	(2,2,3)	(2,3,3)	(3,2,3)
{t1,t1,t1}	-	-	u^′(1+v)³
			×(1+2w^′+w)³
{t1,t1,t2}	2uv(1+v)	2u(3v^′+v)(1+2v^′+v)	2(3u^′+u)v(1+v)
	×(1+2w^′+w)³	×(1+2w^′+w)³	×(1+2w^′+w)³
{t1,t2,t2}	-	-	-
{t2,t2,t2}	-	(1+u)v^′	-
		×(1+2w^′+w)³
{t2;t2;t3}	2(1+u)v	2(1+u)(3v^′+v)	2(1+2u^′+u)v
	×(3w^′+w)(1+2w^′+w)	×(3w^′+w)(1+2w^′+w)	×(3w^′+w)(1+2w^′+w)
{t2,t3,t3}	-	-	-
{t3,t3,t3}	(1+u)(1+v)w^′	(1+u)(1+2v^′+v)w^′	(1+2u^′+u)(1+v)w^′
{t1,t1,t3}	2u(1+v)²	2u(1+2v^′+v)²	2(3u^′+u)(1+v)²
	×(3w^′+w)(1+2w^′+w)	×(3w^′+w)(1+2w^′+w)	×(3w^′+w)(1+2w^′+w)
{t1,t3,t3}	-	-	-
{t1,t2,t3}	-	-	-
model \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$M\in \mathscr {M}$\end{document}M∈ℳ
triplet \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\Delta \in \mathscr {T}$\end{document}Δ∈T	(3,3,2)	(3,3,3)
{t1,t1,t1}	u^′(1+2v^′+v)³(1+w)³	u^′(1+2v^′+v)³(1+2w^′+w)³
{t1,t1,t2}	2(3u^′+u)(3v^′+v)(1+2v^′+v)(1+w)³	2(3u^′+u)(3v^′+v)(1+2v^′+v)(1+2w^′+w)³
{t1,t2,t2}	-	-
{t2,t2,t2}	(1+2u^′+u)v^′(1+w)³	(1+2u^′+u)v^′(1+2w^′+w)³
{t2;t2;t3}	2(1+2u^′+u)(3v^′+v)w(1+w)	2(1+2u^′+u)(3v^′+v)(3w^′+w)(1+2w^′+w)
{t2,t3,t3}	-	-
{t3,t3,t3}	-	(1+2u^′+u)(1+2v^′+v)w^′
{t1,t1,t3}	2(3u^′+u)(1+2v^′+v)²w(1+w)	2(3u^′+u)(1+2v^′+v)²(3w^′+w)(1+2w^′+w)
{t1,t3,t3}	-	-
{t1,t2,t3}	-	-

Formulae for the expected numbers of triplets after three events u= probability that two progeny survive after the first polyploidization event. u′= probability that three survive. Similarly v and v′ are the probabilities that two or three progeny survive, respectively, after the second event. w and w′ are the probabilities that two or three progeny survive after the third event We fix the mean of the first component at 71% to account for the γ event, that for the second component, representing the α event, at 79.5% and we find two ML discrimination points, as in Fig. 5.

Fig. 5

Distribution decomposed into three events. Discrimination points H1=73%,H2=85%

Distribution decomposed into three events. Discrimination points H1=73%,H2=85% The results of this are shown in Table 6. Here the (3,2,3) model is just as close to the observed profile than the competing (3,3,3) model, with the notable exception of {t2,t2,t2} triplets. The absence of a distinction between the α and β events means that the similarities they generate are all conflated to yield an excess of t2, and consequently an excess of t2 triples, so that a WGT is inferred rather than a WGD.

Table 6

Brassica oleracea three-event model predictions before (underlying) and after imposition of mutational divergence

In the two shaded columns, the expected (3,2,3) profile fits the observed pattern as well as the (3,3,3) profile

Brassica oleracea three-event model predictions before (underlying) and after imposition of mutational divergence In the two shaded columns, the expected (3,2,3) profile fits the observed pattern as well as the (3,3,3) profile The obvious remedy for this would be to construct four-event models (sixteen of them), with profiles consisting of 20 different triplets. We leave this for further work. In general the number of models is exponential: 2 for m events, while the number of triples follows the polynomial (i.e., cubic) tetrahedral sequence (A00292 in [18]) , so that eventually there would not be enough data to discriminate among the models. Choosing among models with different numbers of events would require some standard for model selection such as the Akaike or Bayesian information criteria.

Conclusions

We model the process of fractionation to account for the distribution of gene pair similarities after a number of whole genome doublings, triplings, etc., each followed by a period of duplicate gene loss. The model is a discrete-time branching process, with synchronous birth number r≥2 across the i−th generation population and deaths determined by a r-nomial law conditioned on at least one survivor. The observations of gene pair similarities consist of a mixture of normals, each component generated by one event, with the event time estimated by the sequence divergence from the event to the present. Despite the overlapping distributions, we can estimate the mean (via a local mode), standard deviation and proportion of the sample. Statistics on gene pairs alone do not allow us to infer r, so we introduce the study of gene triplets. We find formulae for the expected number of each kind of triplet, categorized as to which events produced the similarities among the three pairs of genes. We develop a way of grafting a gene divergence model on this underlying profile of triplets to produce a predicted profile of the number of triplets of each kind. This can then be compared with the observed number of triplets.

Further work

Distinguishing among the four models combining tetraploidization or hexaploidization in two successive events is the simplest example of a more general problem. The theoretical way is clear to extending these ideas to include, for example, octoploidization through the extraction of quadruplets instead of triplets. In addition, there is a straight forward extension to the case of three or more successive polyploidization events, which we have undertaken in the study of Brassica oleracea. Here, three events, three normal components and two transition points are estimated from the distribution of similarities. The combinatorial probabilities have been worked out for this case and many others, and the methodology is available to complete this. It is true that in some cases, such as that we presented in [9], concerning Durio zibethinus, the ploidy is evident from the clear presence in the SYNMAP self-comparison dotplots of sets of r regions covering a large proportion of the genome, each set represented by exactly synteny blocks showing synteny among all r regions. (r=3 in the case of Durio.) Clear cases like this are rare, however, especially for genomes where the last polyploidization is more remote in time. In previous work [7], we used additional information, beyond that contained in the similarity distribution, to confirm the recent hexaploidization of Brassica rapa against the alternative of tetraploidization. This kind of data, however, namely speculation about the number of single-copy genes in the current genome, was extremely subjective in that report, and is unreliable even when assessed by the best available methods on well-assembled and annotated genomes. A distribution of gene pair similarities is generated in the comparison of two related genomes as well as in the self-comparison of a single genome. The number of orthologous gene pairs available when comparing two related genomes is generally much greater than the number of paralogous pairs identified in the self-comparison of two genomes, simply because the loss by fractionation of one copy of a duplicated gene does not eliminate all related orthology pairs: the other remaining copy and its orthology pairs remain intact. This suggests an avenue to improved accuracy of polyploidy levels inference. The larger number of data, however, may not always compensate for the fact that the speciation component of the similarity distribution is always the most recent one [9], so that the more remote (earlier) components associated with polyploidy become statistically less clear and informative.

15 in total

1. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray).

Authors: G A Tuskan; S Difazio; S Jansson; J Bohlmann; I Grigoriev; U Hellsten; N Putnam; S Ralph; S Rombauts; A Salamov; J Schein; L Sterck; A Aerts; R R Bhalerao; R P Bhalerao; D Blaudez; W Boerjan; A Brun; A Brunner; V Busov; M Campbell; J Carlson; M Chalot; J Chapman; G-L Chen; D Cooper; P M Coutinho; J Couturier; S Covert; Q Cronk; R Cunningham; J Davis; S Degroeve; A Déjardin; C Depamphilis; J Detter; B Dirks; I Dubchak; S Duplessis; J Ehlting; B Ellis; K Gendler; D Goodstein; M Gribskov; J Grimwood; A Groover; L Gunter; B Hamberger; B Heinze; Y Helariutta; B Henrissat; D Holligan; R Holt; W Huang; N Islam-Faridi; S Jones; M Jones-Rhoades; R Jorgensen; C Joshi; J Kangasjärvi; J Karlsson; C Kelleher; R Kirkpatrick; M Kirst; A Kohler; U Kalluri; F Larimer; J Leebens-Mack; J-C Leplé; P Locascio; Y Lou; S Lucas; F Martin; B Montanini; C Napoli; D R Nelson; C Nelson; K Nieminen; O Nilsson; V Pereda; G Peter; R Philippe; G Pilate; A Poliakov; J Razumovskaya; P Richardson; C Rinaldi; K Ritland; P Rouzé; D Ryaboy; J Schmutz; J Schrader; B Segerman; H Shin; A Siddiqui; F Sterky; A Terry; C-J Tsai; E Uberbacher; P Unneberg; J Vahala; K Wall; S Wessler; G Yang; T Yin; C Douglas; M Marra; G Sandberg; Y Van de Peer; D Rokhsar
Journal: Science Date: 2006-09-15 Impact factor: 47.728

Review 2. How to usefully compare homologous plant genes and chromosomes as DNA sequences.

Authors: Eric Lyons; Michael Freeling
Journal: Plant J Date: 2008-02 Impact factor: 6.417

3. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla.

Authors: Olivier Jaillon; Jean-Marc Aury; Benjamin Noel; Alberto Policriti; Christian Clepet; Alberto Casagrande; Nathalie Choisne; Sébastien Aubourg; Nicola Vitulo; Claire Jubin; Alessandro Vezzi; Fabrice Legeai; Philippe Hugueney; Corinne Dasilva; David Horner; Erica Mica; Delphine Jublot; Julie Poulain; Clémence Bruyère; Alain Billault; Béatrice Segurens; Michel Gouyvenoux; Edgardo Ugarte; Federica Cattonaro; Véronique Anthouard; Virginie Vico; Cristian Del Fabbro; Michaël Alaux; Gabriele Di Gaspero; Vincent Dumas; Nicoletta Felice; Sophie Paillard; Irena Juman; Marco Moroldo; Simone Scalabrin; Aurélie Canaguier; Isabelle Le Clainche; Giorgio Malacrida; Eléonore Durand; Graziano Pesole; Valérie Laucou; Philippe Chatelet; Didier Merdinoglu; Massimo Delledonne; Mario Pezzotti; Alain Lecharny; Claude Scarpelli; François Artiguenave; M Enrico Pè; Giorgio Valle; Michele Morgante; Michel Caboche; Anne-Françoise Adam-Blondon; Jean Weissenbach; Francis Quétier; Patrick Wincker
Journal: Nature Date: 2007-08-26 Impact factor: 49.962

4. Models for Similarity Distributions of Syntenic Homologs and Applications to Phylogenomics.

Authors: David Sankoff; Chunfang Zheng; Yue Zhang; Joao Meidanis; Eric Lyons; Haibao Tang
Journal: IEEE/ACM Trans Comput Biol Bioinform Date: 2018-07-31 Impact factor: 3.710

5. Brassicales phylogeny inferred from 72 plastid genes: A reanalysis of the phylogenetic localization of two paleopolyploid events and origin of novel chemical defenses.

Authors: Patrick P Edger; Jocelyn C Hall; Alex Harkess; Michelle Tang; Jill Coombs; Setareh Mohammadin; M Eric Schranz; Zhiyong Xiong; James Leebens-Mack; Blake C Meyers; Kenneth J Sytsma; Marcus A Koch; Ihsan A Al-Shehbaz; J Chris Pires
Journal: Am J Bot Date: 2018-03-25 Impact factor: 3.844

6. Finding and comparing syntenic regions among Arabidopsis and the outgroups papaya, poplar, and grape: CoGe with rosids.

Authors: Eric Lyons; Brent Pedersen; Josh Kane; Maqsudul Alam; Ray Ming; Haibao Tang; Xiyin Wang; John Bowers; Andrew Paterson; Damon Lisch; Michael Freeling
Journal: Plant Physiol Date: 2008-10-24 Impact factor: 8.340

7. Paleopolyploidy in the Brassicales: analyses of the Cleome transcriptome elucidate the history of genome duplications in Arabidopsis and other Brassicales.

Authors: Michael S Barker; Heiko Vogel; M Eric Schranz
Journal: Genome Biol Evol Date: 2009-10-05 Impact factor: 3.416

8. The draft genome of tropical fruit durian (Durio zibethinus).

Authors: Bin Tean Teh; Kevin Lim; Chern Han Yong; Cedric Chuan Young Ng; Sushma Ramesh Rao; Vikneswari Rajasegaran; Weng Khong Lim; Choon Kiat Ong; Ki Chan; Vincent Kin Yuen Cheng; Poh Sheng Soh; Sanjay Swarup; Steven G Rozen; Niranjan Nagarajan; Patrick Tan
Journal: Nat Genet Date: 2017-10-09 Impact factor: 38.330

9. A branching process for homology distribution-based inference of polyploidy, speciation and loss.

Authors: Yue Zhang; Chunfang Zheng; David Sankoff
Journal: Algorithms Mol Biol Date: 2019-08-01 Impact factor: 1.405

10. The avocado genome informs deep angiosperm phylogeny, highlights introgressive hybridization, and reveals pathogen-influenced gene space adaptation.

Authors: Martha Rendón-Anaya; Enrique Ibarra-Laclette; Alfonso Méndez-Bravo; Tianying Lan; Chunfang Zheng; Lorenzo Carretero-Paulet; Claudia Anahí Perez-Torres; Alejandra Chacón-López; Gustavo Hernandez-Guzmán; Tien-Hao Chang; Kimberly M Farr; W Brad Barbazuk; Srikar Chamala; Marek Mutwil; Devendra Shivhare; David Alvarez-Ponce; Neena Mitter; Alice Hayward; Stephen Fletcher; Julio Rozas; Alejandro Sánchez Gracia; David Kuhn; Alejandro F Barrientos-Priego; Jarkko Salojärvi; Pablo Librado; David Sankoff; Alfredo Herrera-Estrella; Victor A Albert; Luis Herrera-Estrella
Journal: Proc Natl Acad Sci U S A Date: 2019-08-06 Impact factor: 11.205

2 in total

1. GGDB: A Grameneae genome alignment database of homologous genes hierarchically related to evolutionary events.

Authors: Qihang Yang; Tao Liu; Tong Wu; Tianyu Lei; Yuxian Li; Xiyin Wang
Journal: Plant Physiol Date: 2022-08-29 Impact factor: 8.005

2. Integrated synteny- and similarity-based inference on the polyploidization-fractionation cycle.

Authors: Yue Zhang; Zhe Yu; Chunfang Zheng; David Sankoff
Journal: Interface Focus Date: 2021-06-11 Impact factor: 3.906

2 in total