wgd: simple command line tools for the analysis of ancient whole-genome duplications.

Arthur Zwaenepoel^1,2,3, Yves Van de Peer^1,2,3,4.

Abstract

SUMMARY: Ancient whole-genome duplications (WGDs) have been uncovered in almost all major lineages of life on Earth and the search for traces or remnants of such events has become standard practice in most genome analyses. This is especially true for plants, where ancient WGDs are abundant. Common approaches to find evidence for ancient WGDs include the construction of KS distributions and the analysis of intragenomic colinearity. Despite the increased interest in WGDs and the acknowledgment of their evolutionary importance, user-friendly and comprehensive tools for their analysis are lacking. Here, we present wgd, an easy-to-use command-line tool for KS distribution construction. The wgd suite provides commonly used KS and colinearity analysis workflows together with tools for modeling and visualization, making these analyses conveniently accessible to genomics researchers.
AVAILABILITY AND IMPLEMENTATION: wgd is free and open source software implemented in Python and is available at https://github.com/arzwa/wgd.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2019 PMID: 30398564 PMCID: PMC6581438 DOI: 10.1093/bioinformatics/bty915

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

In this era of whole-genome sequencing, many ancient whole-genome duplication (WGD) events have been uncovered across the eukaryotic tree of life (Van de Peer et al., 2017). One of the main approaches for revealing ancient WGDs using genomic data is the construction of whole paranome KS distributions (e.g. Blanc and Wolfe, 2004; Cui et al., 2006; Lynch and Conery, 2000; Vanneste et al.), where KS is the synonymous distance, i.e. the estimated number of synonymous substitutions per synonymous site. Under the assumption of neutral evolution at synonymous sites, the synonymous distance between two coding sequences serves as a proxy for their divergence time. Under a model of continuous small-scale gene duplication (SSD) and loss of duplicated copies not under selection, a whole paranome KS distribution is expected to show an exponential decay of the number of retained duplicates as a function of age (Blanc and Wolfe, 2004; Lynch and Conery, 2000). Against this background of SSDs, large-scale duplication events, such as WGDs, are visible as peaks in the number of retained duplicates at a particular age. Several issues compromise the use of KS distributions for WGD inference, and these were extensively addressed in Vanneste et al. When high-quality genome assemblies are available, analyses based on gene colinearity (often called synteny) may further aid in unveiling WGDs or large segmental duplications (Van de Peer, 2004). WGDs are expected to leave large blocks with high intragenomic colinearity, and paralogs located in such colinear segments (anchor pairs) can therefore be traced back more reliably to a particular event, enabling their use for downstream analyses such as molecular dating (Vanneste et al., 2014) or functional analysis. While these methods have been used frequently in genomics research, no comprehensive and user-friendly software is available to perform these analyses, and researchers have often resorted to custom pipelines.
Here, we fill this gap with an integrated suite for KS- and colinearity-based analysis of ancient WGDs. We briefly discuss the methods implemented here, but refer to the documentation and Supplementary Material for more information.
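
The expected shape of a whole paranome KS distribution described above (an exponentially decaying SSD background with a WGD-derived peak on top) can be illustrated with a small simulation. This is a minimal sketch and not part of wgd: the function names, sample sizes, decay rate and peak location are all arbitrary assumptions chosen for illustration.

```python
import math
import random

def simulate_paranome_ks(n_ssd=5000, n_wgd=1500, decay_rate=1.5,
                         wgd_median=0.8, wgd_sigma=0.2, seed=0):
    """Draw KS values for SSD- and WGD-derived duplicate pairs.

    All default parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    # SSD background: number of retained duplicates decays exponentially with age.
    ssd = [rng.expovariate(decay_rate) for _ in range(n_ssd)]
    # WGD: many duplicates of the same age, with KS approximately log-normal.
    wgd = [rng.lognormvariate(math.log(wgd_median), wgd_sigma)
           for _ in range(n_wgd)]
    return ssd, wgd

def histogram(values, bin_width=0.1, upper=3.0):
    """Count values in equal-width bins on [0, upper)."""
    n_bins = round(upper / bin_width)
    counts = [0] * n_bins
    for v in values:
        i = int(v / bin_width)
        if 0 <= i < n_bins:
            counts[i] += 1
    return counts

ssd, wgd = simulate_paranome_ks()
counts = histogram(ssd + wgd)
ssd_counts = histogram(ssd)

# Text histogram: the WGD shows up as a bump around KS = 0.8.
for i, c in enumerate(counts):
    print(f"{i * 0.1:4.1f} {'#' * (c // 20)}")
```

In the printed histogram the counts fall off from the youngest bin, rise again around the simulated WGD age, and then continue to decay, which is the qualitative signature looked for in empirical KS distributions.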

2 Materials and methods

2.1 Gene family delineation

Delineation of paralogous gene families and one-to-one orthologs starts from all-versus-all BLASTp similarity searches or precomputed BLAST results and is performed using ‘wgd mcl’. For whole paranome delineation, MCL (van Dongen, 2000) is then used to cluster sequences into paralogous gene families. One-to-one orthologs are determined using the commonly employed reciprocal best hit strategy.
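
The reciprocal best hit strategy mentioned above can be sketched as follows. This is a generic illustration, not the code used by ‘wgd mcl’: the function names and the (query, subject, score) tuple layout are assumptions, and ties are resolved by keeping the first hit seen.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, score) tuples, e.g. parsed from
    tabular BLAST output (the exact layout is an assumption).
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)

# Toy hit lists for two hypothetical species A and B.
a_vs_b = [("a1", "b1", 200), ("a1", "b2", 150),
          ("a2", "b2", 180), ("a3", "b1", 90)]
b_vs_a = [("b1", "a1", 198), ("b2", "a2", 175),
          ("b2", "a1", 140), ("b3", "a3", 60)]

orthologs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(orthologs)
```

Here a3 is not reported as an ortholog of b1, because b1's best hit is a1 rather than a3.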

2.2 KS distribution construction

A KS distribution for a set of paralogous families or one-to-one orthologs can be constructed using the ‘wgd ksd’ subcommand, which closely follows the approach used by Vanneste et al. We refrain from a full description of the methodology here and refer to the Supplementary Material instead.

2.3 Colinearity analyses

When high-quality structural genome annotations are available, the ‘wgd syn’ tool allows the identification of intragenomic colinear blocks and their corresponding anchor pairs using I-ADHoRe 3.0 (Proost et al.). Whole-genome syntenic dotplots are generated, and if a KS distribution is provided, KS-colored dotplots and anchor pair KS distributions are generated as well (Fig. 1).
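
Combining anchor pairs with precomputed KS values, as described above, amounts to a lookup followed by a per-block summary. The sketch below illustrates this with plain dictionaries; the data layout, gene identifiers and function names are hypothetical and do not reflect the file formats of wgd or I-ADHoRe.

```python
from collections import defaultdict
from statistics import median

def pair_key(gene1, gene2):
    """Order-independent key for a gene pair."""
    return tuple(sorted((gene1, gene2)))

def anchor_ks_by_block(ks_table, anchors):
    """Collect KS values of anchor pairs per colinear block.

    ks_table: {pair_key(gene1, gene2): KS}
    anchors:  iterable of (block_id, gene1, gene2)
    Anchor pairs without a KS estimate are skipped.
    """
    per_block = defaultdict(list)
    for block_id, gene1, gene2 in anchors:
        ks = ks_table.get(pair_key(gene1, gene2))
        if ks is not None:
            per_block[block_id].append(ks)
    return dict(per_block)

# Hypothetical pairwise KS estimates and anchor pairs.
rows = [("g1", "g7", 0.71), ("g8", "g2", 0.80), ("g3", "g9", 0.65),
        ("g4", "g12", 2.10), ("g5", "g13", 1.90), ("g6", "g20", 0.05)]
ks_table = {pair_key(a, b): ks for a, b, ks in rows}
anchors = [("block1", "g1", "g7"), ("block1", "g2", "g8"),
           ("block1", "g3", "g9"), ("block2", "g4", "g12"),
           ("block2", "g5", "g13")]

per_block = anchor_ks_by_block(ks_table, anchors)
# Median KS per block, e.g. for coloring blocks in a dotplot.
block_median = {block: median(values) for block, values in per_block.items()}
# Anchor pair KS distribution: all anchor KS values pooled.
anchor_ks = [ks for values in per_block.values() for ks in values]
print(block_median)
```

The pair (g6, g20) has a KS value but is not an anchor pair, so it contributes to the whole paranome distribution only and not to the anchor pair distribution.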

Fig. 1.

Illustration of the various tools and visualizations in wgd. (A) Arabidopsis thaliana and Carica papaya paranome KS distributions overlaid with the KS distribution of anchor pairs for A. thaliana and the KS distribution of one-to-one orthologs of C. papaya and A. thaliana. (B) Mixture of three log-normal distributions fitted to the KS distribution of A. thaliana, using the Variational Bayes algorithm with γ = 10^-3. (C) Plot showing the probability of belonging to a particular component of the mixture shown in (B) as a function of KS. These probabilities can be used to define component-wise paralogs for further downstream analyses. (D) KS-colored dotplot for A. thaliana, showing colinear blocks identified by I-ADHoRe, colored by their median KS value. (E) Interactive histogram visualization (user interface not shown, see Supplementary Fig. S1), showing the whole paranome KS distributions using histograms and kernel density estimates for A. thaliana and C. papaya together with the KS distribution of one-to-one orthologs in these species. We refer to the Supplementary Material for detailed methods.

2.4 Kernel density estimation and mixture modeling

Downstream analyses of KS distributions have often consisted of fitting statistical models and visualizing them. We provide tools (‘wgd kde’) for fitting kernel density estimates (KDEs). Importantly, we apply a correction for boundary effects, which are often neglected but may lead to artificial peaks in low KS regions. As peaks derived from WGDs are expected to be approximately log-normally distributed, Gaussian mixture models (GMMs) have also been used frequently to analyze KS distributions. We provide tools (‘wgd mix’) for fitting mixtures of log-normal components using different inference algorithms, implemented using the scikit-learn Python library (Pedregosa et al.). Common approaches to determine the optimal number of components, based on the Akaike or Bayesian information criterion, are provided. However, we warn prospective users to interpret ‘significant’ components carefully, as these GMMs may strongly overfit the empirical distribution (Tiley et al., 2018).
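
Fitting a mixture of log-normal components to KS values is equivalent to fitting a Gaussian mixture to log-transformed KS values. The sketch below illustrates this with a small hand-written expectation-maximization (EM) routine and the Bayesian information criterion. It is a stand-in for illustration only: it does not use scikit-learn, it is not the Variational Bayes algorithm, and all function names, the initialization scheme and the simulated data are assumptions.

```python
import math
import random

def normal_logpdf(x, mu, sigma):
    return (-0.5 * math.log(2 * math.pi * sigma ** 2)
            - (x - mu) ** 2 / (2 * sigma ** 2))

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def component_logps(y, model):
    """log(weight * density) of each component at log-KS value y."""
    return [math.log(w) + normal_logpdf(y, mu, s) for w, mu, s in model]

def component_probabilities(ks_value, model):
    """Posterior probability of each component for one KS value."""
    logp = component_logps(math.log(ks_value), model)
    total = logsumexp(logp)
    return [math.exp(lp - total) for lp in logp]

def fit_lognormal_mixture(ks, k, n_iter=100, floor=1e-6):
    """EM for a k-component Gaussian mixture on log(KS).

    Returns (model, y) with model a list of (weight, mu, sigma) tuples
    on the log scale and y the sorted log-transformed data.
    """
    y = sorted(math.log(v) for v in ks if v > 0)
    n = len(y)
    mean_y = sum(y) / n
    sd_y = max(math.sqrt(sum((v - mean_y) ** 2 for v in y) / n), floor)
    # Deterministic start: means at evenly spaced quantiles.
    model = [(1.0 / k, y[int((j + 0.5) * n / k)], sd_y) for j in range(k)]
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each point.
        resp = [component_probabilities(math.exp(v), model) for v in y]
        # M-step: update weights, means and standard deviations.
        new_model = []
        for j in range(k):
            nj = max(sum(r[j] for r in resp), floor)
            mu = sum(r[j] * v for r, v in zip(resp, y)) / nj
            var = sum(r[j] * (v - mu) ** 2 for r, v in zip(resp, y)) / nj
            new_model.append((max(nj / n, floor), mu,
                              max(math.sqrt(var), floor)))
        model = new_model
    return model, y

def bic(y, model):
    """BIC on the log scale; lower is better."""
    loglik = sum(logsumexp(component_logps(v, model)) for v in y)
    n_params = 3 * len(model) - 1
    return n_params * math.log(len(y)) - 2 * loglik

# Simulated KS values from two log-normal components (medians 0.1 and 1.0).
rng = random.Random(1)
ks = ([rng.lognormvariate(math.log(0.1), 0.3) for _ in range(300)]
      + [rng.lognormvariate(math.log(1.0), 0.2) for _ in range(300)])

fits = {k: fit_lognormal_mixture(ks, k) for k in (1, 2, 3)}
bics = {k: bic(y, model) for k, (model, y) in fits.items()}
for k in sorted(bics):
    print(k, round(bics[k], 1))
```

The `component_probabilities` function gives the kind of per-component membership probabilities shown in Fig. 1C. As cautioned above, a lower information criterion for a larger number of components does not by itself indicate an additional WGD.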

2.5 Interactive visualization

Lastly, we provide tools for (interactive) visualization of histograms and KDEs in ‘wgd viz’ (Fig. 1). These tools allow visualization of multiple KS distributions for comparative purposes as well as modification of key visualization parameters such as the histogram bin-width or the KDE bandwidth. We encourage researchers to modify and explore the influence of these to guide careful analysis of the distributions and to prevent misinterpretations of KDE or histogram artifacts as biologically interesting features.
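
The influence of the KDE bandwidth, and of the boundary at KS = 0, can be explored with a few lines of code. The sketch below uses reflection at zero as the boundary correction; this is one common technique and an assumption here, since the text does not state which correction wgd applies. Function names and data are hypothetical.

```python
import math

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, data, bandwidth):
    """Plain Gaussian KDE; leaks probability mass below KS = 0."""
    total = sum(gaussian_kernel((x - xi) / bandwidth) for xi in data)
    return total / (len(data) * bandwidth)

def kde_reflected(x, data, bandwidth):
    """Gaussian KDE with the data reflected at the KS = 0 boundary."""
    if x < 0:
        return 0.0
    # kde(-x) equals the density contributed by the mirrored points -xi at x.
    return kde(x, data, bandwidth) + kde(-x, data, bandwidth)

def count_modes(density):
    """Number of interior local maxima on an evaluation grid."""
    return sum(1 for i in range(1, len(density) - 1)
               if density[i - 1] < density[i] > density[i + 1])

def integrate(values, step):
    """Trapezoidal rule on an equally spaced grid."""
    return step * (sum(values) - 0.5 * (values[0] + values[-1]))

# Hypothetical KS values with several young duplicates close to zero.
data = [0.02, 0.05, 0.08, 0.10, 0.15, 0.30, 0.75, 0.80, 0.85, 1.20]
step = 0.01
grid = [i * step for i in range(501)]  # 0.00 .. 5.00

for bandwidth in (0.03, 0.1, 0.3):
    density = [kde_reflected(x, data, bandwidth) for x in grid]
    print(bandwidth, count_modes(density), round(integrate(density, step), 4))

plain_mass = integrate([kde(x, data, 0.2) for x in grid], step)
reflected_mass = integrate([kde_reflected(x, data, 0.2) for x in grid], step)
```

With reflection the estimate integrates to approximately one over non-negative KS values, whereas the plain estimate loses the mass that falls below zero. Comparing the number of modes across bandwidths shows how easily apparent peaks appear or vanish with the smoothing level.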

3 Conclusion

We provide, to our knowledge, the first comprehensive toolkit for KS- and colinearity-based analysis of WGDs in an easy-to-use and freely available package named wgd. We hope that, besides being a useful tool for researchers, it will also help prevent common pitfalls and misinterpretations when analyzing putative WGDs in genomic data.

Funding

This work was supported by the European Union Seventh Framework Programme (FP7/2007-2013) under European Research Council Advanced Grant Agreement 322739-DOUBLEUP [to Y.V.d.P]; and a PhD Fellowship of the Research Foundation - Flanders (FWO) [to A.Z.].

Conflict of Interest: none declared.

9 in total

1. The evolutionary fate and consequences of duplicate genes.

Authors: M Lynch; J S Conery
Journal: Science Date: 2000-11-10 Impact factor: 47.728

2. Computational approaches to unveiling ancient genome duplications.

Authors: Yves Van de Peer
Journal: Nat Rev Genet Date: 2004-10 Impact factor: 53.242

3. The evolutionary significance of polyploidy.

Authors: Yves Van de Peer; Eshchar Mizrachi; Kathleen Marchal
Journal: Nat Rev Genet Date: 2017-05-15 Impact factor: 53.242

4. Inference of genome duplications from age distributions revisited.

Authors: Kevin Vanneste; Yves Van de Peer; Steven Maere
Journal: Mol Biol Evol Date: 2012-08-30 Impact factor: 16.240

5. Widespread genome duplications throughout the history of flowering plants.

Authors: Liying Cui; P Kerr Wall; James H Leebens-Mack; Bruce G Lindsay; Douglas E Soltis; Jeff J Doyle; Pamela S Soltis; John E Carlson; Kathiravetpilla Arumuganathan; Abdelali Barakat; Victor A Albert; Hong Ma; Claude W dePamphilis
Journal: Genome Res Date: 2006-05-15 Impact factor: 9.043

6. Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes.

Authors: Guillaume Blanc; Kenneth H Wolfe
Journal: Plant Cell Date: 2004-06-18 Impact factor: 11.277

7. i-ADHoRe 3.0--fast and sensitive detection of genomic homology in extremely large data sets.

Authors: Sebastian Proost; Jan Fostier; Dieter De Witte; Bart Dhoedt; Piet Demeester; Yves Van de Peer; Klaas Vandepoele
Journal: Nucleic Acids Res Date: 2011-11-18 Impact factor: 16.971

8. Analysis of 41 plant genomes supports a wave of successful genome duplications in association with the Cretaceous-Paleogene boundary.

Authors: Kevin Vanneste; Guy Baele; Steven Maere; Yves Van de Peer
Journal: Genome Res Date: 2014-05-16 Impact factor: 9.043

9. Assessing the Performance of Ks Plots for Detecting Ancient Whole Genome Duplications.

Authors: George P Tiley; Michael S Barker; J Gordon Burleigh
Journal: Genome Biol Evol Date: 2018-11-01 Impact factor: 3.416
