PGDD: a database of gene and genome duplication in plants.

Tae-Ho Lee, Haibao Tang, Xiyin Wang, Andrew H Paterson.

Abstract

Genome duplication (GD) has permanently shaped the architecture and function of many higher eukaryotic genomes. The angiosperms (flowering plants) are outstanding models in which to elucidate consequences of GD for higher eukaryotes, owing to their propensity for chromosomal duplication or even triplication in a few cases. Duplicated genome structures often require both intra- and inter-genome alignments to unravel their evolutionary history, also providing the means to deduce both obvious and otherwise-cryptic orthology, paralogy and other relationships among genes. The burgeoning sets of angiosperm genome sequences provide the foundation for a host of investigations into the functional and evolutionary consequences of gene and GD. To provide genome alignments from a single resource based on uniform standards that have been validated by empirical studies, we built the Plant Genome Duplication Database (PGDD; freely available at http://chibba.agtec.uga.edu/duplication/), a web service providing synteny information in terms of colinearity between chromosomes. At present, PGDD contains data for 26 plants including bryophytes and chlorophyta, as well as angiosperms with draft genome sequences. In addition to the inclusion of new genomes as they become available, we are preparing new functions to enhance PGDD.

Year: 2012 PMID: 23180799 PMCID: PMC3531184 DOI: 10.1093/nar/gks1104

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Most higher organisms pass through different ploidy levels at different stages of development (1,2) and continuously produce aberrant unreduced gametes at low rates. However, the extreme rarity of genome duplications (GDs) in the evolutionary history of extant lineages, occurring only once in many (sometimes hundreds of) millions of years, shows that the vast majority of GD events quickly go extinct. For the rare survivors, classical views suggest that GD is potentially advantageous as a primary source of genes with new (3,4) or modified functions (5). The angiosperms (flowering plants) are an outstanding model in which to elucidate consequences of GD in higher eukaryotes. Gene-order conservation in vertebrates is evident after hundreds of millions of years of divergence (6,7). However, the two major branches of the angiosperms (eudicots and monocots), estimated to have diverged 125–140 MYA (8) to 170–235 MYA (9), show much more rapid structural evolution, owing largely to their propensity for chromosomal duplication and subsequent gene loss (10), fragmenting ancestral linkage arrangements across multiple chromosomes (11–13). All angiosperm genomes published to date have shown evidence of paleopolyploidy (14). Although new data from yeast (15–17) and Paramecium (18) are shedding valuable light on consequences of GD in microbes, these consequences are expected to be very different in organisms with small effective population sizes such as angiosperms, mammals and other higher eukaryotes (19,20). For example, neofunctionalization is much more likely to occur in large populations, which contain more targets for mutations conferring new beneficial function. In contrast, subfunctionalization is improbable in large populations, as a partially subfunctionalized allele (the first step in the process) is more likely to be silenced by secondary mutations before reaching fixation by drift (19).
A host of investigations into the functional and evolutionary consequences of gene and GD may be empowered by genome alignments from a single resource based on uniform standards that have been validated by empirical studies. Algorithms commonly used in vertebrate genome alignments focus on identifying orthologous regions, because GDs are rare and ancient, and paralogous regions are often so diverged as to be unrecognizable. However, to reveal the consequences of the more recent and more frequent GDs in angiosperms and other taxa, identifying paralogous regions is of central importance, necessitating the use of multiple alignments, both within and among genomes. To tackle such problems, we implemented MCScan, a multiple gene-order alignment tool that better reflects the true relationships among angiosperm genomes, in which GDs are frequently superimposed on speciations (21). Further, to empower comparative and functional studies across (and potentially beyond) the burgeoning set of plant genome sequences available, we built the Plant Genome Duplication Database (PGDD), a web service providing synteny information in terms of gene colinearity, both within and between genomes. Besides PGDD, comparative genomic data are available from public databases such as CoGe (22), Phytozome (23), GreenPhylDB (24) and PLAZA (25). CoGe (22) provides comparative data across all species in any state of assembly by computation on the fly. While this allows greater flexibility for the user, non-specialists who are searching for a well-curated resource may find it cumbersome to use. For green plants, Phytozome (23) and GreenPhylDB (24) provide well-controlled micro-synteny and gene family evolution data, but they do not support macro-synteny data. PLAZA (25) provides fine macro-synteny data in plants, as well as micro-synteny and gene family data, as does PGDD.
However, the colinearity data in PGDD and PLAZA differ somewhat because, to identify colinear gene pairs, PLAZA adopted i-ADHoRe (26), whose power and precision differ from those of MCScan (27), which is used in PGDD. For the past 5 years, PGDD has provided data on syntenic relationships based on colinear blocks between plants and has contributed to research on topics such as the evolution of gene families (28–34), annotation (35–38) and polyploidy events (39–43). PGDD also provides an easily linked web resource that can be readily integrated into external informatics portals, including TAIR (44), the Legume Information System (45) and PopGenIE (46). In the past year alone, we have developed a new pipeline to promptly merge new genome data into the database and nearly tripled the number of genomes archived. At present, PGDD contains data for 26 plants including bryophytes and chlorophytes, as well as angiosperms (Table 1).

Table 1.

List of 26 plants currently served by PGDD

Species name	Common name	Release version	Gene number
Arabidopsis lyrata	Lyrate rockcress	Version 1.0 (April 2011)	32 670
Arabidopsis thaliana	Arabidopsis	TAIR 9.0 (June 2009)	27 379
Brachypodium distachyon	Purple false brome	Phytozome v6.0	32 255
Brassica rapa	Chinese cabbage	Version 1.1	22 285
Cajanus cajan	Pigeonpea	November 2011	48 680
Carica papaya	Papaya	December 2007	25 536
Chlamydomonas reinhardtii	Green algae	Version 4.2	16 036
Cucumis sativus	Cucumber	Phytozome v6.0	21 491
Fragaria vesca	Strawberry	December 2010	34 809
Glycine max	Soybean	Release 1 (December 2008)	66 153
Lotus japonicus	Lotus	Release 2.5	42 399
Malus × domestica	Apple	August 2010	57 386
Medicago truncatula	Barrel medic	Mt 3.5.1 (December 2010)	45 108
Musa acuminata	Banana	July 2012	36 542
Oryza sativa	Rice	RAP 2.0 (November 2007)	30 192
Physcomitrella patens	Moss	Version 1.6 (January 2008)	32 272
Prunus persica^a	Peach	Version 1.0	27 864
Populus trichocarpa	Western poplar	Phytozome 2.0 (February 2010)	45 778
Ricinus communis	Castor bean	Release 0.1 (May 2008)	38 613
Sorghum bicolor	Sorghum	Sbi 1.4 (December 2007)	34 496
Solanum lycopersicum	Tomato	Release 2.3	34 727
Solanum tuberosum	Potato	Version 3.4	39 031
Selaginella moellendorffii	Spikemoss	Version 1.0 (December 2007)	22 273
Theobroma cacao	Cacao	Release 0.9 (September 2010)	28 798
Vitis vinifera	Grape vine	Genoscope (August 2007)	26 346
Zea mays	Maize	Release 5a (November 2010)	32 540

^a Unpublished genome data temporarily restricted for downloading (in accordance with the understandings in the Fort Lauderdale meeting and NHGRI policy statement).


DATABASE CONSTRUCTION

Data source

At present, PGDD contains colinear block information within and between the genomes of 26 plants (Table 1), most recently updated to include the banana genome sequence published in August 2012 (47). Among them, 16 genomes were downloaded from the home pages of the institutes that led the sequencing of each genome, such as RAP-DB (Rice annotation project database; http://rapdb.dna.affrc.go.jp/) and BRAD (The Brassica database; http://brassicadb.org/brad/). Data for the remaining 10 plants, mostly sequenced by the US Department of Energy Joint Genome Institute, were downloaded from the Phytozome database (23). To build PGDD data, three types of file are used: a coding DNA sequence file, a protein sequence file and a general feature format (GFF) file containing annotation data for the sequences on chromosomes.

Pipeline to analyse and add the new genome data

There are four major steps to add a new genome into PGDD (Figure 1A) in a pipeline consisting of 18 scripts. In the first step, scripts determine basic information such as the length of chromosomes and prepare data files. For example, one script extracts gene information from a GFF file and writes a browser extensible data (BED) file so that gene loci can be determined simply. In the second step, similar protein pairs between two plants are determined by BLASTP with an e-value cut-off of 1e–5. The colinear blocks between plants are determined in the third step: with the BED file containing loci information and the file containing pairs of similar proteins created in the second step, colinear blocks between the plants are determined by MCScan (27). In the post-processing step, additional data are calculated and determined. For example, Ks values between pairs of orthologous/paralogous genes are determined by Clustal W (48), PAL2NAL (49) and the yn00 program of the PAML package (50) in this step. Additionally, text files containing all information about colinear blocks are created. Finally, all new block information included in the text files is imported into MySQL, and parameters and contents in PGDD web pages are modified for the new data by scripts.
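The first two steps can be illustrated with a minimal Python sketch. It is not one of the 18 in-house scripts: the function names, the use of the GFF `ID` attribute as the gene name, the assumption that BLASTP was run with tabular output (12 tab-separated columns, e-value in column 11) and the option to drop self-hits are all illustrative assumptions.

```python
def gff_to_bed(gff_lines, feature_type="gene", id_key="ID"):
    """Yield BED-style rows (chrom, start, end, name, score, strand) for gene features.

    Hypothetical helper: assumes GFF3-style 'key=value;key=value' attributes.
    """
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/pragma lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue  # keep only well-formed rows of the requested feature type
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        name = attrs.get(id_key)
        if name is None:
            continue
        # GFF coordinates are 1-based and inclusive; BED is 0-based, half-open.
        yield (cols[0], int(cols[3]) - 1, int(cols[4]), name, 0, cols[6])


def filter_blast_hits(tabular_lines, max_evalue=1e-5, drop_self=True):
    """Yield (query, subject, evalue) for tabular BLAST hits at or below the cut-off.

    Dropping self-hits is an assumption that is convenient for within-genome
    comparisons; the source text does not state how they are handled.
    """
    for line in tabular_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 12:
            continue  # not a standard 12-column tabular row
        evalue = float(cols[10])
        if evalue > max_evalue:
            continue
        if drop_self and cols[0] == cols[1]:
            continue
        yield (cols[0], cols[1], evalue)
```

The BED rows and the filtered protein pairs correspond to the two kinds of input that the text describes for the colinearity step.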

Figure 1.

Diagram of current PGDD server. (A) Diagram of pipeline to update PGDD with new genome data (in blue box). The boxes in the diagram represent four major steps of the pipeline, consisting of 18 in-house scripts. Insets in some boxes contain the name of a major program in the process. (B) Layers diagram of PGDD structure (in green polygon).

Implementation of database

All scripts, including the components of the pipeline that adds new genome data, were developed in the Python programming language (http://www.python.org) and in Bash (http://www.gnu.org/software/bash/). Because of the huge number of calculations required to determine colinear blocks, MCScan was developed in C++, which offers good run-time performance. Python was also used as the server-side web programming language: the server-side Python scripts are run by mod_python (http://www.modpython.org/) in the Apache HTTP server (http://httpd.apache.org/) environment (Figure 1B). To draw plots in Python, matplotlib (http://matplotlib.sourceforge.net/) was mainly used. As a client-side web programming language, we adopted JavaScript because most web browsers support this language well. However, to overcome problems caused by differences between browsers, jQuery (http://www.jquery.com) was used as the JavaScript library. All colinear blocks and related data provided by PGDD are stored in a MySQL database (http://www.mysql.com). There are three major tables in the database: block, locus and chromosome. The block table contains lists of gene pairs with additional information such as colinear block number and Ks value, whereas the locus table contains information about each locus, such as the functional description determined by BLAST against the non-redundant GenBank database and the position of the locus on a chromosome. Information for each chromosome, such as the number of genes on the chromosome, is stored in the chromosome table. Besides MySQL, we maintain up-to-date protein sequences stored as a BLAST database file for use in the BLAST search function of PGDD.
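To make the three-table layout concrete, the following Python sketch builds a toy version of it. It is an illustration under stated assumptions, not the actual PGDD schema: only the table names (block, locus, chromosome) and the kinds of information they hold come from the text, whereas every column name is hypothetical, and an in-memory SQLite database stands in for MySQL so that the example is self-contained.

```python
import sqlite3

# Hypothetical columns; only the three table names are taken from the text.
SCHEMA = """
CREATE TABLE chromosome (
    species    TEXT NOT NULL,
    name       TEXT NOT NULL,
    gene_count INTEGER,
    PRIMARY KEY (species, name)
);
CREATE TABLE locus (
    locus_id    TEXT PRIMARY KEY,
    species     TEXT NOT NULL,
    chromosome  TEXT NOT NULL,
    start_pos   INTEGER,
    end_pos     INTEGER,
    description TEXT
);
CREATE TABLE block (
    block_id INTEGER NOT NULL,
    locus_a  TEXT NOT NULL,
    locus_b  TEXT NOT NULL,
    ks       REAL
);
"""


def build_demo_db():
    """Create an empty in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def blocks_containing(conn, locus_id):
    """Return the sorted block numbers in which a locus appears on either side."""
    rows = conn.execute(
        "SELECT DISTINCT block_id FROM block "
        "WHERE locus_a = ? OR locus_b = ? ORDER BY block_id",
        (locus_id, locus_id),
    )
    return [row[0] for row in rows]


def pairs_in_block(conn, block_id):
    """Return (locus_a, locus_b, ks) for every gene pair in one block, in insertion order."""
    rows = conn.execute(
        "SELECT locus_a, locus_b, ks FROM block WHERE block_id = ? ORDER BY rowid",
        (block_id,),
    )
    return [tuple(row) for row in rows]
```

Storing one row per gene pair, tagged with a block number, means that finding every block that contains a given locus is a single indexed lookup, which is the kind of query a locus-based search needs.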

WEB INTERFACES AND USAGE

The home page and major functions provided by PGDD

At the home page, PGDD shows a table containing information about all plants in the current version, including the name of a plant, version of genome used, number of genes, original URL to download the data and primary citation for the genome (Figure 2A). Additionally, the table provides related web links, such as taxonomy information at NCBI, so that users can easily get related information for each plant.

Figure 2.

Homepage and examples of three major functions of PGDD. (A) The homepage of PGDD and functions supported by the database. (B) Web page of the Dot-plot function and a plot applying a Ks filter of 0.4–0.7 between rice and sorghum as an example, representing colinear blocks between the plants. (C) Example of a Locus-search result for the AT1G25460 locus in Arabidopsis. A blue line in the alignment image represents the same orientation of paired genes, whereas a red line represents opposite orientations of the genes. (D) Example of the Map-view function. The grey vertical bars represent chromosomes, and green arrows on the bars represent the positions of loci that are similar to the input sequence. The detailed loci information page (inset) shows the protein and nucleotide sequences of the gene at the locus and a description of the gene.

There are three major functions to show gene colinearity: Dot-plot, Locus-search and Map-view. These three functions provide means to visualize macro-synteny, micro-synteny and gene family evolution, respectively, which are among the most commonly needed types of information in comparative genomics research. The main menu to select a web page corresponding to each function is to the right of the table. In addition to the major functions, in the download page, a file containing colinear block information within a plant or between any 2 of the 26 plants can be downloaded.

Dot-plot module to show overall view of colinear blocks

Dot plots are used to show colinear blocks between two plant genomes in macro-scale as a two-dimensional image, so researchers can see the overall view of all blocks. For example, the dot plot in Figure 2B shows overall colinear blocks between rice and sorghum including both the orthologous regions and matching regions derived from a shared pan-cereal duplication event (ρ) (51). Each point represents a matched gene pair. Interpretation of a dot plot is not always straightforward because diverse events in evolutionary history are overlaid onto the same plot. Thus, many options are provided to modify the plot through filtering subsets of gene pairs, e.g. to show only a narrow range of synonymous substitutions (Ks values) of gene pairs as a proxy to separate the gene pairs by age. Using rice-sorghum as an example, applying a Ks filter of 0.4–0.7 renders signal from the orthologous gene pairs more prominent on the dot plot. Additionally, an enlarged dot plot between specific chromosomes is available by clicking each small box for each chromosome in the genome-wide plot. Besides the dot plot, users can see the list of gene pairs in colinear blocks by clicking on the segment in the enlarged plot. With the list containing both name and inferred functions of the genes, users can compare colinear blocks with single-gene resolution.
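The Ks filter described above amounts to keeping only the gene pairs whose Ks value falls inside a chosen window. A minimal Python sketch follows; the `GenePair` type and function name are hypothetical, and treating both bounds as inclusive and dropping pairs without a Ks estimate are assumptions rather than documented PGDD behaviour.

```python
from typing import List, NamedTuple, Optional


class GenePair(NamedTuple):
    """One matched gene pair from a colinear block (hypothetical record type)."""
    gene_a: str
    gene_b: str
    ks: Optional[float]  # None when no Ks estimate is available


def filter_pairs_by_ks(
    pairs: List[GenePair], ks_min: float = 0.4, ks_max: float = 0.7
) -> List[GenePair]:
    """Keep pairs with ks_min <= Ks <= ks_max; pairs lacking a Ks value are dropped."""
    return [
        pair
        for pair in pairs
        if pair.ks is not None and ks_min <= pair.ks <= ks_max
    ]
```

The default window matches the rice-sorghum example in the text; a different window would be chosen to emphasise pairs of a different age.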

Locus-search module to search locus in the database by name

There have been many gene-level studies, such as comparisons of a few genes included in colinear blocks (28–38). Locus-search is a function to find a colinear block containing a specific locus (Figure 2C) and to show the fine structure of the colinear block. By typing the locus name in the text box and clicking the ‘submit’ button, a user can search for colinear blocks containing the locus. Locus-search results can be divided into two parts: an alignment image and a list of genes in the image. In the alignment image, PGDD shows genes in colinear blocks, so users can easily determine gene-level changes such as insertion and deletion of genes. The list of genes below each alignment image shows not only the inferred function of each gene but also the Ks and Ka values of the gene pair. Thus, the user can easily assess evolutionary distance and possible changes in function between genes.

Map-view module to map locus on a chromosome

In many cases, a researcher seeks information about a locus with only a nucleotide or a protein sequence, without additional information such as the locus name. A typical BLAST search returns a list of likely homologues in the target genomes but lacks a global view of how the hits are distributed. To support such cases, PGDD provides the Map-view function. In the corresponding web page, users can search for a locus in PGDD by similarity with a nucleotide or protein sequence (Figure 2D). The page contains a text box to type or paste a sequence, buttons to choose a BLAST program depending on the sequence type and a text box to set the e-value cut-off. The search result can be divided into two parts: a list of names of loci that are similar to the user's input sequence and an image showing the positions of those loci on chromosomes. In the image, each grey vertical bar represents a chromosome, and each green arrow shows the position of a locus that is similar to the sequence. The user can see detailed information for a locus and the alignment image of its colinear blocks by clicking a blue locus name in the list of locus names above the image. In the detailed information page, the user can get the protein and nucleotide sequences of genes at the locus, as well as descriptions of the genes.

Download colinear block data

The user can download a file containing colinear block information between two plants by choosing the two plants in the combo box and clicking the ‘download’ button. To decrease file size, the file is compressed by gzip, a popular file format that can easily be decompressed by many widely used programs. The file is written in comma-separated values (CSV) format and can be read and handled by most spreadsheet programs such as Microsoft Excel and Calc in LibreOffice (http://www.libreoffice.org/). The file contains not only gene pairs in colinear blocks but also additional data such as Ka and Ks values of the pairs.
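A downloaded file can also be read programmatically without decompressing it first. The Python sketch below assumes that the CSV has a header row and that the Ka and Ks columns are named `ka` and `ks`; these names and the file name in the usage note are hypothetical, so the header of the actual file should be checked before use.

```python
import csv
import gzip


def read_block_pairs(path, numeric_fields=("ka", "ks")):
    """Read a gzip-compressed CSV of gene pairs into a list of dicts.

    Values in `numeric_fields` are converted to float; empty or non-numeric
    entries become None. The column names are assumptions, not the
    documented layout of the PGDD download.
    """
    rows = []
    with gzip.open(path, mode="rt", newline="") as handle:
        for row in csv.DictReader(handle):
            for field in numeric_fields:
                if field not in row:
                    continue
                text = (row[field] or "").strip()
                try:
                    row[field] = float(text) if text else None
                except ValueError:
                    row[field] = None  # e.g. a placeholder such as "NA"
            rows.append(row)
    return rows
```

For instance, `read_block_pairs("rice_sorghum.csv.gz")` (a hypothetical file name) would return one dictionary per gene pair, ready for filtering by Ks or for loading into another tool.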

CONCLUSIONS

To facilitate investigations into the functional and evolutionary consequences of gene and GD, we have determined and provided colinear blocks in plants from a single resource based on uniform standards. Many programs have been developed to determine colinear blocks, with different sensitivities and specificities in colinear block prediction (26,27). Among them, the current version of PGDD uses MCScan, which shows consistent, high-accuracy prediction (27). PGDD has provided data used in much research (28–43,52–54) over the past 5 years, and in the past year alone, PGDD has been used by researchers from 111 countries, with a total of 713 254 accession logs. While continually adding new genome data to PGDD, we are also preparing new functions to enhance PGDD. For example, at present, users can access data in PGDD only by connecting to the web site or by downloading colinear block data files. To allow other web services or programs to access PGDD data via the internet, we plan to add OpenAPI functions that enable web sites to interact with each other and to build RESTful web services that make the data easily accessible to clients over HTTP. Besides developing the OpenAPI and RESTful web services, we plan to develop interfaces to link multiple data sources such as the VISTA (55) suite of programs and databases for multi-way analysis of genomic sequences, and CoGe (22), a web application to display the homologous regions across multiple genomes. Hence, new functions and integration of multiple data sources are intended to further enhance PGDD as a platform to study many evolutionary questions.

FUNDING

A.H.P. appreciates funding from the National Science Foundation [NSF: DBI 0849896, MCB 0821096, MCB 1021718]; Resources and technical expertise from the University of Georgia (in part); Georgia Advanced Computing Resource Center, a partnership between the Office of the Vice President for Research and the Office of the Chief Information Officer. Funding for open access charge: NSF. Conflict of interest statement. None declared.

52 in total

1. Widespread aneuploidy revealed by DNA microarray expression profiling.

Authors: T R Hughes; C J Roberts; H Dai; A R Jones; M R Meyer; D Slade; J Burchard; S Dow; T R Ward; M J Kidd; S H Friend; M J Marton
Journal: Nat Genet Date: 2000-07 Impact factor: 38.330

2. SSR-based genetic maps of Miscanthus sinensis and M. sacchariflorus, and their comparison to sorghum.

Authors: Changsoo Kim; Dong Zhang; Susan A Auckland; Lisa K Rainville; Katrin Jakob; Brent Kronmiller; Erik J Sacks; Martin Deuter; Andrew H Paterson
Journal: Theor Appl Genet Date: 2012-01-25 Impact factor: 5.699

3. Angiosperm genome comparisons reveal early polyploidy in the monocot lineage.

Authors: Haibao Tang; John E Bowers; Xiyin Wang; Andrew H Paterson
Journal: Proc Natl Acad Sci U S A Date: 2009-12-04 Impact factor: 11.205

4. Molecular evolution of glycinin and β-conglycinin gene families in soybean (Glycine max L. Merr.).

Authors: C Li; Y-M Zhang
Journal: Heredity (Edinb) Date: 2010-07-28 Impact factor: 3.821

5. A chloroplastic UDP-glucose pyrophosphorylase from Arabidopsis is the committed enzyme for the first step of sulfolipid biosynthesis.

Authors: Yozo Okazaki; Mie Shimojima; Yuji Sawada; Kiminori Toyooka; Tomoko Narisawa; Keiichi Mochida; Hironori Tanaka; Fumio Matsuda; Akiko Hirai; Masami Yokota Hirai; Hiroyuki Ohta; Kazuki Saito
Journal: Plant Cell Date: 2009-03-13 Impact factor: 11.277

6. Impact of clock-associated Arabidopsis pseudo-response regulators in metabolic coordination.

Authors: Atsushi Fukushima; Miyako Kusano; Norihito Nakamichi; Makoto Kobayashi; Naomi Hayashi; Hitoshi Sakakibara; Takeshi Mizuno; Kazuki Saito
Journal: Proc Natl Acad Sci U S A Date: 2009-04-09 Impact factor: 11.205

7. Darwin's abominable mystery: Insights from a supertree of the angiosperms.

Authors: T Jonathan Davies; Timothy G Barraclough; Mark W Chase; Pamela S Soltis; Douglas E Soltis; Vincent Savolainen
Journal: Proc Natl Acad Sci U S A Date: 2004-02-06 Impact factor: 11.205

8. The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools.

Authors: Philippe Lamesch; Tanya Z Berardini; Donghui Li; David Swarbreck; Christopher Wilks; Rajkumar Sasidharan; Robert Muller; Kate Dreher; Debbie L Alexander; Margarita Garcia-Hernandez; Athikkattuvalasu S Karthikeyan; Cynthia H Lee; William D Nelson; Larry Ploetz; Shanker Singh; April Wensel; Eva Huala
Journal: Nucleic Acids Res Date: 2011-12-02 Impact factor: 16.971

9. i-ADHoRe 3.0--fast and sensitive detection of genomic homology in extremely large data sets.

Authors: Sebastian Proost; Jan Fostier; Dieter De Witte; Bart Dhoedt; Piet Demeester; Yves Van de Peer; Klaas Vandepoele
Journal: Nucleic Acids Res Date: 2011-11-18 Impact factor: 16.971

10. Phylogenetic analysis, structural evolution and functional divergence of the 12-oxo-phytodienoate acid reductase gene family in plants.

Authors: Wenyan Li; Bing Liu; Lujun Yu; Dongru Feng; Hongbin Wang; Jinfa Wang
Journal: BMC Evol Biol Date: 2009-05-05 Impact factor: 3.260
